2014-11-24 04:14:18

by Kent Overstreet

[permalink] [raw]
Subject: Block layer projects that I haven't had time for

Since I'm starting to resign myself to the fact that I'm probably not going to
have much time for upstream development again any time soon, I figured maybe I
should try writing down all the things I was working on or planning on working
on in case someone else is feeling ambitious and looking for things to work on.

If anyone wants to take up any of this stuff, feel free to take my half baked
code and do whatever you want with it, or ping me for ideas/guidance.

- immutable biovecs went in, but what this was leading up to was making
generic_make_request() accept arbitrary size bios, and pushing the splitting
down to the drivers or wherever it's required.

This is a performance win, and a big reduction in complexity and allows a lot
of code to be deleted. The performance win is because bio_add_page() doesn't
have to check anything except "does this page fit in the current bio" -
checking queue limits is like multiple cache misses. That stuff isn't checked
until the driver level - when the relevant stuff is going to be in cache
anyways - and usually bios won't have to be split. If they do have to be
split, it's quite cheap now.

I actually benchmarked the impact of this with fio on a micron p320h, it's
definitely a measurable impact.

It's also the last thing needed for the dio rewrite I was working on (god,
who knows when I'll have time for _that_, the code is mostly done :/) - and
the performance impact of that is _very_ significant.

- making generic_make_request() take arbitrary size bios means we can delete
merge_bvec_fn, which deletes over 1k loc. This is done in my tree, needs
rebasing and testing.

- kill bio->bi_cnt

I added bi_remaning and bio_chain() awhile back - but now we have two atomic
refcounts in struct bio and really we don't need both, bi_remaining is more
general.

If you grep there aren't that many uses of bio_get(), most of them are
straightforward to get rid of but there were one or two tricky ones. Don't
remember which ones, though.

- plugging

that code in generic_make_request() that turns recursion into iteration - if
you squint, what's really going on is that it's another plugging
implementation.

What I'd like to do (only started playing with this) is rework the existing
plugging to work in terms of bios, not requests - I think this would simplify
things, and would allow non request based drivers to take advantage of
plugging (it'd be useful for icache if nothing else).

Then, replace the open coded plugging in generic_make_request() with a normal
plug, and in the scheduler hook (where right now we would recurse and
potentially blow the stack if we did this) - check the current stack usage,
and if it's over some threshold punt the bios to per request queue
workqueues.

If anyone remembers the hack I added to bio_alloc_bioset() awhile back (where
if we're about to block on allocating from the mempool, we punt any bios
stranded on current->bio_list to workqueues - so as to avoid deadlocking) -
this would actually replace that hack.

- multipage bvecs

I did a lot of the work to implement this _ages_ ago, it turns out to not be
that bad it terms of amount of code that has to be changed. The trick is, we
just add a new bio_for_each_page() macro - analagous to
bio_for_each_segment() - that iterates over each page in a bvec separately;
that way we don't have to modify all the code that expects bios to contain
single pages.

One of the reasons this is nice is because we can move segment merging up to
bio_add_page(). Conceptually, right now we're breaking an IO up into single
page segments to submit it in only for the lower layers to undo that work,
and merge the segments back together. It's a lot simpler to just submit IOs
with segments already merged; this does mean that a driver (when it calls
blk_bio_map_sg()) will potentially have to split segments that are too big
for the device limits, but remember we want to push bio splitting down to the
driver anyways so this is actually completely trivial - the model is just
that the driver incrementally consumes the bio/request.

This is nice for the upper layers in small ways too, and might help to enable
other changes we want but I have only a hazy idea of what those might be.

- my dio rewrite, if anyone is feeling really ambitious

If anyone wants to take a look at my (mostly mostly quite messy, and out of
date) in progress work - it's in a branch:

http://evilpiepirate.org/git/linux-bcache.git block_stuff


2014-12-04 11:00:35

by Dongsu Park

[permalink] [raw]
Subject: Re: Block layer projects that I haven't had time for

Hi Kent,

On 23.11.2014 20:16, Kent Overstreet wrote:
> Since I'm starting to resign myself to the fact that I'm probably not going to
> have much time for upstream development again any time soon, I figured maybe I
> should try writing down all the things I was working on or planning on working
> on in case someone else is feeling ambitious and looking for things to work on.
>
> If anyone wants to take up any of this stuff, feel free to take my half baked
> code and do whatever you want with it, or ping me for ideas/guidance.

I'm interested in taking up your work and further implement it.
IMHO the 1st and 2nd items, making generic_make_request() take arbitrarily
sized bios, are the essential ones. With those changes, individual block
drivers wouldn't have to define ->merge_bvec_fn() any more.

Playing a little with your block_stuff tree based on 3.15, however,
I think there still seems to be a couple of issues.
First of all, it doesn't work with virtio-blk. A testing Qemu VM panics
at the very early stage of booting. This issue should be addressed as
the first step, so that other parts can be tested.

Moreover, I've already tried to rebase these patches on top of current
mainline, 3.18-rc7. It's now compilable, but it seems to introduce
more bugs about direct-IO. I didn't manage to find out the reason.
I'd need to also look at the previous review comments in [1], [2].

Don't you have other trees based on top of 3.17 or higher?
If not, can I create my own tree based on 3.18-rc7 to publish?

Thanks,
Dongsu

[1] https://lkml.org/lkml/2013/11/25/732
[2] https://lkml.org/lkml/2014/2/26/618

> - immutable biovecs went in, but what this was leading up to was making
> generic_make_request() accept arbitrary size bios, and pushing the splitting
> down to the drivers or wherever it's required.
>
> This is a performance win, and a big reduction in complexity and allows a lot
> of code to be deleted. The performance win is because bio_add_page() doesn't
> have to check anything except "does this page fit in the current bio" -
> checking queue limits is like multiple cache misses. That stuff isn't checked
> until the driver level - when the relevant stuff is going to be in cache
> anyways - and usually bios won't have to be split. If they do have to be
> split, it's quite cheap now.
>
> I actually benchmarked the impact of this with fio on a micron p320h, it's
> definitely a measurable impact.
>
> It's also the last thing needed for the dio rewrite I was working on (god,
> who knows when I'll have time for _that_, the code is mostly done :/) - and
> the performance impact of that is _very_ significant.
>
> - making generic_make_request() take arbitrary size bios means we can delete
> merge_bvec_fn, which deletes over 1k loc. This is done in my tree, needs
> rebasing and testing.
>
> - kill bio->bi_cnt
>
> I added bi_remaning and bio_chain() awhile back - but now we have two atomic
> refcounts in struct bio and really we don't need both, bi_remaining is more
> general.
>
> If you grep there aren't that many uses of bio_get(), most of them are
> straightforward to get rid of but there were one or two tricky ones. Don't
> remember which ones, though.
>
> - plugging
>
> that code in generic_make_request() that turns recursion into iteration - if
> you squint, what's really going on is that it's another plugging
> implementation.
>
> What I'd like to do (only started playing with this) is rework the existing
> plugging to work in terms of bios, not requests - I think this would simplify
> things, and would allow non request based drivers to take advantage of
> plugging (it'd be useful for icache if nothing else).
>
> Then, replace the open coded plugging in generic_make_request() with a normal
> plug, and in the scheduler hook (where right now we would recurse and
> potentially blow the stack if we did this) - check the current stack usage,
> and if it's over some threshold punt the bios to per request queue
> workqueues.
>
> If anyone remembers the hack I added to bio_alloc_bioset() awhile back (where
> if we're about to block on allocating from the mempool, we punt any bios
> stranded on current->bio_list to workqueues - so as to avoid deadlocking) -
> this would actually replace that hack.
>
> - multipage bvecs
>
> I did a lot of the work to implement this _ages_ ago, it turns out to not be
> that bad it terms of amount of code that has to be changed. The trick is, we
> just add a new bio_for_each_page() macro - analagous to
> bio_for_each_segment() - that iterates over each page in a bvec separately;
> that way we don't have to modify all the code that expects bios to contain
> single pages.
>
> One of the reasons this is nice is because we can move segment merging up to
> bio_add_page(). Conceptually, right now we're breaking an IO up into single
> page segments to submit it in only for the lower layers to undo that work,
> and merge the segments back together. It's a lot simpler to just submit IOs
> with segments already merged; this does mean that a driver (when it calls
> blk_bio_map_sg()) will potentially have to split segments that are too big
> for the device limits, but remember we want to push bio splitting down to the
> driver anyways so this is actually completely trivial - the model is just
> that the driver incrementally consumes the bio/request.
>
> This is nice for the upper layers in small ways too, and might help to enable
> other changes we want but I have only a hazy idea of what those might be.
>
> - my dio rewrite, if anyone is feeling really ambitious
>
> If anyone wants to take a look at my (mostly mostly quite messy, and out of
> date) in progress work - it's in a branch:
>
> http://evilpiepirate.org/git/linux-bcache.git block_stuff
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2014-12-06 02:59:48

by Kent Overstreet

[permalink] [raw]
Subject: Re: Block layer projects that I haven't had time for

On Thu, Dec 04, 2014 at 12:00:27PM +0100, Dongsu Park wrote:
> Hi Kent,
>
> On 23.11.2014 20:16, Kent Overstreet wrote:
> > Since I'm starting to resign myself to the fact that I'm probably not going to
> > have much time for upstream development again any time soon, I figured maybe I
> > should try writing down all the things I was working on or planning on working
> > on in case someone else is feeling ambitious and looking for things to work on.
> >
> > If anyone wants to take up any of this stuff, feel free to take my half baked
> > code and do whatever you want with it, or ping me for ideas/guidance.
>
> I'm interested in taking up your work and further implement it.
> IMHO the 1st and 2nd items, making generic_make_request() take arbitrarily
> sized bios, are the essential ones. With those changes, individual block
> drivers wouldn't have to define ->merge_bvec_fn() any more.

Cool! Yeah, things get a lot simpler for a lot of code.

> Playing a little with your block_stuff tree based on 3.15, however,
> I think there still seems to be a couple of issues.
> First of all, it doesn't work with virtio-blk. A testing Qemu VM panics
> at the very early stage of booting. This issue should be addressed as
> the first step, so that other parts can be tested.

Really? I was testing with virtio-blk, that's odd..

> Moreover, I've already tried to rebase these patches on top of current
> mainline, 3.18-rc7. It's now compilable, but it seems to introduce
> more bugs about direct-IO. I didn't manage to find out the reason.
> I'd need to also look at the previous review comments in [1], [2].
>
> Don't you have other trees based on top of 3.17 or higher?
> If not, can I create my own tree based on 3.18-rc7 to publish?

Yeah, I'd post what you have now and I'll try and take a look.

2014-12-08 11:48:20

by Dongsu Park

[permalink] [raw]
Subject: Re: Block layer projects that I haven't had time for

Thanks for the reply.

On 05.12.2014 19:02, Kent Overstreet wrote:
> On Thu, Dec 04, 2014 at 12:00:27PM +0100, Dongsu Park wrote:
> > Playing a little with your block_stuff tree based on 3.15, however,
> > I think there still seems to be a couple of issues.
> > First of all, it doesn't work with virtio-blk. A testing Qemu VM panics
> > at the very early stage of booting. This issue should be addressed as
> > the first step, so that other parts can be tested.
>
> Really? I was testing with virtio-blk, that's odd..

The culprit seems to be the plugging commit.
Before that change, it works well also with virtio-blk.
Though that's not the only issue...

> > Moreover, I've already tried to rebase these patches on top of current
> > mainline, 3.18-rc7. It's now compilable, but it seems to introduce
> > more bugs about direct-IO. I didn't manage to find out the reason.
> > I'd need to also look at the previous review comments in [1], [2].
> >
> > Don't you have other trees based on top of 3.17 or higher?
> > If not, can I create my own tree based on 3.18-rc7 to publish?
>
> Yeah, I'd post what you have now and I'll try and take a look.

I've created a git tree to include what I have right now.
Please see <https://github.com/dongsupark/linux>.

To be able to handle different issues one by one,
I got the entire tree separated out into 4 branches based on 3.18.

* block-generic-req-for-next : the most stable branch you can test with.
With this branch, you can test most of block drivers as well as file
systems with less critical bugs. Though it's not 100% perfect yet,
e.g. btrfs doesn't seem to work quite well. Thus more tests are needed.

* block-mpage-bvecs-for-next : block-generic-req-for-next + multipage bvecs.
This branch shows a critical issue that writing blocks to ext4 rootfs
causes the whole system to crash. Need-to-investigate.

* block-plug-for-next: block-mpage-bvecs-for-next + plugging.
This branch has an additional bug with virtio-blk, that the kernel
panics at the very early stage of booting. Need-to-investigate.

* block-dio-rewrite-for-next: block-plug-for-next + dio-rewriting.
This branch has more issues w.r.t. direct-io. For example, dio_init()
causes the kernel to panic at the early stage of booting.

All branches are compilable. But it's still somehow half-way complete.
Commit messages should be properly written too, so that they can be posted
to mailing lists.

Regards,
Dongsu

2014-12-10 22:47:53

by Ming Lin

[permalink] [raw]
Subject: Re: Block layer projects that I haven't had time for

On Mon, Dec 8, 2014 at 3:48 AM, Dongsu Park
<[email protected]> wrote:
> Thanks for the reply.
>
> On 05.12.2014 19:02, Kent Overstreet wrote:
>> On Thu, Dec 04, 2014 at 12:00:27PM +0100, Dongsu Park wrote:
>> > Playing a little with your block_stuff tree based on 3.15, however,
>> > I think there still seems to be a couple of issues.
>> > First of all, it doesn't work with virtio-blk. A testing Qemu VM panics
>> > at the very early stage of booting. This issue should be addressed as
>> > the first step, so that other parts can be tested.
>>
>> Really? I was testing with virtio-blk, that's odd..
>
> The culprit seems to be the plugging commit.
> Before that change, it works well also with virtio-blk.
> Though that's not the only issue...
>
>> > Moreover, I've already tried to rebase these patches on top of current
>> > mainline, 3.18-rc7. It's now compilable, but it seems to introduce
>> > more bugs about direct-IO. I didn't manage to find out the reason.
>> > I'd need to also look at the previous review comments in [1], [2].
>> >
>> > Don't you have other trees based on top of 3.17 or higher?
>> > If not, can I create my own tree based on 3.18-rc7 to publish?
>>
>> Yeah, I'd post what you have now and I'll try and take a look.
>
> I've created a git tree to include what I have right now.
> Please see <https://github.com/dongsupark/linux>.
>
> To be able to handle different issues one by one,
> I got the entire tree separated out into 4 branches based on 3.18.
>
> * block-generic-req-for-next : the most stable branch you can test with.
> With this branch, you can test most of block drivers as well as file
> systems with less critical bugs. Though it's not 100% perfect yet,
> e.g. btrfs doesn't seem to work quite well. Thus more tests are needed.
>
> * block-mpage-bvecs-for-next : block-generic-req-for-next + multipage bvecs.
> This branch shows a critical issue that writing blocks to ext4 rootfs
> causes the whole system to crash. Need-to-investigate.

I tried block-mpage-bvecs-for-next branch on qemu-kvm with ext4 rootfs.
Run "sync" will stuck in kernel.

[ 480.751901] INFO: task sync:4424 blocked for more than 120 seconds.
[ 480.753064] Not tainted 3.18.0-00025-g46c8231 #39
[ 480.753720] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 480.754737] sync D ffff88001fc11180 0 4424 4338 0x00000000
[ 480.755719] ffff88001cdfbc98 0000000000000086 ffff88001cdfbba8
ffff880014adefc0
[ 480.756810] 0000000000011180 0000000000004000 ffffffff81813460
ffff880014adefc0
[ 480.758102] ffff88001cdfbbc8 ffffffff812f08be ffff88001cdfbc18
ffff880014adf028
[ 480.759454] Call Trace:
[ 480.759852] [<ffffffff812f08be>] ? debug_smp_processor_id+0x17/0x19
[ 480.760609] [<ffffffff8106093e>] ? __enqueue_entity+0x69/0x6b
[ 480.761318] [<ffffffff8106017e>] ? __dequeue_entity+0x33/0x38
[ 480.762026] [<ffffffff810601ab>] ? set_next_entity+0x28/0x7d
[ 480.762739] [<ffffffff8105a4fb>] ? get_parent_ip+0xf/0x3f
[ 480.763425] [<ffffffff8108562b>] ? ktime_get+0x50/0x8f
[ 480.763848] [<ffffffff8148abdb>] ? bit_wait_timeout+0x60/0x60
[ 480.764555] [<ffffffff8148a6be>] schedule+0x6a/0x6c
[ 480.765186] [<ffffffff8148a74f>] io_schedule+0x8f/0xcd
[ 480.765841] [<ffffffff8148ac19>] bit_wait_io+0x3e/0x42
[ 480.766493] [<ffffffff8148ae80>] __wait_on_bit+0x4d/0x86
[ 480.767183] [<ffffffff810d4302>] ? find_get_pages_tag+0x106/0x133
[ 480.767847] [<ffffffff810d4a63>] wait_on_page_bit+0x76/0x78
[ 480.768532] [<ffffffff8106ab59>] ? wake_atomic_t_function+0x2d/0x2d
[ 480.769262] [<ffffffff810d511f>] filemap_fdatawait_range+0x7e/0x11d
[ 480.769992] [<ffffffff8148a639>] ? preempt_schedule+0x36/0x51
[ 480.770677] [<ffffffff8105a4fb>] ? get_parent_ip+0xf/0x3f
[ 480.771848] [<ffffffff810d51df>] filemap_fdatawait+0x21/0x23
[ 480.772530] [<ffffffff811458ce>] sync_inodes_sb+0x158/0x1aa
[ 480.773201] [<ffffffff81480303>] ? br_mdb_dump+0x225/0x495
[ 480.773885] [<ffffffff81149ad8>] ? fdatawrite_one_bdev+0x18/0x18
[ 480.774592] [<ffffffff81149aec>] sync_inodes_one_sb+0x14/0x16
[ 480.775278] [<ffffffff81125937>] iterate_supers+0x6f/0xc4
[ 480.775847] [<ffffffff81149bf4>] sys_sync+0x35/0x83
[ 480.776460] [<ffffffff8148da52>] system_call_fastpath+0x12/0x17


Here is a quick hack.

diff --git a/block/bio.c b/block/bio.c
index 4020ccc..fbc7108 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -829,6 +829,11 @@ int bio_add_page(struct bio *bio, struct page *page,
if (bvec_to_phys(bv) + bv->bv_len ==
page_to_phys(page) + offset) {
bv->bv_len += len;
+ /*
+ * Page is not added to bio vec.
+ * Clear PG_writeback so
filemap_fdatawait_range() won't wait for it.
+ */
+ TestClearPageWriteback(page);
goto done;
}
}

Thanks,
Ming

2014-12-10 22:49:27

by Kent Overstreet

[permalink] [raw]
Subject: Re: Block layer projects that I haven't had time for

On Mon, Dec 08, 2014 at 12:48:13PM +0100, Dongsu Park wrote:
> Thanks for the reply.
>
> On 05.12.2014 19:02, Kent Overstreet wrote:
> > On Thu, Dec 04, 2014 at 12:00:27PM +0100, Dongsu Park wrote:
> > > Playing a little with your block_stuff tree based on 3.15, however,
> > > I think there still seems to be a couple of issues.
> > > First of all, it doesn't work with virtio-blk. A testing Qemu VM panics
> > > at the very early stage of booting. This issue should be addressed as
> > > the first step, so that other parts can be tested.
> >
> > Really? I was testing with virtio-blk, that's odd..
>
> The culprit seems to be the plugging commit.
> Before that change, it works well also with virtio-blk.
> Though that's not the only issue...

Ohh - the plugging stuff was definitely broken and very preliminary. I'm
surprised that code even compiled (or did it?)

> > > Moreover, I've already tried to rebase these patches on top of current
> > > mainline, 3.18-rc7. It's now compilable, but it seems to introduce
> > > more bugs about direct-IO. I didn't manage to find out the reason.
> > > I'd need to also look at the previous review comments in [1], [2].
> > >
> > > Don't you have other trees based on top of 3.17 or higher?
> > > If not, can I create my own tree based on 3.18-rc7 to publish?
> >
> > Yeah, I'd post what you have now and I'll try and take a look.
>
> I've created a git tree to include what I have right now.
> Please see <https://github.com/dongsupark/linux>.
>
> To be able to handle different issues one by one,
> I got the entire tree separated out into 4 branches based on 3.18.
>
> * block-generic-req-for-next : the most stable branch you can test with.
> With this branch, you can test most of block drivers as well as file
> systems with less critical bugs. Though it's not 100% perfect yet,
> e.g. btrfs doesn't seem to work quite well. Thus more tests are needed.
>
> * block-mpage-bvecs-for-next : block-generic-req-for-next + multipage bvecs.
> This branch shows a critical issue that writing blocks to ext4 rootfs
> causes the whole system to crash. Need-to-investigate.
>
> * block-plug-for-next: block-mpage-bvecs-for-next + plugging.
> This branch has an additional bug with virtio-blk, that the kernel
> panics at the very early stage of booting. Need-to-investigate.
>
> * block-dio-rewrite-for-next: block-plug-for-next + dio-rewriting.
> This branch has more issues w.r.t. direct-io. For example, dio_init()
> causes the kernel to panic at the early stage of booting.
>
> All branches are compilable. But it's still somehow half-way complete.
> Commit messages should be properly written too, so that they can be posted
> to mailing lists.

Cool!

I'll take a look. I think the multipage bvec stuff still had a fair amount of
work to do, and getting the dio rewrite finished will be _hard_ (I was close to
having it working at one point, but xfstests was finding some subtle issues).

It'll be pretty awesome if we can get even some of this stuff in.

2014-12-10 22:57:14

by Kent Overstreet

[permalink] [raw]
Subject: Re: Block layer projects that I haven't had time for

On Wed, Dec 10, 2014 at 02:42:14PM -0800, Ming Lin wrote:
> On Mon, Dec 8, 2014 at 3:48 AM, Dongsu Park
> <[email protected]> wrote:
> > Thanks for the reply.
> >
> > On 05.12.2014 19:02, Kent Overstreet wrote:
> >> On Thu, Dec 04, 2014 at 12:00:27PM +0100, Dongsu Park wrote:
> >> > Playing a little with your block_stuff tree based on 3.15, however,
> >> > I think there still seems to be a couple of issues.
> >> > First of all, it doesn't work with virtio-blk. A testing Qemu VM panics
> >> > at the very early stage of booting. This issue should be addressed as
> >> > the first step, so that other parts can be tested.
> >>
> >> Really? I was testing with virtio-blk, that's odd..
> >
> > The culprit seems to be the plugging commit.
> > Before that change, it works well also with virtio-blk.
> > Though that's not the only issue...
> >
> >> > Moreover, I've already tried to rebase these patches on top of current
> >> > mainline, 3.18-rc7. It's now compilable, but it seems to introduce
> >> > more bugs about direct-IO. I didn't manage to find out the reason.
> >> > I'd need to also look at the previous review comments in [1], [2].
> >> >
> >> > Don't you have other trees based on top of 3.17 or higher?
> >> > If not, can I create my own tree based on 3.18-rc7 to publish?
> >>
> >> Yeah, I'd post what you have now and I'll try and take a look.
> >
> > I've created a git tree to include what I have right now.
> > Please see <https://github.com/dongsupark/linux>.
> >
> > To be able to handle different issues one by one,
> > I got the entire tree separated out into 4 branches based on 3.18.
> >
> > * block-generic-req-for-next : the most stable branch you can test with.
> > With this branch, you can test most of block drivers as well as file
> > systems with less critical bugs. Though it's not 100% perfect yet,
> > e.g. btrfs doesn't seem to work quite well. Thus more tests are needed.
> >
> > * block-mpage-bvecs-for-next : block-generic-req-for-next + multipage bvecs.
> > This branch shows a critical issue that writing blocks to ext4 rootfs
> > causes the whole system to crash. Need-to-investigate.
>
> I tried block-mpage-bvecs-for-next branch on qemu-kvm with ext4 rootfs.
> Run "sync" will stuck in kernel.
>
> [ 480.751901] INFO: task sync:4424 blocked for more than 120 seconds.
> [ 480.753064] Not tainted 3.18.0-00025-g46c8231 #39
> [ 480.753720] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [ 480.754737] sync D ffff88001fc11180 0 4424 4338 0x00000000
> [ 480.755719] ffff88001cdfbc98 0000000000000086 ffff88001cdfbba8
> ffff880014adefc0
> [ 480.756810] 0000000000011180 0000000000004000 ffffffff81813460
> ffff880014adefc0
> [ 480.758102] ffff88001cdfbbc8 ffffffff812f08be ffff88001cdfbc18
> ffff880014adf028
> [ 480.759454] Call Trace:
> [ 480.759852] [<ffffffff812f08be>] ? debug_smp_processor_id+0x17/0x19
> [ 480.760609] [<ffffffff8106093e>] ? __enqueue_entity+0x69/0x6b
> [ 480.761318] [<ffffffff8106017e>] ? __dequeue_entity+0x33/0x38
> [ 480.762026] [<ffffffff810601ab>] ? set_next_entity+0x28/0x7d
> [ 480.762739] [<ffffffff8105a4fb>] ? get_parent_ip+0xf/0x3f
> [ 480.763425] [<ffffffff8108562b>] ? ktime_get+0x50/0x8f
> [ 480.763848] [<ffffffff8148abdb>] ? bit_wait_timeout+0x60/0x60
> [ 480.764555] [<ffffffff8148a6be>] schedule+0x6a/0x6c
> [ 480.765186] [<ffffffff8148a74f>] io_schedule+0x8f/0xcd
> [ 480.765841] [<ffffffff8148ac19>] bit_wait_io+0x3e/0x42
> [ 480.766493] [<ffffffff8148ae80>] __wait_on_bit+0x4d/0x86
> [ 480.767183] [<ffffffff810d4302>] ? find_get_pages_tag+0x106/0x133
> [ 480.767847] [<ffffffff810d4a63>] wait_on_page_bit+0x76/0x78
> [ 480.768532] [<ffffffff8106ab59>] ? wake_atomic_t_function+0x2d/0x2d
> [ 480.769262] [<ffffffff810d511f>] filemap_fdatawait_range+0x7e/0x11d
> [ 480.769992] [<ffffffff8148a639>] ? preempt_schedule+0x36/0x51
> [ 480.770677] [<ffffffff8105a4fb>] ? get_parent_ip+0xf/0x3f
> [ 480.771848] [<ffffffff810d51df>] filemap_fdatawait+0x21/0x23
> [ 480.772530] [<ffffffff811458ce>] sync_inodes_sb+0x158/0x1aa
> [ 480.773201] [<ffffffff81480303>] ? br_mdb_dump+0x225/0x495
> [ 480.773885] [<ffffffff81149ad8>] ? fdatawrite_one_bdev+0x18/0x18
> [ 480.774592] [<ffffffff81149aec>] sync_inodes_one_sb+0x14/0x16
> [ 480.775278] [<ffffffff81125937>] iterate_supers+0x6f/0xc4
> [ 480.775847] [<ffffffff81149bf4>] sys_sync+0x35/0x83
> [ 480.776460] [<ffffffff8148da52>] system_call_fastpath+0x12/0x17
>
>
> Here is a quick hack.
>
> diff --git a/block/bio.c b/block/bio.c
> index 4020ccc..fbc7108 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -829,6 +829,11 @@ int bio_add_page(struct bio *bio, struct page *page,
> if (bvec_to_phys(bv) + bv->bv_len ==
> page_to_phys(page) + offset) {
> bv->bv_len += len;
> + /*
> + * Page is not added to bio vec.
> + * Clear PG_writeback so
> filemap_fdatawait_range() won't wait for it.
> + */
> + TestClearPageWriteback(page);
> goto done;
> }
> }
>
> Thanks,
> Ming

Try this fix:

diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index b24a2541a9..3d2610b02e 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -63,15 +63,15 @@ static void buffer_io_error(struct buffer_head *bh)

static void ext4_finish_bio(struct bio *bio)
{
- int i;
int error = !test_bit(BIO_UPTODATE, &bio->bi_flags);
- struct bio_vec *bvec;
+ struct bio_vec bvec;
+ struct bvec_iter iter;

- bio_for_each_segment_all(bvec, bio, i) {
- struct page *page = bvec->bv_page;
+ bio_for_each_page_all(bvec, bio, iter) {
+ struct page *page = bvec.bv_page;
struct buffer_head *bh, *head;
- unsigned bio_start = bvec->bv_offset;
- unsigned bio_end = bio_start + bvec->bv_len;
+ unsigned bio_start = bvec.bv_offset;
+ unsigned bio_end = bio_start + bvec.bv_len;
unsigned under_io = 0;
unsigned long flags;

2014-12-10 23:11:02

by Ming Lin

[permalink] [raw]
Subject: Re: Block layer projects that I haven't had time for

> On Wed, Dec 10, 2014 at 02:42:14PM -0800, Ming Lin wrote:
>> On Mon, Dec 8, 2014 at 3:48 AM, Dongsu Park
>> <[email protected]> wrote:
>> > Thanks for the reply.
>> >
>> > On 05.12.2014 19:02, Kent Overstreet wrote:
>> >> On Thu, Dec 04, 2014 at 12:00:27PM +0100, Dongsu Park wrote:
>> >> > Playing a little with your block_stuff tree based on 3.15, however,
>> >> > I think there still seems to be a couple of issues.
>> >> > First of all, it doesn't work with virtio-blk. A testing Qemu VM
>> panics
>> >> > at the very early stage of booting. This issue should be addressed
>> as
>> >> > the first step, so that other parts can be tested.
>> >>
>> >> Really? I was testing with virtio-blk, that's odd..
>> >
>> > The culprit seems to be the plugging commit.
>> > Before that change, it works well also with virtio-blk.
>> > Though that's not the only issue...
>> >
>> >> > Moreover, I've already tried to rebase these patches on top of
>> current
>> >> > mainline, 3.18-rc7. It's now compilable, but it seems to introduce
>> >> > more bugs about direct-IO. I didn't manage to find out the reason.
>> >> > I'd need to also look at the previous review comments in [1], [2].
>> >> >
>> >> > Don't you have other trees based on top of 3.17 or higher?
>> >> > If not, can I create my own tree based on 3.18-rc7 to publish?
>> >>
>> >> Yeah, I'd post what you have now and I'll try and take a look.
>> >
>> > I've created a git tree to include what I have right now.
>> > Please see <https://github.com/dongsupark/linux>.
>> >
>> > To be able to handle different issues one by one,
>> > I got the entire tree separated out into 4 branches based on 3.18.
>> >
>> > * block-generic-req-for-next : the most stable branch you can test
>> with.
>> > With this branch, you can test most of block drivers as well as file
>> > systems with less critical bugs. Though it's not 100% perfect yet,
>> > e.g. btrfs doesn't seem to work quite well. Thus more tests are
>> needed.
>> >
>> > * block-mpage-bvecs-for-next : block-generic-req-for-next + multipage
>> bvecs.
>> > This branch shows a critical issue that writing blocks to ext4
>> rootfs
>> > causes the whole system to crash. Need-to-investigate.
>>
>> I tried block-mpage-bvecs-for-next branch on qemu-kvm with ext4 rootfs.
>> Run "sync" will stuck in kernel.
>>
>> [ 480.751901] INFO: task sync:4424 blocked for more than 120 seconds.
>> [ 480.753064] Not tainted 3.18.0-00025-g46c8231 #39
>> [ 480.753720] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>> disables this message.
>> [ 480.754737] sync D ffff88001fc11180 0 4424 4338
>> 0x00000000
>> [ 480.755719] ffff88001cdfbc98 0000000000000086 ffff88001cdfbba8
>> ffff880014adefc0
>> [ 480.756810] 0000000000011180 0000000000004000 ffffffff81813460
>> ffff880014adefc0
>> [ 480.758102] ffff88001cdfbbc8 ffffffff812f08be ffff88001cdfbc18
>> ffff880014adf028
>> [ 480.759454] Call Trace:
>> [ 480.759852] [<ffffffff812f08be>] ? debug_smp_processor_id+0x17/0x19
>> [ 480.760609] [<ffffffff8106093e>] ? __enqueue_entity+0x69/0x6b
>> [ 480.761318] [<ffffffff8106017e>] ? __dequeue_entity+0x33/0x38
>> [ 480.762026] [<ffffffff810601ab>] ? set_next_entity+0x28/0x7d
>> [ 480.762739] [<ffffffff8105a4fb>] ? get_parent_ip+0xf/0x3f
>> [ 480.763425] [<ffffffff8108562b>] ? ktime_get+0x50/0x8f
>> [ 480.763848] [<ffffffff8148abdb>] ? bit_wait_timeout+0x60/0x60
>> [ 480.764555] [<ffffffff8148a6be>] schedule+0x6a/0x6c
>> [ 480.765186] [<ffffffff8148a74f>] io_schedule+0x8f/0xcd
>> [ 480.765841] [<ffffffff8148ac19>] bit_wait_io+0x3e/0x42
>> [ 480.766493] [<ffffffff8148ae80>] __wait_on_bit+0x4d/0x86
>> [ 480.767183] [<ffffffff810d4302>] ? find_get_pages_tag+0x106/0x133
>> [ 480.767847] [<ffffffff810d4a63>] wait_on_page_bit+0x76/0x78
>> [ 480.768532] [<ffffffff8106ab59>] ? wake_atomic_t_function+0x2d/0x2d
>> [ 480.769262] [<ffffffff810d511f>] filemap_fdatawait_range+0x7e/0x11d
>> [ 480.769992] [<ffffffff8148a639>] ? preempt_schedule+0x36/0x51
>> [ 480.770677] [<ffffffff8105a4fb>] ? get_parent_ip+0xf/0x3f
>> [ 480.771848] [<ffffffff810d51df>] filemap_fdatawait+0x21/0x23
>> [ 480.772530] [<ffffffff811458ce>] sync_inodes_sb+0x158/0x1aa
>> [ 480.773201] [<ffffffff81480303>] ? br_mdb_dump+0x225/0x495
>> [ 480.773885] [<ffffffff81149ad8>] ? fdatawrite_one_bdev+0x18/0x18
>> [ 480.774592] [<ffffffff81149aec>] sync_inodes_one_sb+0x14/0x16
>> [ 480.775278] [<ffffffff81125937>] iterate_supers+0x6f/0xc4
>> [ 480.775847] [<ffffffff81149bf4>] sys_sync+0x35/0x83
>> [ 480.776460] [<ffffffff8148da52>] system_call_fastpath+0x12/0x17
>>
>>
>> Here is a quick hack.
>>
>> diff --git a/block/bio.c b/block/bio.c
>> index 4020ccc..fbc7108 100644
>> --- a/block/bio.c
>> +++ b/block/bio.c
>> @@ -829,6 +829,11 @@ int bio_add_page(struct bio *bio, struct page
>> *page,
>> if (bvec_to_phys(bv) + bv->bv_len ==
>> page_to_phys(page) + offset) {
>> bv->bv_len += len;
>> + /*
>> + * Page is not added to bio vec.
>> + * Clear PG_writeback so
>> filemap_fdatawait_range() won't wait for it.
>> + */
>> + TestClearPageWriteback(page);
>> goto done;
>> }
>> }
>>
>> Thanks,
>> Ming
>
> Try this fix:

Yes, it fixed ext4 problem.

Just tried to edit a btrfs file.

[ 45.216351] BTRFS error (device sdb1): partial page write in btrfs with
offset 0 and length 8192
[ 45.217522] BTRFS critical (device sdb1): bad ordered accounting left 0
size 4096

Thanks,
Ming

>
> diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
> index b24a2541a9..3d2610b02e 100644
> --- a/fs/ext4/page-io.c
> +++ b/fs/ext4/page-io.c
> @@ -63,15 +63,15 @@ static void buffer_io_error(struct buffer_head *bh)
>
> static void ext4_finish_bio(struct bio *bio)
> {
> - int i;
> int error = !test_bit(BIO_UPTODATE, &bio->bi_flags);
> - struct bio_vec *bvec;
> + struct bio_vec bvec;
> + struct bvec_iter iter;
>
> - bio_for_each_segment_all(bvec, bio, i) {
> - struct page *page = bvec->bv_page;
> + bio_for_each_page_all(bvec, bio, iter) {
> + struct page *page = bvec.bv_page;
> struct buffer_head *bh, *head;
> - unsigned bio_start = bvec->bv_offset;
> - unsigned bio_end = bio_start + bvec->bv_len;
> + unsigned bio_start = bvec.bv_offset;
> + unsigned bio_end = bio_start + bvec.bv_len;
> unsigned under_io = 0;
> unsigned long flags;
>
>

2014-12-11 10:08:17

by Dongsu Park

[permalink] [raw]
Subject: Re: Block layer projects that I haven't had time for

Hi Ming & Kent,

On 10.12.2014 23:11, Ming Lin wrote:
> > On Wed, Dec 10, 2014 at 02:42:14PM -0800, Ming Lin wrote:
> > Try this fix:
> Yes, it fixed ext4 problem.

@kent: Thank you for the patch. Indeed it fixes the ext4 lockup I've seen.
I've applied it to my tree, under the branch block-mpage-bvecs-for-next.
See 0d2e05525a58 ("fs/ext4: fix a lockup when writing blocks into ext4
rootfs") <https://github.com/dongsupark/linux/commit/0d2e05525a58>.

After that of course, more bugs start to appear, e.g. crash with virtio-blk,
like we'd have opened a can of worms. ;-)

> Just tried to edit a btrfs file.
>
> [ 45.216351] BTRFS error (device sdb1): partial page write in btrfs with
> offset 0 and length 8192
> [ 45.217522] BTRFS critical (device sdb1): bad ordered accounting left 0
> size 4096

@ming: I guess you managed to see this error as you're testing with a
SCSI device, not virtio-blk device like me.
Are you seeing it without any back traces?
Does the attached patch fix your issue?
(This is already included in the branch block-mpage-bvecs-for-next.)

Thanks,
Dongsu

====

>From 7cef37e357b4fd636b3d4aa296e8b67ba8db66d1 Mon Sep 17 00:00:00 2001
From: Dongsu Park <[email protected]>
Date: Tue, 9 Dec 2014 18:10:10 +0100
Subject: [PATCH] btrfs: use a correct function for bvec iteration in
btrfs_csum_one_bio()

Commit 94607a8a("block: Convert various code to bio_for_each_page()")
introduced a critical bug in btrfs_csum_one_bio() using
bio_for_each_page_all() for iterating through each bvec.
That should actually call bio_for_each_page() to take the current
offset into account. Without this fix, xfstests/btrfs/012 would
end up with lockup with warnings in btrfs_add_ordered_sum(), because
iter.bi_size becomes < 0.

Signed-off-by: Dongsu Park <[email protected]>
---
fs/btrfs/file-item.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index 6a81176..c7ae23c 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -447,7 +447,7 @@ int btrfs_csum_one_bio(struct btrfs_root *root, struct inode *inode,
sums->bytenr = (u64)bio->bi_iter.bi_sector << 9;
index = 0;

- bio_for_each_page_all(bvec, bio, iter) {
+ bio_for_each_page(bvec, bio, iter) {
if (!contig)
offset = page_offset(bvec.bv_page) + bvec.bv_offset;

--
1.9.3

2014-12-11 10:12:14

by Kent Overstreet

[permalink] [raw]
Subject: Re: Block layer projects that I haven't had time for

On Thu, Dec 11, 2014 at 11:07:51AM +0100, Dongsu Park wrote:
> Hi Ming & Kent,
>
> On 10.12.2014 23:11, Ming Lin wrote:
> > > On Wed, Dec 10, 2014 at 02:42:14PM -0800, Ming Lin wrote:
> > > Try this fix:
> > Yes, it fixed ext4 problem.
>
> @kent: Thank you for the patch. Indeed it fixes the ext4 lockup I've seen.
> I've applied it to my tree, under the branch block-mpage-bvecs-for-next.
> See 0d2e05525a58 ("fs/ext4: fix a lockup when writing blocks into ext4
> rootfs") <https://github.com/dongsupark/linux/commit/0d2e05525a58>.
>
> After that of course, more bugs start to appear, e.g. crash with virtio-blk,
> like we'd have opened a can of worms. ;-)

Yeah :) You'll need to audit every usage of bio_for_each_segment() to figure out
which ones need converting to bio_for_each_page(). I did some of this - but just
enough to get it working for me, and I haven't worked on that code since at
least 3.14 so really everything needs to be audited.

Hopefully in the future a lot of stuff could be converted to
bio_for_each_segment() and changed to use more efficient algorithms, but that
will be a lot more work. With bio_for_each_page() it shouldn't be too hard to
get it stable, though.

2014-12-11 10:21:56

by Dongsu Park

[permalink] [raw]
Subject: Re: Block layer projects that I haven't had time for

Hi Kent,

On 10.12.2014 14:49, Kent Overstreet wrote:
> On Mon, Dec 08, 2014 at 12:48:13PM +0100, Dongsu Park wrote:
> > Thanks for the reply.
> >
> > On 05.12.2014 19:02, Kent Overstreet wrote:
> > > On Thu, Dec 04, 2014 at 12:00:27PM +0100, Dongsu Park wrote:
> > > > Playing a little with your block_stuff tree based on 3.15, however,
> > > > I think there still seems to be a couple of issues.
> > > > First of all, it doesn't work with virtio-blk. A testing Qemu VM panics
> > > > at the very early stage of booting. This issue should be addressed as
> > > > the first step, so that other parts can be tested.
> > >
> > > Really? I was testing with virtio-blk, that's odd..
> >
> > The culprit seems to be the plugging commit.
> > Before that change, it works well also with virtio-blk.
> > Though that's not the only issue...
>
> Ohh - the plugging stuff was definitely broken and very preliminary. I'm
> surprised that code even compiled (or did it?)

No, it didn't. But it was just one of dozens of compile errors.
It's no big deal. After all someone has to get it compiled when rebasing
a patchset on top of a new kernel. :-)

> > > > Moreover, I've already tried to rebase these patches on top of current
> > > > mainline, 3.18-rc7. It's now compilable, but it seems to introduce
> > > > more bugs about direct-IO. I didn't manage to find out the reason.
> > > > I'd need to also look at the previous review comments in [1], [2].
> > > >
> > > > Don't you have other trees based on top of 3.17 or higher?
> > > > If not, can I create my own tree based on 3.18-rc7 to publish?
> > >
> > > Yeah, I'd post what you have now and I'll try and take a look.
> >
> > I've created a git tree to include what I have right now.
> > Please see <https://github.com/dongsupark/linux>.
> >
> > To be able to handle different issues one by one,
> > I got the entire tree separated out into 4 branches based on 3.18.
> >
> > * block-generic-req-for-next : the most stable branch you can test with.
> > With this branch, you can test most of block drivers as well as file
> > systems with less critical bugs. Though it's not 100% perfect yet,
> > e.g. btrfs doesn't seem to work quite well. Thus more tests are needed.
> >
> > * block-mpage-bvecs-for-next : block-generic-req-for-next + multipage bvecs.
> > This branch shows a critical issue that writing blocks to ext4 rootfs
> > causes the whole system to crash. Need-to-investigate.
> >
> > * block-plug-for-next: block-mpage-bvecs-for-next + plugging.
> > This branch has an additional bug with virtio-blk, that the kernel
> > panics at the very early stage of booting. Need-to-investigate.
> >
> > * block-dio-rewrite-for-next: block-plug-for-next + dio-rewriting.
> > This branch has more issues w.r.t. direct-io. For example, dio_init()
> > causes the kernel to panic at the early stage of booting.
> >
> > All branches are compilable. But it's still somehow half-way complete.
> > Commit messages should be properly written too, so that they can be posted
> > to mailing lists.
>
> Cool!
>
> I'll take a look. I think the multipage bvec stuff still had a fair amount of
> work to do, and getting the dio rewrite finished will be _hard_ (I was close to
> having it working at one point, but xfstests was finding some subtle issues).
>
> It'll be pretty awesome if we can get even some of this stuff in.

Thanks. I actually want to tidy up the 1st branch, block-generic-req-for-next,
moving several commits about bio_for_each_page() to block-mpage-bvecs-for-next.
Then I'd be able to make the 1st branch with ca. 14 patches, so they could be
posted to mailing lists a little easily. Though I still couldn't get it
done due to some code dependencies. I'll try it again in the next days.

Dongsu

2014-12-11 19:16:38

by Ming Lin

[permalink] [raw]
Subject: Re: Block layer projects that I haven't had time for

On Thu, Dec 11, 2014 at 2:07 AM, Dongsu Park
<[email protected]> wrote:
> Hi Ming & Kent,
>
> On 10.12.2014 23:11, Ming Lin wrote:
>> > On Wed, Dec 10, 2014 at 02:42:14PM -0800, Ming Lin wrote:
>> > Try this fix:
>> Yes, it fixed ext4 problem.
>
> @kent: Thank you for the patch. Indeed it fixes the ext4 lockup I've seen.
> I've applied it to my tree, under the branch block-mpage-bvecs-for-next.
> See 0d2e05525a58 ("fs/ext4: fix a lockup when writing blocks into ext4
> rootfs") <https://github.com/dongsupark/linux/commit/0d2e05525a58>.
>
> After that of course, more bugs start to appear, e.g. crash with virtio-blk,
> like we'd have opened a can of worms. ;-)
>
>> Just tried to edit a btrfs file.
>>
>> [ 45.216351] BTRFS error (device sdb1): partial page write in btrfs with
>> offset 0 and length 8192
>> [ 45.217522] BTRFS critical (device sdb1): bad ordered accounting left 0
>> size 4096
>
> @ming: I guess you managed to see this error as you're testing with a
> SCSI device, not virtio-blk device like me.
> Are you seeing it without any back traces?
> Does the attached patch fix your issue?
> (This is already included in the branch block-mpage-bvecs-for-next.)

Hi Dongsu,

Yes, I'm testing with a SCSI device.
Tested your latest block-mpage-bvecs-for-next.
Same problem with BTRFS and no any back traces.

Thanks,
Ming

>
> Thanks,
> Dongsu
>
> ====
>
> From 7cef37e357b4fd636b3d4aa296e8b67ba8db66d1 Mon Sep 17 00:00:00 2001
> From: Dongsu Park <[email protected]>
> Date: Tue, 9 Dec 2014 18:10:10 +0100
> Subject: [PATCH] btrfs: use a correct function for bvec iteration in
> btrfs_csum_one_bio()
>
> Commit 94607a8a("block: Convert various code to bio_for_each_page()")
> introduced a critical bug in btrfs_csum_one_bio() using
> bio_for_each_page_all() for iterating through each bvec.
> That should actually call bio_for_each_page() to take the current
> offset into account. Without this fix, xfstests/btrfs/012 would
> end up with lockup with warnings in btrfs_add_ordered_sum(), because
> iter.bi_size becomes < 0.
>
> Signed-off-by: Dongsu Park <[email protected]>
> ---
> fs/btrfs/file-item.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
> index 6a81176..c7ae23c 100644
> --- a/fs/btrfs/file-item.c
> +++ b/fs/btrfs/file-item.c
> @@ -447,7 +447,7 @@ int btrfs_csum_one_bio(struct btrfs_root *root, struct inode *inode,
> sums->bytenr = (u64)bio->bi_iter.bi_sector << 9;
> index = 0;
>
> - bio_for_each_page_all(bvec, bio, iter) {
> + bio_for_each_page(bvec, bio, iter) {
> if (!contig)
> offset = page_offset(bvec.bv_page) + bvec.bv_offset;
>
> --
> 1.9.3
>

2014-12-12 06:32:35

by Ming Lin

[permalink] [raw]
Subject: Re: Block layer projects that I haven't had time for

On 12/11/2014 02:07 AM, Dongsu Park wrote:
> Hi Ming & Kent,
>
> On 10.12.2014 23:11, Ming Lin wrote:
>>> On Wed, Dec 10, 2014 at 02:42:14PM -0800, Ming Lin wrote:
>>> Try this fix:
>> Yes, it fixed ext4 problem.
>
> @kent: Thank you for the patch. Indeed it fixes the ext4 lockup I've seen.
> I've applied it to my tree, under the branch block-mpage-bvecs-for-next.
> See 0d2e05525a58 ("fs/ext4: fix a lockup when writing blocks into ext4
> rootfs") <https://github.com/dongsupark/linux/commit/0d2e05525a58>.
>
> After that of course, more bugs start to appear, e.g. crash with virtio-blk,
> like we'd have opened a can of worms. ;-)
>
>> Just tried to edit a btrfs file.
>>
>> [ 45.216351] BTRFS error (device sdb1): partial page write in btrfs with
>> offset 0 and length 8192
>> [ 45.217522] BTRFS critical (device sdb1): bad ordered accounting left 0
>> size 4096
>
> @ming: I guess you managed to see this error as you're testing with a
> SCSI device, not virtio-blk device like me.
> Are you seeing it without any back traces?
> Does the attached patch fix your issue?
> (This is already included in the branch block-mpage-bvecs-for-next.)

How about below fix?
"bvec.bv_len != PAGE_CACHE_SIZE" should be not valid any more, because now bio
handles arbitrary size, right?

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 2ac08e2..482c89c 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2468,7 +2468,7 @@ static void end_bio_extent_writepage(struct bio *bio, int err)
* advance bv_offset and adjust bv_len to compensate.
* Print a warning for nonzero offsets, and an error
* if they don't add up to a full page. */
- if (bvec.bv_offset || bvec.bv_len != PAGE_CACHE_SIZE) {
+ if (bvec.bv_offset) {
if (bvec.bv_offset + bvec.bv_len != PAGE_CACHE_SIZE)
btrfs_err(BTRFS_I(page->mapping->host)->root->fs_info,
"partial page write in btrfs with offset %u and length %u",
@@ -2481,7 +2481,7 @@ static void end_bio_extent_writepage(struct bio *bio, int err)
}

start = page_offset(page);
- end = start + bvec.bv_offset + bvec.bv_len - 1;
+ end = start + bvec.bv_offset + min(bvec.bv_len, PAGE_CACHE_SIZE) - 1;

if (end_extent_writepage(page, err, start, end))
continue;

Thanks,
Ming

>
> Thanks,
> Dongsu
>
> ====
>
> From 7cef37e357b4fd636b3d4aa296e8b67ba8db66d1 Mon Sep 17 00:00:00 2001
> From: Dongsu Park <[email protected]>
> Date: Tue, 9 Dec 2014 18:10:10 +0100
> Subject: [PATCH] btrfs: use a correct function for bvec iteration in
> btrfs_csum_one_bio()
>
> Commit 94607a8a("block: Convert various code to bio_for_each_page()")
> introduced a critical bug in btrfs_csum_one_bio() using
> bio_for_each_page_all() for iterating through each bvec.
> That should actually call bio_for_each_page() to take the current
> offset into account. Without this fix, xfstests/btrfs/012 would
> end up with lockup with warnings in btrfs_add_ordered_sum(), because
> iter.bi_size becomes < 0.
>
> Signed-off-by: Dongsu Park <[email protected]>
> ---
> fs/btrfs/file-item.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
> index 6a81176..c7ae23c 100644
> --- a/fs/btrfs/file-item.c
> +++ b/fs/btrfs/file-item.c
> @@ -447,7 +447,7 @@ int btrfs_csum_one_bio(struct btrfs_root *root, struct inode *inode,
> sums->bytenr = (u64)bio->bi_iter.bi_sector << 9;
> index = 0;
>
> - bio_for_each_page_all(bvec, bio, iter) {
> + bio_for_each_page(bvec, bio, iter) {
> if (!contig)
> offset = page_offset(bvec.bv_page) + bvec.bv_offset;
>
>

2014-12-12 12:40:16

by Dongsu Park

[permalink] [raw]
Subject: Re: Block layer projects that I haven't had time for

On 11.12.2014 22:32, Ming Lin wrote:
> On 12/11/2014 02:07 AM, Dongsu Park wrote:
> > On 10.12.2014 23:11, Ming Lin wrote:
> >> Just tried to edit a btrfs file.
> >>
> >> [ 45.216351] BTRFS error (device sdb1): partial page write in btrfs with
> >> offset 0 and length 8192
> >> [ 45.217522] BTRFS critical (device sdb1): bad ordered accounting left 0
> >> size 4096
> >
> > @ming: I guess you managed to see this error as you're testing with a
> > SCSI device, not virtio-blk device like me.
> > Are you seeing it without any back traces?
> > Does the attached patch fix your issue?
> > (This is already included in the branch block-mpage-bvecs-for-next.)
>
> How about below fix?
> "bvec.bv_len != PAGE_CACHE_SIZE" should be not valid any more, because now bio
> handles arbitrary size, right?

Thanks for figuring it out. :-)
I applied the change not only in _writepage(), but also in _readpage(),
also with some minor changes. See the attached patch.

I've just tested it under btrfs on a SCSI device via scsi_debug.
No such errors any more. Hope it works also for you.
Probably there will be a lot of places where a single page is assumed.

Cheers,
Dongsu

----
>From db3abe857f8c0f722c5ade450389e78e79ee7d17 Mon Sep 17 00:00:00 2001
From: Ming Lin <[email protected]>
Date: Thu, 11 Dec 2014 22:32:31 -0800
Subject: [PATCH] btrfs: allow read-/writing an extent to handle arbitrarily
sized bios

Fix bugs in end_bio_extent_{read,write}page(), upon reading or writing
biovec that has length of multiple pages. The condition "bvec.bv_len
!= PAGE_CACHE_SIZE" would not be always valid any more, because now
biovec is able to handle arbitrary size of bio. Without this patch,
read/write IO stalls with the following log:

BTRFS error (device sdb1): partial page write in btrfs with offset 0 and length 8192
BTRFS critical (device sdb1): bad ordered accounting left 0 size 4096

Signed-off-by: Ming Lin <[email protected]>
Signed-off-by: Dongsu Park <[email protected]>
---
fs/btrfs/extent_io.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 2ac08e2..790a83b 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2468,7 +2468,7 @@ static void end_bio_extent_writepage(struct bio *bio, int err)
* advance bv_offset and adjust bv_len to compensate.
* Print a warning for nonzero offsets, and an error
* if they don't add up to a full page. */
- if (bvec.bv_offset || bvec.bv_len != PAGE_CACHE_SIZE) {
+ if (bvec.bv_offset) {
if (bvec.bv_offset + bvec.bv_len != PAGE_CACHE_SIZE)
btrfs_err(BTRFS_I(page->mapping->host)->root->fs_info,
"partial page write in btrfs with offset %u and length %u",
@@ -2481,7 +2481,8 @@ static void end_bio_extent_writepage(struct bio *bio, int err)
}

start = page_offset(page);
- end = start + bvec.bv_offset + bvec.bv_len - 1;
+ end = start + bvec.bv_offset
+ + min_t(unsigned int, bvec.bv_len, PAGE_CACHE_SIZE) - 1;

if (end_extent_writepage(page, err, start, end))
continue;
@@ -2548,7 +2549,7 @@ static void end_bio_extent_readpage(struct bio *bio, int err)
* advance bv_offset and adjust bv_len to compensate.
* Print a warning for nonzero offsets, and an error
* if they don't add up to a full page. */
- if (bvec.bv_offset || bvec.bv_len != PAGE_CACHE_SIZE) {
+ if (bvec.bv_offset) {
if (bvec.bv_offset + bvec.bv_len != PAGE_CACHE_SIZE)
btrfs_err(BTRFS_I(page->mapping->host)->root->fs_info,
"partial page read in btrfs with offset %u and length %u",
@@ -2561,7 +2562,8 @@ static void end_bio_extent_readpage(struct bio *bio, int err)
}

start = page_offset(page);
- end = start + bvec.bv_offset + bvec.bv_len - 1;
+ end = start + bvec.bv_offset
+ + min_t(unsigned int, bvec.bv_len, PAGE_CACHE_SIZE) - 1;
len = bvec.bv_len;

mirror = io_bio->mirror_num;
--
2.1.0