2014-01-13 03:53:15

by Hugh Dickins

Subject: next bio iters break discard?

When I try to exercise heavy swapping with discard on mmotm 2014-01-09,
I soon hit a NULL pointer dereference in __blk_recalc_rq_segments():

__blk_recalc_rq_segments
blk_recount_segments
ll_back_merge_fn
bio_attempt_back_merge
blk_queue_bio
generic_make_request
submit_bio
blkdev_issue_discard
swap_do_scheduled_discard
scan_swap_map_try_ssd_cluster
scan_swap_map
get_swap_page
add_to_swap
shrink_page_list
etc. etc.

The crash is on the NULL struct page pointer in page_to_pfn(bv.bv_page)
on line 35 of block/blk-merge.c.

The code around there is not very different from 3.13-rc8 (which doesn't
crash), and I didn't notice REQ_DISCARD or bio_has_data() checks removed.

I think it worked before because the old bio_for_each_segment()
iterator was a straightforward "i < bio->bi_vcnt" loop which would
do nothing when bi_vcnt is 0; but the new iterators are relying
(perhaps) on bio->bi_iter.bi_size which is non-0 despite no data?
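
Roughly, from my reading of include/linux/bio.h (paraphrased, not the
exact macros):

	/* 3.13: walks the biovec array, so with bi_vcnt == 0 the loop
	 * body never runs for a payload-less discard bio */
	for (i = bio->bi_idx, bvl = &bio->bi_io_vec[i];
	     i < bio->bi_vcnt;
	     bvl++, i++)
		...

	/* 3.14-next: keyed off bi_iter.bi_size, which for a discard is
	 * the byte count of the range being discarded, so the body runs
	 * and indexes an empty biovec */
	for (iter = bio->bi_iter;
	     iter.bi_size && ((bvl = bio_iter_iovec(bio, iter)), 1);
	     bio_advance_iter(bio, &iter, bvl.bv_len))
		...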

I expect it would crash in the same way on other recent nexts and
mmotms, I've not tried.

Hugh


2014-01-14 02:33:51

by Kent Overstreet

Subject: Re: next bio iters break discard?

On Sun, Jan 12, 2014 at 07:52:40PM -0800, Hugh Dickins wrote:
> When I try to exercise heavy swapping with discard on mmotm 2014-01-09,
> I soon hit a NULL pointer dereference in __blk_recalc_rq_segments():
>
> __blk_recalc_rq_segments
> blk_recount_segments
> ll_back_merge_fn
> bio_attempt_back_merge
> blk_queue_bio
> generic_make_request
> submit_bio
> blkdev_issue_discard
> swap_do_scheduled_discard
> scan_swap_map_try_ssd_cluster
> scan_swap_map
> get_swap_page
> add_to_swap
> shrink_page_list
> etc. etc.
>
> The crash is on the NULL struct page pointer in page_to_pfn(bv.bv_page)
> on line 35 of block/blk-merge.c.
>
> The code around there is not very different from 3.13-rc8 (which doesn't
> crash), and I didn't notice REQ_DISCARD or bio_has_data() checks removed.
>
> I think it worked before because the old bio_for_each_segment()
> iterator was a straightforward "i < bio->bi_vcnt" loop which would
> do nothing when bi_vcnt is 0; but the new iterators are relying
> (perhaps) on bio->bi_iter.bi_size which is non-0 despite no data?
>
> I expect it would crash in the same way on other recent nexts and
> mmotms, I've not tried.
>
> Hugh

Ugh, discards. I wonder why this wasn't seen sooner - I can't figure out what
the NULL pointer deref actually was, but if it was __blk_recalc_rq_segments()
blowing up, that shouldn't have had to wait for two discards to get merged
before being called.

(calling bio_for_each_segment() on REQ_DISCARD/REQ_WRITE_SAME bios should in
general work; bio_advance_iter() checks against BIO_NO_ADVANCE_ITER_MASK to
determine whether to advance down the bvec or just decrement bi_size. But for
counting segments bio_for_each_segment() is definitely not what we want.)
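
For reference, bio_advance_iter() is roughly this (paraphrasing, not the
exact source):

	static inline void bio_advance_iter(struct bio *bio,
					    struct bvec_iter *iter,
					    unsigned bytes)
	{
		iter->bi_sector += bytes >> 9;

		if (bio->bi_rw & BIO_NO_ADVANCE_ITER_MASK)
			/* discard/write_same: bi_size only tracks the
			 * block range, there's no biovec to walk */
			iter->bi_size -= bytes;
		else
			bvec_iter_advance(bio->bi_io_vec, iter, bytes);
	}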

I think for discards we can deal with this easily enough -
__blk_recalc_rq_segments() will have to special case them - but there's a
similar (but worse) issue with WRITE_SAME, and looking at the code it does
attempt to merge WRITE_SAME requests too.

Jens, Martin - are we sure we want to merge WRITE_SAME requests? I'm not sure I
even see how that makes sense; at the very least it's a potential minefield.

2014-01-14 04:06:48

by Martin K. Petersen

Subject: Re: next bio iters break discard?

>>>>> "Kent" == Kent Overstreet <[email protected]> writes:

Kent,

Kent> I think for discards we can deal with this easily enough -
Kent> __blk_recalc_rq_segments() will have to special case them - but
Kent> there's a similar (but worse) issue with WRITE_SAME, and looking
Kent> at the code it does attempt to merge WRITE_SAME requests too.

DISCARD bios have no payload going down the stack. They get a payload
attached in the sd driver and will therefore have a single bvec at
completion time.

WRITE_SAME bios have a single bvec payload throughout their lifetime.

For both these types of requests we never attempt to merge the actual
payloads. But the block range worked on may shrink or grow as the bio is
split or merged going down the stack.

IOW, DISCARD, WRITE SAME and the impending COPY requests do not have a
1:1 mapping between the block range worked on and the size of any bvecs
attached. Your recent changes must have changed the way we handled that
in the past.
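
To illustrate, the submission path does roughly this (paraphrasing
blkdev_issue_discard(), not the exact source):

	/* a discard bio describes a range but carries no pages */
	bio = bio_alloc(gfp_mask, 1);
	bio->bi_bdev = bdev;
	bio->bi_iter.bi_sector = sector;
	bio->bi_iter.bi_size = nr_sects << 9;	/* range in bytes */
	/* bi_vcnt stays 0 here; sd attaches its payload page later,
	 * at command prep time */
	submit_bio(REQ_WRITE | REQ_DISCARD, bio);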

I'll take a look tomorrow...

--
Martin K. Petersen Oracle Linux Engineering

2014-01-14 04:48:48

by Kent Overstreet

Subject: Re: next bio iters break discard?

On Mon, Jan 13, 2014 at 11:06:33PM -0500, Martin K. Petersen wrote:
> >>>>> "Kent" == Kent Overstreet <[email protected]> writes:
>
> Kent,
>
> Kent> I think for discards we can deal with this easily enough -
> Kent> __blk_recalc_rq_segments() will have to special case them - but
> Kent> there's a similar (but worse) issue with WRITE_SAME, and looking
> Kent> at the code it does attempt to merge WRITE_SAME requests too.
>
> DISCARD bios have no payload going down the stack. They get a payload
> attached in the sd driver and will therefore have a single bvec at
> completion time.
>
> WRITE_SAME bios have a single bvec payload throughout their lifetime.
>
> For both these types of requests we never attempt to merge the actual
> payloads. But the block range worked on may shrink or grow as the bio is
> split or merged going down the stack.
>
> IOW, DISCARD, WRITE SAME and the impending COPY requests do not have a
> 1:1 mapping between the block range worked on and the size of any bvecs
> attached. Your recent changes must have changed the way we handled that
> in the past.

Yeah - but with WRITE_SAME bios, wouldn't we at least have to check that they're
writing the same data to merge them?

2014-01-14 20:17:46

by Martin K. Petersen

Subject: Re: next bio iters break discard?

>>>>> "Kent" == Kent Overstreet <[email protected]> writes:

>> IOW, DISCARD, WRITE SAME and the impending COPY requests do not have
>> a 1:1 mapping between the block range worked on and the size of any
>> bvecs attached. Your recent changes must have changed the way we
>> handled that in the past.

Kent> Yeah - but with WRITE_SAME bios, wouldn't we at least have to
Kent> check that they're writing the same data to merge them?

We do. Check blk_write_same_mergeable().
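
It boils down to roughly this (paraphrased from include/linux/blkdev.h):

	static inline bool blk_write_same_mergeable(struct bio *a,
						    struct bio *b)
	{
		/* only merge WRITE SAME bios that point at the same
		 * payload */
		if (bio_data(a) == bio_data(b))
			return true;

		return false;
	}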

--
Martin K. Petersen Oracle Linux Engineering

2014-01-14 22:21:31

by Kent Overstreet

Subject: Re: next bio iters break discard?

On Tue, Jan 14, 2014 at 03:17:32PM -0500, Martin K. Petersen wrote:
> >>>>> "Kent" == Kent Overstreet <[email protected]> writes:
>
> >> IOW, DISCARD, WRITE SAME and the impending COPY requests do not have
> >> a 1:1 mapping between the block range worked on and the size of any
> >> bvecs attached. Your recent changes must have changed the way we
> >> handled that in the past.
>
> Kent> Yeah - but with WRITE_SAME bios, wouldn't we at least have to
> Kent> check that they're writing the same data to merge them?
>
> We do. Check blk_write_same_mergeable().

Aha, I missed that.

Side note, one of the things on my todo list/wishlist is to make a separate
enum for bio type - bare flush, normal read/write, scsi command, discard
and write same. In particular with bi_size meaning different things
depending on the type of the command, I think having an enum we can
switch off of where appropriate instead of a bunch of random flags will
make things a hell of a lot saner.
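
Something like this (purely hypothetical, not in any tree):

	enum bio_type {
		BIO_TYPE_READ_WRITE,	/* bi_size == payload bytes */
		BIO_TYPE_FLUSH,		/* bare flush, no data */
		BIO_TYPE_DISCARD,	/* bi_size == range, no payload */
		BIO_TYPE_WRITE_SAME,	/* bi_size == range, one bvec */
		BIO_TYPE_SCSI_CMD,	/* passthrough */
	};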

Looking more at this code, I'm not sure it did what we wanted before for
REQ_DISCARD and REQ_WRITE_SAME bios/requests - I suspect for
REQ_WRITE_SAME requests that had been merged it overcounted.

There's also that horrible hack in the scsi (?) code Christoph added for
discards - where the driver adds a page to the bio. If the driver is
counting the number of segments _after_ that, god only knows what is
supposed to happen.
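
(That's blk_add_request_payload() - roughly, paraphrasing block/blk-core.c,
it grafts a single page into the otherwise empty discard bio:

	bio->bi_io_vec->bv_page = page;
	bio->bi_io_vec->bv_offset = 0;
	bio->bi_io_vec->bv_len = len;

	bio->bi_iter.bi_size = len;
	bio->bi_vcnt = 1;
	bio->bi_phys_segments = 1;

	rq->__data_len = rq->resid_len = len;
	rq->nr_phys_segments = 1;

so any segment counting the driver does after that sees one segment.)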

Does the below patch look like what we want? I'm assuming that if
multiple WRITE_SAME bios are merged, since they're all writing the same
data we can consider the entire request to be a single segment.

commit 1755e7ffc5745591d37b8956ce2512f4052a104a
Author: Kent Overstreet <[email protected]>
Date: Tue Jan 14 14:22:01 2014 -0800

block: Explicitly handle discard/write same when counting segments

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 8f8adaa..7d977f8 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -21,6 +21,12 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
if (!bio)
return 0;

+ if (bio->bi_rw & REQ_DISCARD)
+ return 0;
+
+ if (bio->bi_rw & REQ_WRITE_SAME)
+ return 1;
+
fbio = bio;
cluster = blk_queue_cluster(q);
seg_size = 0;

2014-01-16 01:39:57

by Martin K. Petersen

Subject: Re: next bio iters break discard?

>>>>> "Kent" == Kent Overstreet <[email protected]> writes:

Kent> Side note, one of the things on my todo list/wishlist is to make a
Kent> separate enum for bio type - bare flush, normal read/write, scsi
Kent> command, discard and write same. In particular with bi_size
Kent> meaning different things depending on the type of the command, I
Kent> think having an enum we can switch off of where appropriate
Kent> instead of a bunch of random flags will make things a hell of a
Kent> lot saner.

Yeah. That's one of the reasons we cleaned up the merge flags and
introduced bio_has_data() and bio_is_rw().

Kent> There's also that horrible hack in the scsi (?) code Christoph
Kent> added for discards - where the driver adds a page to the bio - if
Kent> the driver is counting the number of segments _after_ that god
Kent> only knows what is supposed to happen.

I manually do the bookkeeping in the SCSI disk driver. I.e. LLDs are
explicitly told to perform a 512-byte transfer. And when they're done I
complete bi_size bytes to the block layer. Kind of ugly but it's an
unfortunate side effect of bi_size being both the size of the
scatterlist and the block count. I did consider making them separate
entities but that means struct bio will grow for no reason for the
common case of read/write. Didn't seem worth the hassle...
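
In sd_done() that amounts to roughly this (paraphrased, not the exact code):

	if (SCpnt->request->cmd_flags & REQ_DISCARD ||
	    SCpnt->request->cmd_flags & REQ_WRITE_SAME) {
		if (!result) {
			/* the LLD only moved the small payload buffer,
			 * but we complete the whole block range back to
			 * the block layer */
			good_bytes = blk_rq_bytes(SCpnt->request);
			scsi_set_resid(SCpnt, 0);
		} else {
			good_bytes = 0;
			scsi_set_resid(SCpnt, blk_rq_bytes(SCpnt->request));
		}
	}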

Kent> Does the below patch look like what we want? I'm assuming that if
Kent> multiple WRITE_SAME bios are merged, since they're all writing the
Kent> same data we can consider the entire request to be a single
Kent> segment.

Looks OK to me.

Acked-by: Martin K. Petersen <[email protected]>

--
Martin K. Petersen Oracle Linux Engineering

2014-01-16 20:21:38

by Hugh Dickins

Subject: Re: next bio iters break discard?

On Tue, 14 Jan 2014, Kent Overstreet wrote:
>
> Does the below patch look like what we want? I'm assuming that if

You don't fill me with confidence ;)

> multiple WRITE_SAME bios are merged, since they're all writing the same
> data we can consider the entire request to be a single segment.
>
> commit 1755e7ffc5745591d37b8956ce2512f4052a104a
> Author: Kent Overstreet <[email protected]>
> Date: Tue Jan 14 14:22:01 2014 -0800
>
> block: Explicitly handle discard/write same when counting segments
>
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 8f8adaa..7d977f8 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -21,6 +21,12 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
> if (!bio)
> return 0;
>
> + if (bio->bi_rw & REQ_DISCARD)
> + return 0;
> +
> + if (bio->bi_rw & REQ_WRITE_SAME)
> + return 1;
> +
> fbio = bio;
> cluster = blk_queue_cluster(q);
> seg_size = 0;

For me this just shifts the crash,
from __blk_recalc_rq_segments() to blk_rq_map_sg():

blk_rq_map_sg
scsi_init_sgtable
scsi_init_io
scsi_setup_blk_pc_cmnd
sd_prep_fn
blk_peek_request
scsi_request_fn
__blk_run_queue
blk_run_queue
scsi_run_queue
scsi_next_command
scsi_io_completion
scsi_finish_command
scsi_softirq_done
blk_done_softirq
__do_softirq
irq_exit
do_IRQ
common_interrupt
<EOI>
cpuidle_idle_call
arch_cpu_idle
cpu_startup_entry
start_secondary

It's GPF'ing on struct scatterlist *sg 0x800000001473e064 in

static inline void sg_assign_page(struct scatterlist *sg, struct page *page)
{
unsigned long page_link = sg->page_link & 0x3;

It appears to be in the static inline __blk_segment_map_sg(),
and that GPF'ing address is what it just got from sg_next().

Sorry, this isn't the kind of dump you'll be used to, but it's the
best I can do at the moment, and I've just had to reboot the machine.

Oh, tried again and it hit the BUG_ON(count > sdb->table.nents)
on line 1048 of drivers/scsi/scsi_lib.c:

scsi_init_sgtable
<IRQ> scsi_init_io
scsi_setup_blk_pc_cmnd
sd_setup_discard_cmnd
sd_prep_fn
blk_peek_request
etc. as before

I'll have to leave the machine shortly - I'm rather hoping
you can do your own discard testing to see such crashes.

Thanks,
Hugh

2014-01-17 01:06:36

by Kent Overstreet

Subject: Re: next bio iters break discard?

On Thu, Jan 16, 2014 at 12:21:10PM -0800, Hugh Dickins wrote:
> On Tue, 14 Jan 2014, Kent Overstreet wrote:
> >
> > Does the below patch look like what we want? I'm assuming that if
>
> You don't fill me with confidence ;)
>
> > multiple WRITE_SAME bios are merged, since they're all writing the same
> > data we can consider the entire request to be a single segment.
> >
> > commit 1755e7ffc5745591d37b8956ce2512f4052a104a
> > Author: Kent Overstreet <[email protected]>
> > Date: Tue Jan 14 14:22:01 2014 -0800
> >
> > block: Explicitly handle discard/write same when counting segments
> >
> > diff --git a/block/blk-merge.c b/block/blk-merge.c
> > index 8f8adaa..7d977f8 100644
> > --- a/block/blk-merge.c
> > +++ b/block/blk-merge.c
> > @@ -21,6 +21,12 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
> > if (!bio)
> > return 0;
> >
> > + if (bio->bi_rw & REQ_DISCARD)
> > + return 0;
> > +
> > + if (bio->bi_rw & REQ_WRITE_SAME)
> > + return 1;
> > +
> > fbio = bio;
> > cluster = blk_queue_cluster(q);
> > seg_size = 0;
>
> For me this just shifts the crash,
> from __blk_recalc_rq_segments() to blk_rq_map_sg():
>
> blk_rq_map_sg
> scsi_init_sgtable
> scsi_init_io
> scsi_setup_blk_pc_cmnd
> sd_prep_fn
> blk_peek_request
> scsi_request_fn
> __blk_run_queue
> blk_run_queue
> scsi_run_queue
> scsi_next_command
> scsi_io_completion
> scsi_finish_command
> scsi_softirq_done
> blk_done_softirq
> __do_softirq
> irq_exit
> do_IRQ
> common_interrupt
> <EOI>
> cpuidle_idle_call
> arch_cpu_idle
> cpu_startup_entry
> start_secondary
>
> It's GPF'ing on struct scatterlist *sg 0x800000001473e064 in
>
> static inline void sg_assign_page(struct scatterlist *sg, struct page *page)
> {
> unsigned long page_link = sg->page_link & 0x3;
>
> It appears to be in the static inline __blk_segment_map_sg(),
> and that GPF'ing address is what it just got from sg_next().
>
> Sorry, this isn't the kind of dump you'll be used to, but it's the
> best I can do at the moment, and I've just had to reboot the machine.
>
> O, tried again and it hit the BUG_ON(count > sdb->table.nents)
> on line 1048 of drivers/scsi/scsi_lib.c:
>
> scsi_init_sgtable
> <IRQ> scsi_init_io
> scsi_setup_blk_pc_cmnd
> sd_setup_discard_cmnd
> sd_prep_fn
> blk_peek_request
> etc. as before
>
> I'll have to leave the machine shortly - I'm rather hoping
> you can do your own discard testing to see such crashes.

My simple hdparm/scsi_debug-based test isn't hitting it - any suggestions on how
to reproduce it?

2014-01-17 01:21:10

by Kent Overstreet

Subject: Re: next bio iters break discard?

On Thu, Jan 16, 2014 at 12:21:10PM -0800, Hugh Dickins wrote:
> For me this just shifts the crash,
> from __blk_recalc_rq_segments() to blk_rq_map_sg():
>
> blk_rq_map_sg
> scsi_init_sgtable
> scsi_init_io
> scsi_setup_blk_pc_cmnd
> sd_prep_fn
> blk_peek_request
> scsi_request_fn
> __blk_run_queue
> blk_run_queue
> scsi_run_queue
> scsi_next_command
> scsi_io_completion
> scsi_finish_command
> scsi_softirq_done
> blk_done_softirq
> __do_softirq
> irq_exit
> do_IRQ
> common_interrupt
> <EOI>
> cpuidle_idle_call
> arch_cpu_idle
> cpu_startup_entry
> start_secondary
>
> It's GPF'ing on struct scatterlist *sg 0x800000001473e064 in
>
> static inline void sg_assign_page(struct scatterlist *sg, struct page *page)
> {
> unsigned long page_link = sg->page_link & 0x3;
>
> It appears to be in the static inline __blk_segment_map_sg(),
> and that GPF'ing address is what it just got from sg_next().
>
> Sorry, this isn't the kind of dump you'll be used to, but it's the
> best I can do at the moment, and I've just had to reboot the machine.
>
> O, tried again and it hit the BUG_ON(count > sdb->table.nents)
> on line 1048 of drivers/scsi/scsi_lib.c:
>
> scsi_init_sgtable
> <IRQ> scsi_init_io
> scsi_setup_blk_pc_cmnd
> sd_setup_discard_cmnd
> sd_prep_fn
> blk_peek_request
> etc. as before

Ok, I reread the code and figured it out - the analogous change also has to be
made in __blk_segment_map_sg(). I'll mail out a patch for this tomorrow after
I've stared at the code more and had less beer.

2014-01-31 17:18:09

by Hugh Dickins

Subject: Re: next bio iters break discard?

On Thu, 16 Jan 2014, Kent Overstreet wrote:
>
> Ok, I reread the code and figured it out - the analogous change also has to be
> made in __blk_segment_map_sg(). I'll mail out a patch for this tomorrow after
> I've stared at the code more and had less beer.

I'd been hoping for a patch to try, but now your changes have hit Linus's
tree: so today we have discard broken there too, crashing as originally
reported on the NULL struct page pointer in __blk_recalc_rq_segments()'s
page_to_pfn(bv.bv_page).

How to reproduce it? I hope you'll find easier ways, but I get it with
swapping to SSD (remember "swapon -d" to enable discard). I'm just doing
what I've done for years, running a pair of make -j20 kbuilds to tmpfs in
limited RAM (I use mem=700M with 1.5G of swap: but that would be far too
little RAM for a general config of current tree), to get plenty of fairly
chaotic swapping but good forward progress nonetheless (if the sizes are
too small, then it'll just thrash abysmally or be OOM-killed).

But please do send me a patch and I'll give it a try - thanks.

Hugh

2014-01-31 21:58:36

by Jens Axboe

Subject: Re: next bio iters break discard?

On Fri, Jan 31 2014, Hugh Dickins wrote:
> On Thu, 16 Jan 2014, Kent Overstreet wrote:
> >
> > Ok, I reread the code and figured it out - the analogous change also has to be
> > made in __blk_segment_map_sg(). I'll mail out a patch for this tomorrow after
> > I've stared at the code more and had less beer.
>
> I'd been hoping for a patch to try, but now your changes have hit Linus's
> tree: so today we have discard broken there too, crashing as originally
> reported on the NULL struct page pointer in __blk_recalc_rq_segments()'s
> page_to_pfn(bv.bv_page).
>
> How to reproduce it? I hope you'll find easier ways, but I get it with
> swapping to SSD (remember "swapon -d" to enable discard). I'm just doing
> what I've done for years, running a pair of make -j20 kbuilds to tmpfs in
> limited RAM (I use mem=700M with 1.5G of swap: but that would be far too
> little RAM for a general config of current tree), to get plenty of fairly
> chaotic swapping but good forward progress nonetheless (if the sizes are
> too small, then it'll just thrash abysmally or be OOM-killed).
>
> But please do send me a patch and I'll give it a try - thanks.

Hugh, sorry about that, this particular issue slipped my mind. Will get
this fixed up as soon as possible, hopefully before -rc1...

--
Jens Axboe

2014-02-04 10:16:34

by Kent Overstreet

Subject: [PATCH] block: Explicitly handle discard/write same segments

Immutable biovecs changed the way biovecs are interpreted - drivers no
longer use bi_vcnt, they have to go by bi_iter.bi_size (to allow for
using part of an existing segment without modifying it).

This breaks with discards and write_same bios, since for those bi_size
has nothing to do with segments in the biovec. So for now, we need a
fairly gross hack - we fortunately know that there will never be more
than one segment for the entire request, so we can special case
discard/write_same.

Signed-off-by: Kent Overstreet <[email protected]>
---
On Fri, Jan 31, 2014 at 09:17:25AM -0800, Hugh Dickins wrote:
> On Thu, 16 Jan 2014, Kent Overstreet wrote:
> >
> > Ok, I reread the code and figured it out - the analogous change also has to be
> > made in __blk_segment_map_sg(). I'll mail out a patch for this tomorrow after
> > I've stared at the code more and had less beer.
>
> I'd been hoping for a patch to try, but now your changes have hit Linus's
> tree: so today we have discard broken there too, crashing as originally
> reported on the NULL struct page pointer in __blk_recalc_rq_segments()'s
> page_to_pfn(bv.bv_page).
>
> How to reproduce it? I hope you'll find easier ways, but I get it with
> swapping to SSD (remember "swapon -d" to enable discard). I'm just doing
> what I've done for years, running a pair of make -j20 kbuilds to tmpfs in
> limited RAM (I use mem=700M with 1.5G of swap: but that would be far too
> little RAM for a general config of current tree), to get plenty of fairly
> chaotic swapping but good forward progress nonetheless (if the sizes are
> too small, then it'll just thrash abysmally or be OOM-killed).
>
> But please do send me a patch and I'll give it a try - thanks.

Hugh - can you give this patch a try? Passes my tests but I was never
able to reproduce your crash, unfortunately.

block/blk-merge.c | 91 +++++++++++++++++++++++++++++++++++++------------------
1 file changed, 62 insertions(+), 29 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 8f8adaa954..6c583f9c5b 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -21,6 +21,16 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
if (!bio)
return 0;

+ /*
+ * This should probably be returning 0, but blk_add_request_payload()
+ * (Christoph!!!!)
+ */
+ if (bio->bi_rw & REQ_DISCARD)
+ return 1;
+
+ if (bio->bi_rw & REQ_WRITE_SAME)
+ return 1;
+
fbio = bio;
cluster = blk_queue_cluster(q);
seg_size = 0;
@@ -161,30 +171,60 @@ new_segment:
*bvprv = *bvec;
}

-/*
- * map a request to scatterlist, return number of sg entries setup. Caller
- * must make sure sg can hold rq->nr_phys_segments entries
- */
-int blk_rq_map_sg(struct request_queue *q, struct request *rq,
- struct scatterlist *sglist)
+static int __blk_bios_map_sg(struct request_queue *q, struct bio *bio,
+ struct scatterlist *sglist,
+ struct scatterlist **sg)
{
struct bio_vec bvec, bvprv = { NULL };
- struct req_iterator iter;
- struct scatterlist *sg;
+ struct bvec_iter iter;
int nsegs, cluster;

nsegs = 0;
cluster = blk_queue_cluster(q);

- /*
- * for each bio in rq
- */
- sg = NULL;
- rq_for_each_segment(bvec, rq, iter) {
- __blk_segment_map_sg(q, &bvec, sglist, &bvprv, &sg,
- &nsegs, &cluster);
- } /* segments in rq */
+ if (bio->bi_rw & REQ_DISCARD) {
+ /*
+ * This is a hack - drivers should be neither modifying the
+ * biovec, nor relying on bi_vcnt - but because of
+ * blk_add_request_payload(), a discard bio may or may not have
+ * a payload we need to set up here (thank you Christoph) and
+ * bi_vcnt is really the only way of telling if we need to.
+ */
+
+ if (bio->bi_vcnt)
+ goto single_segment;
+
+ return 0;
+ }
+
+ if (bio->bi_rw & REQ_WRITE_SAME) {
+single_segment:
+ *sg = sglist;
+ bvec = bio_iovec(bio);
+ sg_set_page(*sg, bvec.bv_page, bvec.bv_len, bvec.bv_offset);
+ return 1;
+ }
+
+ for_each_bio(bio)
+ bio_for_each_segment(bvec, bio, iter)
+ __blk_segment_map_sg(q, &bvec, sglist, &bvprv, sg,
+ &nsegs, &cluster);

+ return nsegs;
+}
+
+/*
+ * map a request to scatterlist, return number of sg entries setup. Caller
+ * must make sure sg can hold rq->nr_phys_segments entries
+ */
+int blk_rq_map_sg(struct request_queue *q, struct request *rq,
+ struct scatterlist *sglist)
+{
+ struct scatterlist *sg = NULL;
+ int nsegs = 0;
+
+ if (rq->bio)
+ nsegs = __blk_bios_map_sg(q, rq->bio, sglist, &sg);

if (unlikely(rq->cmd_flags & REQ_COPY_USER) &&
(blk_rq_bytes(rq) & q->dma_pad_mask)) {
@@ -230,20 +270,13 @@ EXPORT_SYMBOL(blk_rq_map_sg);
int blk_bio_map_sg(struct request_queue *q, struct bio *bio,
struct scatterlist *sglist)
{
- struct bio_vec bvec, bvprv = { NULL };
- struct scatterlist *sg;
- int nsegs, cluster;
- struct bvec_iter iter;
-
- nsegs = 0;
- cluster = blk_queue_cluster(q);
-
- sg = NULL;
- bio_for_each_segment(bvec, bio, iter) {
- __blk_segment_map_sg(q, &bvec, sglist, &bvprv, &sg,
- &nsegs, &cluster);
- } /* segments in bio */
+ struct scatterlist *sg = NULL;
+ int nsegs;
+ struct bio *next = bio->bi_next;
+ bio->bi_next = NULL;

+ nsegs = __blk_bios_map_sg(q, bio, sglist, &sg);
+ bio->bi_next = next;
if (sg)
sg_mark_end(sg);

--
1.9.rc1

2014-02-04 12:26:17

by Hugh Dickins

Subject: Re: [PATCH] block: Explicitly handle discard/write same segments

On Tue, 4 Feb 2014, Kent Overstreet wrote:

> Immutable biovecs changed the way biovecs are interpreted - drivers no
> longer use bi_vcnt, they have to go by bi_iter.bi_size (to allow for
> using part of an existing segment without modifying it).
>
> This breaks with discards and write_same bios, since for those bi_size
> has nothing to do with segments in the biovec. So for now, we need a
> fairly gross hack - we fortunately know that there will never be more
> than one segment for the entire request, so we can special case
> discard/write_same.
>
> Signed-off-by: Kent Overstreet <[email protected]>
> ---
> On Fri, Jan 31, 2014 at 09:17:25AM -0800, Hugh Dickins wrote:
> > On Thu, 16 Jan 2014, Kent Overstreet wrote:
> > >
> > > Ok, I reread the code and figured it out - the analogous change also has to be
> > > made in __blk_segment_map_sg(). I'll mail out a patch for this tomorrow after
> > > I've stared at the code more and had less beer.
> >
> > I'd been hoping for a patch to try, but now your changes have hit Linus's
> > tree: so today we have discard broken there too, crashing as originally
> > reported on the NULL struct page pointer in __blk_recalc_rq_segments()'s
> > page_to_pfn(bv.bv_page).
> >
> > How to reproduce it? I hope you'll find easier ways, but I get it with
> > swapping to SSD (remember "swapon -d" to enable discard). I'm just doing
> > what I've done for years, running a pair of make -j20 kbuilds to tmpfs in
> > limited RAM (I use mem=700M with 1.5G of swap: but that would be far too
> > little RAM for a general config of current tree), to get plenty of fairly
> > chaotic swapping but good forward progress nonetheless (if the sizes are
> > too small, then it'll just thrash abysmally or be OOM-killed).
> >
> > But please do send me a patch and I'll give it a try - thanks.
>
> Hugh - can you give this patch a try? Passes my tests but I was never
> able to reproduce your crash, unfortunately.

Thanks a lot, Kent: I'm glad to report, this is working fine for me.

Hugh

>
> block/blk-merge.c | 91 +++++++++++++++++++++++++++++++++++++------------------
> 1 file changed, 62 insertions(+), 29 deletions(-)
>
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 8f8adaa954..6c583f9c5b 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -21,6 +21,16 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
> if (!bio)
> return 0;
>
> + /*
> + * This should probably be returning 0, but blk_add_request_payload()
> + * (Christoph!!!!)
> + */
> + if (bio->bi_rw & REQ_DISCARD)
> + return 1;
> +
> + if (bio->bi_rw & REQ_WRITE_SAME)
> + return 1;
> +
> fbio = bio;
> cluster = blk_queue_cluster(q);
> seg_size = 0;
> @@ -161,30 +171,60 @@ new_segment:
> *bvprv = *bvec;
> }
>
> -/*
> - * map a request to scatterlist, return number of sg entries setup. Caller
> - * must make sure sg can hold rq->nr_phys_segments entries
> - */
> -int blk_rq_map_sg(struct request_queue *q, struct request *rq,
> - struct scatterlist *sglist)
> +static int __blk_bios_map_sg(struct request_queue *q, struct bio *bio,
> + struct scatterlist *sglist,
> + struct scatterlist **sg)
> {
> struct bio_vec bvec, bvprv = { NULL };
> - struct req_iterator iter;
> - struct scatterlist *sg;
> + struct bvec_iter iter;
> int nsegs, cluster;
>
> nsegs = 0;
> cluster = blk_queue_cluster(q);
>
> - /*
> - * for each bio in rq
> - */
> - sg = NULL;
> - rq_for_each_segment(bvec, rq, iter) {
> - __blk_segment_map_sg(q, &bvec, sglist, &bvprv, &sg,
> - &nsegs, &cluster);
> - } /* segments in rq */
> + if (bio->bi_rw & REQ_DISCARD) {
> + /*
> + * This is a hack - drivers should be neither modifying the
> + * biovec, nor relying on bi_vcnt - but because of
> + * blk_add_request_payload(), a discard bio may or may not have
> + * a payload we need to set up here (thank you Christoph) and
> + * bi_vcnt is really the only way of telling if we need to.
> + */
> +
> + if (bio->bi_vcnt)
> + goto single_segment;
> +
> + return 0;
> + }
> +
> + if (bio->bi_rw & REQ_WRITE_SAME) {
> +single_segment:
> + *sg = sglist;
> + bvec = bio_iovec(bio);
> + sg_set_page(*sg, bvec.bv_page, bvec.bv_len, bvec.bv_offset);
> + return 1;
> + }
> +
> + for_each_bio(bio)
> + bio_for_each_segment(bvec, bio, iter)
> + __blk_segment_map_sg(q, &bvec, sglist, &bvprv, sg,
> + &nsegs, &cluster);
>
> + return nsegs;
> +}
> +
> +/*
> + * map a request to scatterlist, return number of sg entries setup. Caller
> + * must make sure sg can hold rq->nr_phys_segments entries
> + */
> +int blk_rq_map_sg(struct request_queue *q, struct request *rq,
> + struct scatterlist *sglist)
> +{
> + struct scatterlist *sg = NULL;
> + int nsegs = 0;
> +
> + if (rq->bio)
> + nsegs = __blk_bios_map_sg(q, rq->bio, sglist, &sg);
>
> if (unlikely(rq->cmd_flags & REQ_COPY_USER) &&
> (blk_rq_bytes(rq) & q->dma_pad_mask)) {
> @@ -230,20 +270,13 @@ EXPORT_SYMBOL(blk_rq_map_sg);
> int blk_bio_map_sg(struct request_queue *q, struct bio *bio,
> struct scatterlist *sglist)
> {
> - struct bio_vec bvec, bvprv = { NULL };
> - struct scatterlist *sg;
> - int nsegs, cluster;
> - struct bvec_iter iter;
> -
> - nsegs = 0;
> - cluster = blk_queue_cluster(q);
> -
> - sg = NULL;
> - bio_for_each_segment(bvec, bio, iter) {
> - __blk_segment_map_sg(q, &bvec, sglist, &bvprv, &sg,
> - &nsegs, &cluster);
> - } /* segments in bio */
> + struct scatterlist *sg = NULL;
> + int nsegs;
> + struct bio *next = bio->bi_next;
> + bio->bi_next = NULL;
>
> + nsegs = __blk_bios_map_sg(q, bio, sglist, &sg);
> + bio->bi_next = next;
> if (sg)
> sg_mark_end(sg);
>
> --
> 1.9.rc1

2014-02-04 12:35:43

by Kent Overstreet

Subject: Re: [PATCH] block: Explicitly handle discard/write same segments

Thanks!

On Tue, Feb 4, 2014 at 4:25 AM, Hugh Dickins <[email protected]> wrote:
> On Tue, 4 Feb 2014, Kent Overstreet wrote:
>
>> Immutable biovecs changed the way biovecs are interpreted - drivers no
>> longer use bi_vcnt, they have to go by bi_iter.bi_size (to allow for
>> using part of an existing segment without modifying it).
>>
>> This breaks with discards and write_same bios, since for those bi_size
>> has nothing to do with segments in the biovec. So for now, we need a
>> fairly gross hack - we fortunately know that there will never be more
>> than one segment for the entire request, so we can special case
>> discard/write_same.
>>
>> Signed-off-by: Kent Overstreet <[email protected]>
>> ---
>> On Fri, Jan 31, 2014 at 09:17:25AM -0800, Hugh Dickins wrote:
>> > On Thu, 16 Jan 2014, Kent Overstreet wrote:
>> > >
>> > > Ok, I reread the code and figured it out - the analogous change also has to be
>> > > made in __blk_segment_map_sg(). I'll mail out a patch for this tomorrow after
>> > > I've stared at the code more and had less beer.
>> >
>> > I'd been hoping for a patch to try, but now your changes have hit Linus's
>> > tree: so today we have discard broken there too, crashing as originally
>> > reported on the NULL struct page pointer in __blk_recalc_rq_segments()'s
>> > page_to_pfn(bv.bv_page).
>> >
>> > How to reproduce it? I hope you'll find easier ways, but I get it with
>> > swapping to SSD (remember "swapon -d" to enable discard). I'm just doing
>> > what I've done for years, running a pair of make -j20 kbuilds to tmpfs in
>> > limited RAM (I use mem=700M with 1.5G of swap: but that would be far too
>> > little RAM for a general config of current tree), to get plenty of fairly
>> > chaotic swapping but good forward progress nonetheless (if the sizes are
>> > too small, then it'll just thrash abysmally or be OOM-killed).
>> >
>> > But please do send me a patch and I'll give it a try - thanks.
>>
>> Hugh - can you give this patch a try? Passes my tests but I was never
>> able to reproduce your crash, unfortunately.
>
> Thanks a lot, Kent: I'm glad to report, this is working fine for me.
>
> Hugh
>
>>
>> block/blk-merge.c | 91 +++++++++++++++++++++++++++++++++++++------------------
>> 1 file changed, 62 insertions(+), 29 deletions(-)
>>
>> diff --git a/block/blk-merge.c b/block/blk-merge.c
>> index 8f8adaa954..6c583f9c5b 100644
>> --- a/block/blk-merge.c
>> +++ b/block/blk-merge.c
>> @@ -21,6 +21,16 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
>> if (!bio)
>> return 0;
>>
>> + /*
>> + * This should probably be returning 0, but blk_add_request_payload()
>> + * (Christoph!!!!)
>> + */
>> + if (bio->bi_rw & REQ_DISCARD)
>> + return 1;
>> +
>> + if (bio->bi_rw & REQ_WRITE_SAME)
>> + return 1;
>> +
>> fbio = bio;
>> cluster = blk_queue_cluster(q);
>> seg_size = 0;
>> @@ -161,30 +171,60 @@ new_segment:
>> *bvprv = *bvec;
>> }
>>
>> -/*
>> - * map a request to scatterlist, return number of sg entries setup. Caller
>> - * must make sure sg can hold rq->nr_phys_segments entries
>> - */
>> -int blk_rq_map_sg(struct request_queue *q, struct request *rq,
>> - struct scatterlist *sglist)
>> +static int __blk_bios_map_sg(struct request_queue *q, struct bio *bio,
>> + struct scatterlist *sglist,
>> + struct scatterlist **sg)
>> {
>> struct bio_vec bvec, bvprv = { NULL };
>> - struct req_iterator iter;
>> - struct scatterlist *sg;
>> + struct bvec_iter iter;
>> int nsegs, cluster;
>>
>> nsegs = 0;
>> cluster = blk_queue_cluster(q);
>>
>> - /*
>> - * for each bio in rq
>> - */
>> - sg = NULL;
>> - rq_for_each_segment(bvec, rq, iter) {
>> - __blk_segment_map_sg(q, &bvec, sglist, &bvprv, &sg,
>> - &nsegs, &cluster);
>> - } /* segments in rq */
>> + if (bio->bi_rw & REQ_DISCARD) {
>> + /*
>> + * This is a hack - drivers should be neither modifying the
>> + * biovec, nor relying on bi_vcnt - but because of
>> + * blk_add_request_payload(), a discard bio may or may not have
>> + * a payload we need to set up here (thank you Christoph) and
>> + * bi_vcnt is really the only way of telling if we need to.
>> + */
>> +
>> + if (bio->bi_vcnt)
>> + goto single_segment;
>> +
>> + return 0;
>> + }
>> +
>> + if (bio->bi_rw & REQ_WRITE_SAME) {
>> +single_segment:
>> + *sg = sglist;
>> + bvec = bio_iovec(bio);
>> + sg_set_page(*sg, bvec.bv_page, bvec.bv_len, bvec.bv_offset);
>> + return 1;
>> + }
>> +
>> + for_each_bio(bio)
>> + bio_for_each_segment(bvec, bio, iter)
>> + __blk_segment_map_sg(q, &bvec, sglist, &bvprv, sg,
>> + &nsegs, &cluster);
>>
>> + return nsegs;
>> +}
>> +
>> +/*
>> + * map a request to scatterlist, return number of sg entries setup. Caller
>> + * must make sure sg can hold rq->nr_phys_segments entries
>> + */
>> +int blk_rq_map_sg(struct request_queue *q, struct request *rq,
>> + struct scatterlist *sglist)
>> +{
>> + struct scatterlist *sg = NULL;
>> + int nsegs = 0;
>> +
>> + if (rq->bio)
>> + nsegs = __blk_bios_map_sg(q, rq->bio, sglist, &sg);
>>
>> if (unlikely(rq->cmd_flags & REQ_COPY_USER) &&
>> (blk_rq_bytes(rq) & q->dma_pad_mask)) {
>> @@ -230,20 +270,13 @@ EXPORT_SYMBOL(blk_rq_map_sg);
>> int blk_bio_map_sg(struct request_queue *q, struct bio *bio,
>> struct scatterlist *sglist)
>> {
>> - struct bio_vec bvec, bvprv = { NULL };
>> - struct scatterlist *sg;
>> - int nsegs, cluster;
>> - struct bvec_iter iter;
>> -
>> - nsegs = 0;
>> - cluster = blk_queue_cluster(q);
>> -
>> - sg = NULL;
>> - bio_for_each_segment(bvec, bio, iter) {
>> - __blk_segment_map_sg(q, &bvec, sglist, &bvprv, &sg,
>> - &nsegs, &cluster);
>> - } /* segments in bio */
>> + struct scatterlist *sg = NULL;
>> + int nsegs;
>> + struct bio *next = bio->bi_next;
>> + bio->bi_next = NULL;
>>
>> + nsegs = __blk_bios_map_sg(q, bio, sglist, &sg);
>> + bio->bi_next = next;
>> if (sg)
>> sg_mark_end(sg);
>>
>> --
>> 1.9.rc1