2002-09-03 17:47:40

by Linus Torvalds

[permalink] [raw]
Subject: One more bio for for floppy users in 2.5.33..


Ok,
I found another major bio-related bug that definitely explains why the
floppy driver generated corruption - any partial request completion would
be totally messed up by the BIO layer (not the floppy driver).

Any other block device driver that did partial request completion might
also be impacted.

I'm still looking at the floppy driver itself - some of the request
completion code is so messed up as to be unreadable, and some of that may
actually be due to trying to work around the bio problem (unsuccessfully,
I may add). So this may not actually fix things for you yet, simply
because the floppy driver itself still does some strange things.

Jens, oops. We should not update the counts by how much was left
uncompleted, but by how much we successfully completed!

Linus

----
# The following is the BitKeeper ChangeSet Log
# --------------------------------------------
# 02/09/03 [email protected] 1.582
# Major partial request completion boo-boo in the bio layer.
#
# This was _bad_. Major floppy corruption, and possibly the reason
# for other block device corruption for any driver that generated
# partial results for a block device request.
# --------------------------------------------
#
diff -Nru a/drivers/block/ll_rw_blk.c b/drivers/block/ll_rw_blk.c
--- a/drivers/block/ll_rw_blk.c Tue Sep 3 10:53:59 2002
+++ b/drivers/block/ll_rw_blk.c Tue Sep 3 10:53:59 2002
@@ -1992,11 +1992,11 @@
* not a complete bvec done
*/
if (unlikely(nsect > nr_sectors)) {
- int residual = (nsect - nr_sectors) << 9;
+ int partial = nr_sectors << 9;

- bio->bi_size -= residual;
- bio_iovec(bio)->bv_offset += residual;
- bio_iovec(bio)->bv_len -= residual;
+ bio->bi_size -= partial;
+ bio_iovec(bio)->bv_offset += partial;
+ bio_iovec(bio)->bv_len -= partial;
blk_recalc_rq_sectors(req, nr_sectors);
blk_recalc_rq_segments(req);
return 1;


2002-09-03 17:59:48

by Jens Axboe

[permalink] [raw]
Subject: Re: One more bio for for floppy users in 2.5.33..

On Tue, Sep 03 2002, Linus Torvalds wrote:
>
> Ok,
> I found another major bio-related bug that definitely explains why the
> floppy driver generated corruption - any partial request completion would
> be totally messed up by the BIO layer (not the floppy driver).
>
> Any other block device driver that did partial request completion might
> also be impacted.
>
> I'm still looking at the floppy driver itself - some of the request
> completion code is so messed up as to be unreadable, and some of that may
> actually be due to trying to work around the bio problem (unsuccessfully,
> I may add). So this may not actually fix things for you yet, simply
> because the floppy driver itself still does some strange things.
>
> Jens, oops. We should not update the counts by how much was left
> uncompleted, but by how much we successfully completed!

Yeah oops, the most embarassing thing is that Bart and I have both found
this but independently months ago but it seems it got lost at my end (or
your end, but lets not point fingers :-) :-(

Patch is ofcourse correct. I'm not sure other drivers have been hit (of
the used ones), since they would have to use old style completions and
do less than current_nr_sectors in one-go.

--
Jens Axboe

2002-09-03 20:53:10

by Mikael Pettersson

[permalink] [raw]
Subject: Re: One more bio for for floppy users in 2.5.33..

Linus Torvalds writes:
>
> Ok,
> I found another major bio-related bug that definitely explains why the
> floppy driver generated corruption - any partial request completion would
> be totally messed up by the BIO layer (not the floppy driver).
>
> Any other block device driver that did partial request completion might
> also be impacted.
>
> I'm still looking at the floppy driver itself - some of the request
> completion code is so messed up as to be unreadable, and some of that may
> actually be due to trying to work around the bio problem (unsuccessfully,
> I may add). So this may not actually fix things for you yet, simply
> because the floppy driver itself still does some strange things.

Confirmed. 2.5.33 + your two patches still corrupts data on a simple
dd to then from /dev/fd0 test.

/Mikael

2002-09-03 21:45:28

by Linus Torvalds

[permalink] [raw]
Subject: Re: One more bio for for floppy users in 2.5.33..


On Tue, 3 Sep 2002, Mikael Pettersson wrote:
>
> Confirmed. 2.5.33 + your two patches still corrupts data on a simple
> dd to then from /dev/fd0 test.

Ok, if you don't have BK, then here's the floppy driver end_request()
cleanup as a plain patch.

This passes dd tests for me, but they were by no means very exhaustive.

Linus

---
# The following is the BitKeeper ChangeSet Log
# --------------------------------------------
# 02/09/03 [email protected] 1.581.4.2
# Fix floppy driver end_request() handling - it used to do insane
# contortions instead of just calling "end_that_request_first()" with
# the proper sector count.
# --------------------------------------------
#
diff -Nru a/drivers/block/floppy.c b/drivers/block/floppy.c
--- a/drivers/block/floppy.c Tue Sep 3 14:51:09 2002
+++ b/drivers/block/floppy.c Tue Sep 3 14:51:09 2002
@@ -2295,16 +2295,15 @@
{
kdev_t dev = req->rq_dev;

- if (end_that_request_first(req, uptodate, req->hard_cur_sectors))
+ if (end_that_request_first(req, uptodate, current_count_sectors))
return;
add_blkdev_randomness(major(dev));
floppy_off(DEVICE_NR(dev));
blkdev_dequeue_request(req);
end_that_request_last(req);

- /* Get the next request */
- req = elv_next_request(QUEUE);
- CURRENT = req;
+ /* We're done with the request */
+ CURRENT = NULL;
}


@@ -2335,27 +2334,8 @@

/* unlock chained buffers */
spin_lock_irqsave(q->queue_lock, flags);
- while (current_count_sectors && CURRENT &&
- current_count_sectors >= req->current_nr_sectors){
- current_count_sectors -= req->current_nr_sectors;
- req->nr_sectors -= req->current_nr_sectors;
- req->sector += req->current_nr_sectors;
- end_request(req, 1);
- }
+ end_request(req, 1);
spin_unlock_irqrestore(q->queue_lock, flags);
-
- if (current_count_sectors && CURRENT) {
- /* "unlock" last subsector */
- req->buffer += current_count_sectors <<9;
- req->current_nr_sectors -= current_count_sectors;
- req->nr_sectors -= current_count_sectors;
- req->sector += current_count_sectors;
- return;
- }
-
- if (current_count_sectors && !CURRENT)
- DPRINT("request list destroyed in floppy request done\n");
-
} else {
if (rq_data_dir(req) == WRITE) {
/* record write error information */

2002-09-03 22:28:44

by Mikael Pettersson

[permalink] [raw]
Subject: Re: One more bio for for floppy users in 2.5.33..

Linus Torvalds writes:
>
> On Tue, 3 Sep 2002, Mikael Pettersson wrote:
> >
> > Confirmed. 2.5.33 + your two patches still corrupts data on a simple
> > dd to then from /dev/fd0 test.
>
> Ok, if you don't have BK, then here's the floppy driver end_request()
> cleanup as a plain patch.
>
> This passes dd tests for me, but they were by no means very exhaustive.

Success! With this patch and your previous two the floppy driver
passes all tests I've thrown at it, including raw data access,
VFS access with ext2, installing lilo, fsck, and booting the result.

Thanks.

/Mikael

2002-09-04 07:21:38

by Suparna Bhattacharya

[permalink] [raw]
Subject: Re: One more bio for for floppy users in 2.5.33..

On Tue, 03 Sep 2002 23:44:59 +0530, Jens Axboe wrote:

> On Tue, Sep 03 2002, Linus Torvalds wrote:
>>
>> Ok,
>> I found another major bio-related bug that definitely explains why the
>> floppy driver generated corruption - any partial request completion
>> would be totally messed up by the BIO layer (not the floppy driver).
>>
>> Any other block device driver that did partial request completion might
>> also be impacted.
>>
>> Jens, oops. We should not update the counts by how much was left
>> uncompleted, but by how much we successfully completed!
>
> Yeah oops, the most embarassing thing is that Bart and I have both found
> this but independently months ago but it seems it got lost at my end (or
> your end, but lets not point fingers :-) :-(
>

Oh yes, even I had this fixed this in the bio traversal patches
I had posted (had this in the core patch, and mentioned it
in the description in the note :) ), guess it went unnoticed.

BTW, any plans for including those patches ? So far all feedback
I've received (including Jens, Bart, James, Andrew ) seems to
say OK to go from what I can tell. If it seems like the right
thing to do, then I'd rather it go in sooner (at least the core
and comatibility portions), so subsequent driver changes etc
are aligned with this (I've been chasing several kernel versions
with this now :( )

I now have a latest version updated to 2.5.33, but dropped the
IDE portions from it for now, in view of the ide switch and
ongoing changes ... If Bart upgrades his corresponding patches
then that would take care of it. Otherwise I could give that
a go too you think that seems worthwhile and may relook at
whether the "traverse for submission" helpers I have can be
improved upon (maybe given better names too).

A quick recap:

BIO traversal enhancements
--------------------------
- Pre-req for full BIO splitting infrastructure [Splits
in the middle of a segment can also share the same vec]
- Pre-req for attaching a pre-built vector (not owned by block)
to a bio (e.g for aio) [Avoids certain subtle side-effects in
special cases like partial segment completion]
- Pre-req for IDE mult-count PIO write fixes w/o request copy
[Ability to traverse a bio partially for i/o submission without
effecting bio completion]

Introduces some new fields in the bio and request to help maintain
traversal state and handle partial segment bio clones, and comes
with a corresponding set of helpers to enable clean separation
of traversal for i/o submission vs i/o completion.

Regards
Suparna

> Patch is ofcourse correct. I'm not sure other drivers have been hit (of
> the used ones), since they would have to use old style completions and
> do less than current_nr_sectors in one-go.
>

2002-09-04 16:32:34

by Linus Torvalds

[permalink] [raw]
Subject: Re: One more bio for for floppy users in 2.5.33..


On Wed, 4 Sep 2002, Suparna Bhattacharya wrote:
>
> Oh yes, even I had this fixed this in the bio traversal patches
> I had posted (had this in the core patch, and mentioned it
> in the description in the note :) ), guess it went unnoticed.

Well, I've never seen a "this should go in" about it.

Also, it was apparently mixed up with the "bio splitup" stuff, which was
discussed at least with Jens, and I feel strongly that we shouldn't split,
we should build up. Jens was working on exactly that.

In other words, I absolutely hate the fact that a major bug-fix was
(a) not marked as such and sent to me
and
(b) mixed up with experimental work for other drivers

Even now (assuming I hadn't fixed it on my own), I would have preferred to
get that fix separately, as it would have impacted the floppy driver, for
example (the fix broke the floppy driver even more than it was before,
because the floppy driver was stupidly trying to work around the original
bug by hand).

Imagine what a horror it is to figure out why a large experimental patch
breaks an existing driver? My first reaction would have been to just throw
the large new patch away, since it obviously broke the floppy even more.
Instead, if I had been passed the bug-fix only, and the floppy had broken
worse that it was originally, it would have been absolutely _obvious_
where the problem was.

In short: please please PLEASE keep fixes to existing code separate from
new stuff. It makes everything easier, and I have absolutely no problems
with applying "obvious fixes" even if they might break something else.

In contrast, the new stuff I really don't know if it should go in at all,
considering that it's trivial (and CPU-efficient) to build up legal bio
request on the fly and _not_ depend on splitting them later (or at least
making splitting a special thing only used by things like MD and other
such indirection layers).

Linus

2002-09-05 06:59:14

by Suparna Bhattacharya

[permalink] [raw]
Subject: Re: One more bio for for floppy users in 2.5.33..

On Wed, Sep 04, 2002 at 09:36:42AM -0700, Linus Torvalds wrote:
>
> On Wed, 4 Sep 2002, Suparna Bhattacharya wrote:
> >
> > Oh yes, even I had this fixed this in the bio traversal patches
> > I had posted (had this in the core patch, and mentioned it
> > in the description in the note :) ), guess it went unnoticed.
>
> Well, I've never seen a "this should go in" about it.
>
> Also, it was apparently mixed up with the "bio splitup" stuff, which was

True, that fix ought to have gone in as a separate patch. As Jens said,
it was like we all knew it had to be fixed, but maybe no one quite saw
the urgency till the case that used it came about.

Barthlomeij actually raised another point about the situation when the
partial completion happens with an error (i.e. not uptodate case), a situation
which doesn't automatically get handled correctly. Something to think
about ?

> discussed at least with Jens, and I feel strongly that we shouldn't split,
> we should build up. Jens was working on exactly that.
>
> In other words, I absolutely hate the fact that a major bug-fix was
> (a) not marked as such and sent to me
> and
> (b) mixed up with experimental work for other drivers

I agree. The fix was a matter on the side. I'd expected it to have gone
in otherwise anyway, since it wasn't something new I'd discovered. It
was perhaps just so obvious that it didn't happen so far !

I couldn't resist bringing it up in this context, because I really wanted
some critical feedback about whether the approach seemed right, or
whether I should just let the patch lie by the side till someone found a use
for it, without having feeling guilty about having abandoned it and not
following it through enough. At least at Ottawa it sounded like this was
something people were interested in, and that was the whole reason for
my doing in the first place. (Will get to the bio splitting point later
down the line).

After all, getting all drivers fixed up to adhere to this would require
some work, something I know I can't do on my own (I just
don't understand the subtleties that someone more familiar which them
would), and which won't be right to cause others to handle unless people
felt it was the correct and useful thing to do.

I don't even mind dropping it if it seems like a "not now" or "not right"
thing (I already seem to spent more effort on this then I probably should
have), but at least after a bit of discussion to take it to that point.

>
> Even now (assuming I hadn't fixed it on my own), I would have preferred to
> get that fix separately, as it would have impacted the floppy driver, for
> example (the fix broke the floppy driver even more than it was before,
> because the floppy driver was stupidly trying to work around the original
> bug by hand).
>
> Imagine what a horror it is to figure out why a large experimental patch
> breaks an existing driver? My first reaction would have been to just throw
> the large new patch away, since it obviously broke the floppy even more.
> Instead, if I had been passed the bug-fix only, and the floppy had broken
> worse that it was originally, it would have been absolutely _obvious_
> where the problem was.
>
> In short: please please PLEASE keep fixes to existing code separate from
> new stuff. It makes everything easier, and I have absolutely no problems
> with applying "obvious fixes" even if they might break something else.

OK.

>
> In contrast, the new stuff I really don't know if it should go in at all,
> considering that it's trivial (and CPU-efficient) to build up legal bio
> request on the fly and _not_ depend on splitting them later (or at least
> making splitting a special thing only used by things like MD and other
> such indirection layers).

This is exactly the kind of feedback or discussion I need.

The patch actually addresses a set of problems and not just bio splitting.

Ideally I'd have liked to separately handle different problems differently
in separate patches. In fact that's the way I had started out at first,
only to realise that things become much simpler or cleaner to implement,
if we treat this as a whole (excluding the bug fix above) as a basic
traversal logic change.

The problems are like this:

Today partial segment completion of a bio affects the bv_offset and bv_len
fields in the vecs themselves, because there is nothing in the bio to
remember how far we are inside a current segment (and as deriving that by
calculating it backwards from bi_size and bi_vcnt could be inefficient).
This means that a higher layer which has passed a vector to bio can't
assume that the vec entries would be valid after the i/o is completed.
So though we have a uniform vec abstraction, passing it around subsystems
seamlessly may throw up some unexpected results. (Just that are likely
not to hit this often, which is why we don't hear complaints about it)

Today, it is hard to handle partial submission of a request/bio where
partial submission state has to be remembered across function invokations
(or interrupts) without making a copy of the request _and_ its corresponding
bios (together with vectors), since the block layer traversal functions
combine traversal and completion without a clear concept of separation
between submission and completion state within a request. IDE mult-count
PIO is of course a case in point. Not sure if there are others.

Today a bio cannot be split in the middle of a segment, which means that
MD or other such indirection layers need to do their own bvec allocations
in such cases. Once we have bio with has a different start offset from the
first bvec's start, we lose some simplicity (which is why I'm not
sure if all drivers necessarily need to support such bio s the first time
around). But with such a field in the bio (bi_voffset), we can use
that as a moving counter during traversal. (BTW, an extra field may not
have been indispensable if we didn't need to separate submission and
completion state, e.g. rq->current_nr_sectors might work for just this
purpose as well, and that was what I was planning to try before Jens
suggested going ahead with bi_voffset, but then found it handy to
manage completion state separately from submission state).

BTW, I agree splitting truly becomes an issue mainly in the
case of such layered drivers, in other cases it is a good idea to build
before hand; but for indirection drivers it _is_ an issue ... one can't expect
to predict all situations and that would limit flexibility there.

Another point that I've always wanted to bring up is that the moment we
get to a need to divide an i/o into separate requests either to a single
or multiple drivers, we _are_ doing an allocation of the request struct
(using a slot in the pool of available requests for that queue). Building
the right bio sizes doesn't do away with this, does it ?

In any case, special case problem situations can always be dealt with
special case code, but since we've redesigned the block i/o layer
the question is whether we've got our abstractions right to deal with
various requirements at the appropriate level and avoid workarounds
by drivers or callers into block.

So rather than look at the above problems individually and frequency
of situations where we can imagine they'd occur in practice, could
you think about the basics a bit and let me know what you think ?

(a) Not the right away to go at all ?
(b) Think about it later, not now ?
(c) Try and get more pieces working correctly with this and then see ?

One concern that I have with the patches I posted (I did split them
up last time around, so they look smaller) is the increase in number
of counters to update / track. This also has the effect of requiring
updates happen via helpers rather than drivers individually managing
these counters which is not necessarily a bad thing in itself, but
I have to say that I was a trifle worried about potential
unexpected side effects in the short run for drivers that make their
own assumptions about these. I was hoping that driver maintainers
would be able to help sound an alert for any issues they see right
away.

Regards
Suparna

>
> Linus
>

2002-09-05 15:01:55

by Linus Torvalds

[permalink] [raw]
Subject: Re: One more bio for for floppy users in 2.5.33..


[ Suparna, I'll only react to one thing in your email right now, I'll try
to take the time to look at your _real_ questions later after I've had
my coffee and woken up.. ]

On Thu, 5 Sep 2002, Suparna Bhattacharya wrote:
>
> Barthlomeij actually raised another point about the situation when the
> partial completion happens with an error (i.e. not uptodate case), a situation
> which doesn't automatically get handled correctly. Something to think
> about ?

Good point. Yes. However, it's hard to pass the errors down, since we've
largely lost the individual parts of the bio (ie people don't even use the
buffer_heads any more.

There's a somewhat related issue with bio's, namely that partial
_successful_ completion also doesn't notify anybody, which can suck from a
latency standpoint if there are big delays in the partial IO (even if it
is all successful). That's certainly true of the floppy driver, for
example, where we can build up a 64kB bio due to read-ahead and then it
takes many milliseconds between partial completions (which are done one
track at a time in the absolute best case).

Note that this actually makes read-ahead much less effective, because it
means that we're not doing work while the read-ahead is happening: the
read-ahead improves throughput by doing big blocks at a time, but it does
_not_ get the improvement that it used to get of having asynchronous IO
going on.

We should wake up the person that maybe only needed the first page.

But right now, we've kind of lost that ability, because the bio itself
does not contain any such information. We used to have a list of buffer
heads in the request, and could wake them up one by one, but..

I would suggest:

- add a "nr of sectors completed" argument to the "bi_end_io()" function,
so that it looks like

void xxx_bio_end_io(struct bio *bio, unsigned long completed)
{
/*
* Old completion handlers that don't understand it
* should just return immediately for partial bio
* completion notifiers
*/
if (bio->b_size)
return;
...
}

which would allow things like mpage_end_io_read() to unlock pages as
they complete, instead of unlocking them all in one go.

Comments? It looks trivial to do this from a bio level, and it would not
be hard to update the existing end_io functions (because you can always
just update them with the one-liner "if not totally done, return" to get
the old behaviour).

Andrew? I really dislike the lack of concurrency in our current mpage
read-ahead thing. Whaddayathink?

Linus

2002-09-05 16:09:08

by Andrew Morton

[permalink] [raw]
Subject: Re: One more bio for for floppy users in 2.5.33..

Linus Torvalds wrote:
>
> ...
> I would suggest:
>
> - add a "nr of sectors completed" argument to the "bi_end_io()" function,
> so that it looks like
>
> void xxx_bio_end_io(struct bio *bio, unsigned long completed)
> {
> /*
> * Old completion handlers that don't understand it
> * should just return immediately for partial bio
> * completion notifiers
> */
> if (bio->b_size)
> return;
> ...
> }
>
> which would allow things like mpage_end_io_read() to unlock pages as
> they complete, instead of unlocking them all in one go.

It's a feature! We don't want to have to soak up 20,000 context
switches a second just reading a file from an 80MB/sec disk.

> Comments? It looks trivial to do this from a bio level, and it would not
> be hard to update the existing end_io functions (because you can always
> just update them with the one-liner "if not totally done, return" to get
> the old behaviour).
>
> Andrew? I really dislike the lack of concurrency in our current mpage
> read-ahead thing. Whaddayathink?

Well it is supposed to lay out two BIOs, but if that happens, it's
fragile - it relies on BIO_MAX_SIZE=64k and default readahead=128k.

What I think we need to do here is to get Jens' bio_add_page() stuff
up and running and to then go through the device drivers and set their
max BIO size to something which is inversely proportional to the
device's expected bandwidth.

This way, the floppy readahead would lay out 1- or 2-page BIOs, and
the honking FC array would lay out 128k or larger BIOs.

(In fact, I would prefer that the device's nominal read- and write-
bandwidths be made available in some manner. This way the VM can
make these granularity-size decisions for readahead, and the VM
can then solve the machine-full-of-dirty-memory-against-a-slow-device
problem. But the non-blocking page reclaim code probably solves that
too).

2002-09-05 16:58:25

by Linus Torvalds

[permalink] [raw]
Subject: Re: One more bio for for floppy users in 2.5.33..


On Thu, 5 Sep 2002, Andrew Morton wrote:
> Linus Torvalds wrote:
> >
> > ...
> > I would suggest:
> >
> > - add a "nr of sectors completed" argument to the "bi_end_io()" function,
> > so that it looks like
> >
> > void xxx_bio_end_io(struct bio *bio, unsigned long completed)
> > {
> > /*
> > * Old completion handlers that don't understand it
> > * should just return immediately for partial bio
> > * completion notifiers
> > */
> > if (bio->b_size)
> > return;
> > ...
> > }
> >
> > which would allow things like mpage_end_io_read() to unlock pages as
> > they complete, instead of unlocking them all in one go.
>
> It's a feature! We don't want to have to soak up 20,000 context
> switches a second just reading a file from an 80MB/sec disk.

You didn't think it through.

The current behaviour is a BUG.

A fast disk driver will _never ever_ do a partial request completion. A
high-performance subsystem will put in the scatter-gather list and say
"go" to the controller, and the controller will send exactly one interrupt
back when it is all done.

So for such a system, you'd never see partial completions anyway.

Partial completions are a feature of slow hardware. And slow hardware is
exactly when we want to know about it.

So my approach has no downsides, only upsides.

Linus

2002-09-05 17:57:45

by Andrew Morton

[permalink] [raw]
Subject: Re: One more bio for for floppy users in 2.5.33..

Linus Torvalds wrote:
>
> ...
> A fast disk driver will _never ever_ do a partial request completion. A
> high-performance subsystem will put in the scatter-gather list and say
> "go" to the controller, and the controller will send exactly one interrupt
> back when it is all done.

OK. But still, I don't see why we need partial BIO completions. If
we say that the basic unit of completion is a whole BIO, then readahead
can then manage latency via the outgoing bio size.

> So for such a system, you'd never see partial completions anyway.
>
> Partial completions are a feature of slow hardware. And slow hardware is
> exactly when we want to know about it.

Well I'd be interested in knowing specifically what is wrong with the
behaviour of 2.5.33 against a floppy disk.

In the testing I did a few weeks back, everything checked out. An
application which was reading the raw device at 95% of media bandwidth
never blocked. An application which was capable of processing data at
120% of media bandwidth achieved 100%.

It could be that the initial 64k read at the start-of-file is
too big, and the many-small-file behaviour is poor?

A specific "this sucks" testcase would be helpful...

2002-09-05 18:21:13

by Linus Torvalds

[permalink] [raw]
Subject: Re: One more bio for for floppy users in 2.5.33..


On Thu, 5 Sep 2002, Andrew Morton wrote:
>
> OK. But still, I don't see why we need partial BIO completions. If
> we say that the basic unit of completion is a whole BIO, then readahead
> can then manage latency via the outgoing bio size.

But that's horrible. The floppy driver can take huge bio's no problem, and
limiting bio sizes to track sizes would be a huge pain in the driver for
no good reason. In fact, it would be pretty much impossible, since the
tracks aren't even page-aligned.

So limiting bio's fundamentally _cannot_ do the right thing. While adding
two lines of code _can_.

> Well I'd be interested in knowing specifically what is wrong with the
> behaviour of 2.5.33 against a floppy disk.

Try it (not plain 2.5.33, but current BK where floppy actually works). It
works, but reading a single sector will pause until it has read several
tracks. Even though the sector came back much earlier.

Also, you missed the error case argument. Right now we _cannot_ handle
error cases for partial transfers AT ALL. The bio interface simply does
not support it. And there is no way you can fix that with the current
interface. While the partial completion case allows for at least partial
recovery.

Andrew, give it up. The current interface _sucks_.

> In the testing I did a few weeks back, everything checked out. An
> application which was reading the raw device at 95% of media bandwidth
> never blocked. An application which was capable of processing data at
> 120% of media bandwidth achieved 100%.

And an application that only wanted one sector? To notice that the
filesystem is of the wrong type, for example?

Throughput is _secondary_ to many latency concerns. And you cannot fix the
latency with full bio's (see the track issue).

Linus

2002-09-05 18:39:57

by Andrew Morton

[permalink] [raw]
Subject: Re: One more bio for for floppy users in 2.5.33..

Linus Torvalds wrote:
>
> On Thu, 5 Sep 2002, Andrew Morton wrote:
> >
> > OK. But still, I don't see why we need partial BIO completions. If
> > we say that the basic unit of completion is a whole BIO, then readahead
> > can then manage latency via the outgoing bio size.
>
> But that's horrible. The floppy driver can take huge bio's no problem, and
> limiting bio sizes to track sizes would be a huge pain in the driver for
> no good reason. In fact, it would be pretty much impossible, since the
> tracks aren't even page-aligned.

erp. Yes, a BIO can span several tracks. I see the point.

> ...
> > In the testing I did a few weeks back, everything checked out. An
> > application which was reading the raw device at 95% of media bandwidth
> > never blocked. An application which was capable of processing data at
> > 120% of media bandwidth achieved 100%.
>
> And an application that only wanted one sector? To notice that the
> filesystem is of the wrong type, for example?

OK. That's the initial-64k thing. If you read 512 bytes from /dev/fd0,
readahead will go and issue a 64k request, and your 512-byte request takes
ages.

> Throughput is _secondary_ to many latency concerns. And you cannot fix the
> latency with full bio's (see the track issue).

The `nr_of_sectors_completed' would be tricky to handle - we don't know how
to bring the pagecache pages uptodate in response to a 512-byte completion.
We'd have to teach the pagecache read functions to permit userspace to read
from non-uptodate pages by looking at the buffer_head state. And then
look at buffer_req to distinguish "this part of the page hasn't been read yet"
from "this part of the page had an IO error". Ick.

It would be simpler if it was nr_of_pages_completed.

2002-09-05 18:30:48

by Linus Torvalds

[permalink] [raw]
Subject: Re: One more bio for for floppy users in 2.5.33..


On Thu, 5 Sep 2002, Jens Axboe wrote:
>
> However, I don't see how this is a two-liner change. Basically you are
> changing bi_end_io() from a completion to partial completion invokation,

Right. So we'll add one

bio->bi_end_io(bio, nr_sectors)

to the partial bio completion case in "finish_that_request_first()".

That's one line.

The other line is

if (bio->bi_size) return;

which has to be added to the end-request thing.

(yeah, yeah, there are several end-request functions, and we need to fix
up the function prototypes too etc, but the point is that the actual
_work_ is just two real lines of new code).

Linus

2002-09-05 18:26:52

by Jens Axboe

[permalink] [raw]
Subject: Re: One more bio for for floppy users in 2.5.33..

On Thu, Sep 05 2002, Linus Torvalds wrote:
> > OK. But still, I don't see why we need partial BIO completions. If
> > we say that the basic unit of completion is a whole BIO, then readahead
> > can then manage latency via the outgoing bio size.
>
> But that's horrible. The floppy driver can take huge bio's no problem, and
> limiting bio sizes to track sizes would be a huge pain in the driver for
> no good reason. In fact, it would be pretty much impossible, since the
> tracks aren't even page-aligned.
>
> So limiting bio's fundamentally _cannot_ do the right thing. While adding
> two lines of code _can_.

I agree that partial completions are the right thing to do here, and in
fact this is how the interface was originally remember?

However, I don't see how this is a two-liner change. Basically you are
changing bi_end_io() from a completion to partial completion invokation,
which requires changing (and complicating) all of them. Just adding
a sector count to bio_endio() does not enable that to partially complete
some pages. What am I missing?

Jens

2002-09-05 18:33:36

by Jens Axboe

[permalink] [raw]
Subject: Re: One more bio for for floppy users in 2.5.33..

On Thu, Sep 05 2002, Linus Torvalds wrote:
>
> On Thu, 5 Sep 2002, Jens Axboe wrote:
> >
> > However, I don't see how this is a two-liner change. Basically you are
> > changing bi_end_io() from a completion to partial completion invokation,
>
> Right. So we'll add one
>
> bio->bi_end_io(bio, nr_sectors)
>
> to the partial bio completion case in "finish_that_request_first()".
>
> That's one line.
>
> The other line is
>
> if (bio->bi_size) return;
>
> which has to be added to the end-request thing.
>
> (yeah, yeah, there are several end-request functions, and we need to fix
> up the function prototypes too etc, but the point is that the actual
> _work_ is just two real lines of new code).

Ah ok, then we completely agree on what needs doing :-)

And yes, this is 100% identical to the earlier bio code (I forget what
revisions, and that was prior to BK I think).

Jens

2002-09-05 18:56:15

by Linus Torvalds

[permalink] [raw]
Subject: Re: One more bio for for floppy users in 2.5.33..


On Thu, 5 Sep 2002, Andrew Morton wrote:
>
> It would be simpler if it was nr_of_pages_completed.

Well.. Maybe.

I actually think that you in practice really only have two cases:

- we only care about full completion. Easily tested for by just looking
at bi_size, and leaving the code as it is right now.

- the code really cares about and uses the incremental information. At
which point it will already have "handled" any previous incremental
stuff, and the only thing it really cares about is the "new increment".

So I'd suggest making at least the first cut only have the incremental
information, since the global information already exists through bi_size.

Linus

2002-09-05 19:42:41

by Peter Osterlund

[permalink] [raw]
Subject: Re: One more bio for for floppy users in 2.5.33..

Jens Axboe <[email protected]> writes:

> And yes, this is 100% identical to the earlier bio code (I forget what
> revisions, and that was prior to BK I think).

It changed in 2.5.5-pre1:

<[email protected]> (02/02/11 1.262.3.11)
Remove nr_sectors from bio_end_io end I/O callback. It was a relic
from when completion was potentially called more than once to indicate
partial end I/O. These days bio->bi_end_io is _only_ called when I/O
has completed on the entire bio.

--
Peter Osterlund - [email protected]
http://w1.894.telia.com/~u89404340

2002-09-05 20:12:12

by Andrew Morton

[permalink] [raw]
Subject: Re: One more bio for for floppy users in 2.5.33..

Linus Torvalds wrote:
>
> On Thu, 5 Sep 2002, Andrew Morton wrote:
> >
> > It would be simpler if it was nr_of_pages_completed.
>
> Well.. Maybe.
>
> I actually think that you in practice really only have two cases:
>
> - we only care about full completion. Easily tested for by just looking
> at bi_size, and leaving the code as it is right now.

OK.

> - the code really cares about and uses the incremental information. At
> which point it will already have "handled" any previous incremental
> stuff, and the only thing it really cares about is the "new increment".

So the BIO client would need to keep some state somewhere about "this is
the next page to unlock". That would best be in the BIO somewhere.

(I assume the block layer handles the case where the hardware signals
completion in non-sector-ascending-order? That must have been fun to
code and test).

> So I'd suggest making at least the first cut only have the incremental
> information, since the global information already exists through bi_size.

Well we still will have the problem where reading 512 bytes from /dev/fd0
does 64k of IO. That is most sweet for reading a bunch of 32k ext2 files
from a hard drive, not so nice for reading fd0's partition table.

We can do all sorts of things by dropping parameters, hints and tunables
into q->backing_dev_info. We just need to work out what is sensible for
this. ummm. I guess backing_dev_info.read_k_per_sec would suffice.

Or .i_am_a_floppy ;)

2002-09-05 20:20:12

by Linus Torvalds

[permalink] [raw]
Subject: Re: One more bio for for floppy users in 2.5.33..


On Thu, 5 Sep 2002, Andrew Morton wrote:
>
> > - the code really cares about and uses the incremental information. At
> > which point it will already have "handled" any previous incremental
> > stuff, and the only thing it really cares about is the "new increment".
>
> So the BIO client would need to keep some state somewhere about "this is
> the next page to unlock". That would best be in the BIO somewhere.

Well, the information actually is there already, although I have to admit
that it's kind of subtle.

Look at the current mpage_end_io_read(), and realize that it already does
a traversal in pages from

start = bio->bi_io_vec

end = bio->bi_io_vec + bio->bi_vcnt - 1

which is actually very close to what it would do with partial results.

With partial results, it would need to do only a slightly different
traversal:

end = bio->bi_io_vec + bio->bi_vcnt*PAGE_SIZE - bio->bi_size

start = end - nr_sectors * 512

PAGE_ALIGN(start)
PAGE_ALIGN(end)

but it's otherwise the exact same code (doing all the edge calculations in
bytes, and then only traversing pages that have now been fully done and
weren't fully done last time).

It _looks_ like it literally needs just a few lines of changes.

So I would actually argue against adding new information: we _do_ actually
have the information already, and adding more "convenient" forms of it
adds more aliasing and coherency issues. The current form isn't _that_
complicated to use, and we might hide it behind a simple macro:

#define GET_PAGE_INDEXES(bio, start, end) \
... set start/end to point into the ...
... proper bio->bi_io_vec subarray ...

> Well we still will have the problem where reading 512 bytes from /dev/fd0
> does 64k of IO. That is most sweet for reading a bunch of 32k ext2 files
> from a hard drive, not so nice for reading fd0's partition table.

I do think we might make the read-ahead window configurable, and make slow
devices have slightly smaller windows.

On the other hand, I don't think the 64kB IO actually _hurts_ per se, as
long as it doesn't delay the stuff we care about.

Linus

2002-09-05 20:30:44

by Linus Torvalds

[permalink] [raw]
Subject: Re: One more bio for for floppy users in 2.5.33..


On Thu, 5 Sep 2002, Linus Torvalds wrote:
>
> With partial results, it would need to do only a slightly different
> traversal:
>
> end = bio->bi_io_vec + bio->bi_vcnt*PAGE_SIZE - bio->bi_size
>
> start = end - nr_sectors * 512
>
> PAGE_ALIGN(start)
> PAGE_ALIGN(end)
>
> but it's otherwise the exact same code (doing all the edge calculations in
> bytes, and then only traversing pages that have now been fully done and
> weren't fully done last time).
>
> It _looks_ like it literally needs just a few lines of changes.

Ok, so we now have two "few lines of code" changes. Who wants to actually
_do_ these? I'll do it if nobody else wants to, but I'd much rather see
somebody else _do_ want to do this and test it out and just send me a
tested patch ;)

Linus

2002-09-05 20:41:19

by Linus Torvalds

[permalink] [raw]
Subject: Re: One more bio for for floppy users in 2.5.33..


Btw, the update to do partial completion will need a few more fixes: right
now the different callers of "bio->bi_end_io(bio)" are not very careful
about updating the bio information, since no bi_end_io() function has had
any reason to care before.

That turns the 2-liner patch into slightly more, since for example the
failure cases in __make_request() need to make sure that they pass in the
right size/sector count. So the

....
end_io:
bio->bi_end_io(bio);
return 0;

would become something like

....
end_io:
bio->bi_size = 0;
bio->bi_end_io(bio, nr_sectors);
return 0;

if we had this interface.

To avoid those kinds of silly bugs and to avoid havind the bi_end_io()
function have to look up all the bio information, maybe the end_io calling
convention really should be

void bio_end_io(struct bio *bio,
unsigned int completed,
unsigned int left,
unsigned int uptodate);

and then a failure would just be

bio->bi_end_io(bio, nr_sectors, 0, 0);

and the end-io function would have all the information it needs to decide
how much has been done / is left undone directly in the arguments.

One final question would be whether we would want to make all of these
byte counts, for some future networked block device where we might be
getting the completions back in odd-sized chunks, for example? Right now
much of the bio code already is able to handle byte-sized stuff, and it
might be nice to not have to maintain byte counts in NBD if the bio layer
already does it anyway..

Linus

2002-09-06 06:42:19

by Helge Hafting

[permalink] [raw]
Subject: Re: One more bio for for floppy users in 2.5.33..

Linus Torvalds wrote:
[...]
> I do think we might make the read-ahead window configurable, and make slow
> devices have slightly smaller windows.
>
> On the other hand, I don't think the 64kB IO actually _hurts_ per se, as
> long as it doesn't delay the stuff we care about.

I can think of one case where large readahead hurts for floppy, even
with partial completion:

1. Grab a stack of floppies
2. Try mounting (or mount+ls) one after another,
in search of the right one.

You'll get the results on screen fast enough
(mount succeeded/failed or ls showed the right/wrong files)
but when it is the wrong floppy you have to wait for
several tracks to read before you may eject and try
the next one.

Sure, it is only reading so ejecting "by force" won't
hurt the fs but then you have to wait on io errors instead.

So I think a smaller readahead might make sense for floppies,
unless people don't do this sort of search anymore. I don't.

Helge Hafting

2002-09-06 06:54:07

by Linus Torvalds

[permalink] [raw]
Subject: Re: One more bio for for floppy users in 2.5.33..


On Fri, 6 Sep 2002, Helge Hafting wrote:
>
> I can think of one case where large readahead hurts for floppy, even
> with partial completion:
>
> 1. Grab a stack of floppies
> 2. Try mounting (or mount+ls) one after another,
> in search of the right one.

Note that the delay for motor on/off is _much_ larger than the actual
delay for seeking.

The seek itself is on the order of a few ms, with the head settle time
being in the tens (possibly even a few hundred) ms per track. So assuming
you end up reading 4 tracks or so due to readahead, that's still in the
range of about one second.

In contrast, the motor on/off time is something like 5 seconds if I
remember correctly. Of course, you can certainly eject the floppy while
the motor is still running, but I'd suggest against it.

> So I think a smaller readahead might make sense for floppies,
> unless people don't do this sort of search anymore. I don't.

I do agree that a 64kB readahead is likely to be excessive on a floppy,
I'm just saying that I doubt it will be all that noticeable in most cases.

The absolute worst case is when opening, reading a sector, and closing
again several times in succession, at which point right now we'll end up
serializing. But even at 64kB, that's going to be faster than most people
can change floppies if they actually want to even glance at what the
contents are.

The reason I want the first sectors to be returned early is that I thought
it was quite noticeable to do just a simple "ls" on the floppy when I
tested. Of course, that may be just me: it's literally been several years
since I really used floppies, and maybe they really always were that slow.
But I thought the root directory was on track zero, so it _should_ return
it first thing.

Oh, well. I don't seem to be the only one who doesn't use the dang things
any more. The floppy driver has been broken in 2.5.x for half a year or
whatever, and there weren't _that_ many people who ever even mentioned it.

Linus

2002-09-07 11:12:03

by Daniel Phillips

[permalink] [raw]
Subject: Re: One more bio for for floppy users in 2.5.33..

On Thursday 05 September 2002 20:42, Andrew Morton wrote:
> The `nr_of_sectors_completed' would be tricky to handle - we don't know how
> to bring the pagecache pages uptodate in response to a 512-byte completion.
> We'd have to teach the pagecache read functions to permit userspace to read
> from non-uptodate pages by looking at the buffer_head state. And then
> look at buffer_req to distinguish "this part of the page hasn't been read yet"
> from "this part of the page had an IO error". Ick.

It's yet another reason to have subpages, and see, keeping state
at the right granularity really does matter.

I'm just adding items to the list of reasons why we need it, so
when the time comes to do it, I won't have to waste a lot of energy
explaining why.

--
Daniel

2002-09-07 14:39:14

by George Spelvin

[permalink] [raw]
Subject: Re: One more bio for for floppy users in 2.5.33..

On Fri, 6 Sep 2002, Linus Torvalds wrote:
> Note that the delay for motor on/off is _much_ larger than the actual
> delay for seeking.
>
> The seek itself is on the order of a few ms, with the head settle time
> being in the tens (possibly even a few hundred) ms per track. So assuming
> you end up reading 4 tracks or so due to readahead, that's still in the
> range of about one second.
>
> In contrast, the motor on/off time is something like 5 seconds if I
> remember correctly. Of course, you can certainly eject the floppy while
> the motor is still running, but I'd suggest against it.

You're forgetting the transfer rate. A 1440K floppy has 160
tracks (80 cylinders * 2 heads), or 9K per track.

It spins at 300 RPM, so it takes at least 200 ms to read that track.
45K/sec.

A 64K read spans 7.11 tracks, which will take 1422 ms to read.
Add 100 ms for initial rotational latency, and assume that subsequent
tracks are optimally arranged for continuous reads.

That's 1.5 secods just to transfer the data.

*Then* you can add all of the seek and motor spin-up/down times
mentioned above.

(Of course, if the floppy *isn't* formatted optimally, add an
extra 100 ms per seek, or 700 ms total, of rotatinal latency.)

2002-09-09 14:04:15

by Bob Tracy

[permalink] [raw]
Subject: Re: One more bio for for floppy users in 2.5.33..

Linus Torvalds wrote:
> Oh, well. I don't seem to be the only one who doesn't use the dang things
> any more. The floppy driver has been broken in 2.5.x for half a year or
> whatever, and there weren't _that_ many people who ever even mentioned it.

Oh, we're out here alright... I didn't speak up (until now) for several
reasons:

(a) development kernel == things break.
(b) when things break, people squawk -- I hardly expected that I'd need
to add my voice to the "me too" chorus (bad form, don't you know?).
(c) broken things tend to get fixed quickly, which is another way of
saying I have faith in the developers' abilities :-).

Anyway, floppy support is still reasonably important to me. I've got a
few legacy systems that cannot boot from from CD-ROM, and a 3.5" floppy
is still handy for small sneaker-net transfers between non-networked
machines.

Thanks for reading...

--
-----------------------------------------------------------------------
Bob Tracy WTO + WIPO = DMCA? http://www.anti-dmca.org
[email protected]
-----------------------------------------------------------------------

2002-09-10 07:21:49

by Rogier Wolff

[permalink] [raw]
Subject: Re: One more bio for for floppy users in 2.5.33..

On Thu, Sep 05, 2002 at 12:03:30PM -0700, Linus Torvalds wrote:
>
> On Thu, 5 Sep 2002, Andrew Morton wrote:
> >
> > It would be simpler if it was nr_of_pages_completed.
>
> Well.. Maybe.

Ehmm. I'm in the data-recovery business, and we seem to have lost
the ability to recover the other 3k of a 4k page if one of the blocks
is bad.

And we're annoyed about the read-ahead trying to read blocks past
a bad block without returning to the application.

Roger.

--
** [email protected] ** http://www.BitWizard.nl/ ** +31-15-2137555 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
* The Worlds Ecosystem is a stable system. Stable systems may experience *
* excursions from the stable situation. We are currenly in such an *
* excursion: The stable situation does not include humans. ***************

2002-09-10 07:41:51

by Andrew Morton

[permalink] [raw]
Subject: Re: One more bio for for floppy users in 2.5.33..

Rogier Wolff wrote:
>
> ...
>
> Ehmm. I'm in the data-recovery business, and we seem to have lost
> the ability to recover the other 3k of a 4k page if one of the blocks
> is bad.
>
> And we're annoyed about the read-ahead trying to read blocks past
> a bad block without returning to the application.
>

You can use the raw driver, or O_DIRECT against /dev/hdXX. That
will give 512-byte granularity and no readahead.