Hi,
Since kernel 2.4.10, I observe a rather poor performance of the good old
floppy drive. It used to work find in kernel 2.4.9. Kernel 2.4.10 broke
it, and 2.4.12 did not fix it.? I access my floppy using mtools-3.9.7-4 as
found in RH 7.1.
Not knowing anything about the kernel internals, I would say that the
floppy driver does not do any caching anymore. With kernel 2.4.9, "mdir"
of a standard 1.44 MB, completely empty floppy disk takes about 1 second
for the first invocation, and next to nothing for the subsequent ones.
>From 2.4.10 on, it takes over 2 seconds every time. Writing a 30 KB file
with 2.4.9 takes below 3 seconds, with >=2.4.10 it's over 6 seconds.
Maybe I should just keep quiet and be happy that it works at all, after
all floppies are rarely used nowadays, but letting such a performance
degradation go unreported is just beyond me :-).
Regards,
--
Kamil Iskra http://www.science.uva.nl/~kamil/
Section Computational Science, Faculty of Science, Universiteit van Amsterdam
[email protected] tel. +31 20 525 75 35 fax. +31 20 525 74 90
Kruislaan 403 room F.202 1098 SJ Amsterdam The Netherlands
Negative here, I am using 2.4.12-ac2 here and did not
notice what you said, the speed of transfer is still
about 23KB/sec when copying a 1.1Mb file from floppy ;
it is the same as before, even better :-)
> Please read the FAQ at http://www.tux.org/lkml/
=====
S.KIEU
http://briefcase.yahoo.com.au - Yahoo! Briefcase
- Manage your files online.
On Thu, 18 Oct 2001, [iso-8859-1] Steve Kieu wrote:
> Negative here, I am using 2.4.12-ac2 here and did not
> notice what you said,
Well, perhaps the ac series does not suffer from this problem? As I
stated in the original mail, I'm using 2.4.12 (from Linus).
> the speed of transfer is still
> about 23KB/sec when copying a 1.1Mb file from floppy ;
> it is the same as before, even better :-)
I just measured it for a file of that size, and I even got 29KB/sec.
However, it's performance for small files, directory listing operations
etc. that I'm complaining about. And not for a mounted floppy (which
seems to be fine), but when using mtools.
So, to reiterate, the conditions known to be necessary to reproduce it
are: kernel >=2.4.10 (perhaps only the Linus series), small files or
directory operations, mtools. The behaviour is as if no caching was done,
there is a slowdown by a factor of two. I have this problem both on my
laptop and on the desktop machine at work. They are running different
kernel versions (2.4.12 and 2.4.10), differently configured and compiled
by two different people. Kernel 2.4.9 and earlier worked fine.
Regards,
--
Kamil Iskra http://www.science.uva.nl/~kamil/
Section Computational Science, Faculty of Science, Universiteit van Amsterdam
[email protected] tel. +31 20 525 75 35 fax. +31 20 525 74 90
Kruislaan 403 room F.202 1098 SJ Amsterdam The Netherlands
On Oct 18, 2001 12:11 +0200, Kamil Iskra wrote:
> So, to reiterate, the conditions known to be necessary to reproduce it
> are: kernel >=2.4.10 (perhaps only the Linus series), small files or
> directory operations, mtools. The behaviour is as if no caching was done,
> there is a slowdown by a factor of two. I have this problem both on my
> laptop and on the desktop machine at work. They are running different
> kernel versions (2.4.12 and 2.4.10), differently configured and compiled
> by two different people. Kernel 2.4.9 and earlier worked fine.
I think this is a result of the "blockdev in pagecache" change added in
2.4.10. One of the byproducts of this change is that if a block device
is closed (no other openers) then all of the pages from this device are
dropped from the cache. In the case of a floppy drive, this is very
important, as you don't want to be cacheing data from one floppy after
you have inserted a new floppy.
In contrast, if you mounted the floppy instead of using mtools, it would
probably have good performance for small files as well.
Cheers, Andreas
--
Andreas Dilger \ "If a man ate a pound of pasta and a pound of antipasto,
\ would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/ -- Dogbert
On Thu, 18 Oct 2001, Andreas Dilger wrote:
> > The behaviour is as if no caching was done,
> > there is a slowdown by a factor of two.
> I think this is a result of the "blockdev in pagecache" change added in
> 2.4.10. One of the byproducts of this change is that if a block device
> is closed (no other openers) then all of the pages from this device are
> dropped from the cache. In the case of a floppy drive, this is very
> important, as you don't want to be cacheing data from one floppy after
> you have inserted a new floppy.
>
> In contrast, if you mounted the floppy instead of using mtools, it would
> probably have good performance for small files as well.
That's very interesting. It would explain why it takes 2 seconds _every_
time you invoke "mdir", whereas before the invocations after the first one
were more or less instantenous. And indeed, as you say, mounting a floppy
does result in a good performance.
However, it does not explain why the first invocation is two times slower
(it's 1 sec with kernel 2.4.9 and 2 secs with 2.4.10, the effect is even
more visible for mcopy of a small file, like 30KB). I strace'd mdir and
it's opening /dev/fd0 just once, at the beginning, and closing it at the
end.
Regards,
--
Kamil Iskra http://www.science.uva.nl/~kamil/
Section Computational Science, Faculty of Science, Universiteit van Amsterdam
[email protected] tel. +31 20 525 75 35 fax. +31 20 525 74 90
Kruislaan 403 room F.202 1098 SJ Amsterdam The Netherlands
On Thu, Oct 18, 2001 at 05:42:44PM +0200, you [Kamil Iskra] claimed:
> On Thu, 18 Oct 2001, Andreas Dilger wrote:
>
> > > The behaviour is as if no caching was done,
> > > there is a slowdown by a factor of two.
> > I think this is a result of the "blockdev in pagecache" change added in
> > 2.4.10. One of the byproducts of this change is that if a block device
> > is closed (no other openers) then all of the pages from this device are
> > dropped from the cache. In the case of a floppy drive, this is very
> > important, as you don't want to be cacheing data from one floppy after
> > you have inserted a new floppy.
> >
> > In contrast, if you mounted the floppy instead of using mtools, it would
> > probably have good performance for small files as well.
>
> That's very interesting. It would explain why it takes 2 seconds _every_
> time you invoke "mdir", whereas before the invocations after the first one
> were more or less instantenous. And indeed, as you say, mounting a floppy
> does result in a good performance.
>
> However, it does not explain why the first invocation is two times slower
> (it's 1 sec with kernel 2.4.9 and 2 secs with 2.4.10, the effect is even
> more visible for mcopy of a small file, like 30KB). I strace'd mdir and
> it's opening /dev/fd0 just once, at the beginning, and closing it at the
> end.
That's propably beacause it syncs the writes on close().
Perhaps you could try the trick Linus suggested in another thread, namely:
sleep 1000 < /dev/fd0 &
mdir
mcopy
mdir
mcopy
<do whatever>
kill %1
That keeps one (dummy) reference to the floppy device open until you're done
using it.
-- v --
[email protected]
On Thu, 18 Oct 2001, Andreas Dilger wrote:
> I think this is a result of the "blockdev in pagecache" change added in
> 2.4.10. One of the byproducts of this change is that if a block device
> is closed (no other openers) then all of the pages from this device are
> dropped from the cache. In the case of a floppy drive, this is very
> important, as you don't want to be cacheing data from one floppy after
> you have inserted a new floppy.
RTFS. That _exactly_ matches the behaviour of buffer-cache variant.
On Thursday 18 October 2001 11:17, Ville Herva wrote:
(snip)
> That's propably beacause it syncs the writes on close().
>
> Perhaps you could try the trick Linus suggested in another thread, namely:
>
> sleep 1000 < /dev/fd0 &
>
> mdir
> mcopy
> mdir
> mcopy
> <do whatever>
>
> kill %1
>
> That keeps one (dummy) reference to the floppy device open until you're
> done using it.
Perhaps there should be a pair of "mtools" added: mopen and mclose, that do
basically this. That way it could be a "standard" item, documented in man
pages, etc., not some secret that only the l-k users know. Thoughts?
-Nick
On Thu, Oct 18, 2001 at 12:18:12PM -0400, Alexander Viro wrote:
>
>
> On Thu, 18 Oct 2001, Andreas Dilger wrote:
>
> > I think this is a result of the "blockdev in pagecache" change added in
> > 2.4.10. One of the byproducts of this change is that if a block device
> > is closed (no other openers) then all of the pages from this device are
> > dropped from the cache. In the case of a floppy drive, this is very
> > important, as you don't want to be cacheing data from one floppy after
> > you have inserted a new floppy.
>
>
> RTFS. That _exactly_ matches the behaviour of buffer-cache variant.
Indeed, only 2.2 trusted the check media change information and left the
cache valid on top of the floppy across close/open of the blkdev.
Maybe it's because a read now fills in the whole page rather than 1k.
So if you write 1k aligned, and you read it back, the read will it the
disk and it will have to read another 3k before it can return such 1k
info that was just in memory. It's exactly the same on the files on a 1k
ext2 filesystem.
Alexander just proposed some month ago to support the partial reads in
the pagecache like we do for partial writes: by not marking the page
uptodate for partial reads and by going down to the buffer layer every
time to check if the partial memory we're interested about is just
uptodate or not (and marking the page uptodate eventually if we end
reading the whole page). So it's nothing unfixable, we just need to
understand if that is the problem and/or if mtools is doing something
stupid that sorted out to work ok when partial reads were supported.
Andrea
In article <[email protected]> [email protected] wrote:
| Perhaps there should be a pair of "mtools" added: mopen and mclose, that do
| basically this. That way it could be a "standard" item, documented in man
| pages, etc., not some secret that only the l-k users know. Thoughts?
At the risk of seeming ungenerous, mtools was a hack, to compensate
for the inability of some operating systems to handle the DOS
filesystem. While it's useful, I don't thing blindingly fast performance
is a requirement.
That said, I have a few other thoughts. First, can't the kernel
detect when a new floppy is inserted? I can't remember if there is an
interupt generated when the floppy seats or not.
And second, can't you just avoid the whole issue by keeping the floppy
accessed at all time while you use it? Something like:
sleep 3600 </dev/fd0 &
or some such to lock the pages after they are read?
The change is a good one, it avoids having stale data in memory which
might cause worse things than slow performance. I think I have been
bitten by that in the past, but I don't remember any details, so it may
have been another problem.
--
bill davidsen <[email protected]>
His first management concern is not solving the problem, but covering
his ass. If he lived in the middle ages he'd wear his codpiece backward.
In article <[email protected]> [email protected] wrote:
| On Thursday 18 October 2001 11:17, Ville Herva wrote:
|
| (snip)
| > That's propably beacause it syncs the writes on close().
| >
| > Perhaps you could try the trick Linus suggested in another thread, namely:
| >
| > sleep 1000 < /dev/fd0 &
| >
| > mdir
| > mcopy
| > mdir
| > mcopy
| > <do whatever>
| >
| > kill %1
| >
| > That keeps one (dummy) reference to the floppy device open until you're
| > done using it.
|
| Perhaps there should be a pair of "mtools" added: mopen and mclose, that do
| basically this. That way it could be a "standard" item, documented in man
| pages, etc., not some secret that only the l-k users know. Thoughts?
The change prevents use of stale data, and is a good one. mtools was a
hack from the days when some operating systems didn't speak DOS file
format, and reliability is more important than performance.
That said, can't you just keep the drive accessed and avoid the
problem?
sleep 3600 </dev/fd0 &
should do it. Just kill the process when you eject the floppy, or you
will get back the stale data problem.
I don't believe the detection of a new floppy in the drive is reliable
on all systems or they would have flushed on loading a new diskette. I
seem to remember that not all drives provide the signal, at least back
when I wrote my last floppy driver (DEC Rainbow, about 20 years ago).
--
bill davidsen <[email protected]>
His first management concern is not solving the problem, but covering
his ass. If he lived in the middle ages he'd wear his codpiece backward.
On Thu, 18 Oct 2001, bill davidsen wrote:
> The change prevents use of stale data, and is a good one. mtools was a
Folks, could you please read the fucking source before discussing the
change that was not?
We had been flushing the cache upon final close() for quite a while; recent
changes come from something else and figuring out WTF had happened in 2.4.12
would be a Good Thing(tm).
On Thursday 18 October 2001 14:57, bill davidsen wrote:
> In article <[email protected]>
[email protected] wrote:
> | Perhaps there should be a pair of "mtools" added: mopen and mclose, that
> | do basically this. That way it could be a "standard" item, documented in
> | man pages, etc., not some secret that only the l-k users know. Thoughts?
>
> At the risk of seeming ungenerous, mtools was a hack, to compensate
> for the inability of some operating systems to handle the DOS
> filesystem. While it's useful, I don't thing blindingly fast performance
> is a requirement.
Yeah, but they're still handy from time to time. It's certainly quicker to
pop a disk in and type "mdir", than to put it in, type "mount /floppy; ls
/floppy; umount /floppy" or similar. You get the picture. Not that you
don't have a valid point, too.
> That said, I have a few other thoughts. First, can't the kernel
> detect when a new floppy is inserted? I can't remember if there is an
> interupt generated when the floppy seats or not.
If I'm not mistaken, and I think has already been mentioned somewhere before
in this thread, these interrupts where determined to be unreliable under some
circumstances. Anybody remember the details?
> And second, can't you just avoid the whole issue by keeping the floppy
> accessed at all time while you use it? Something like:
> sleep 3600 </dev/fd0 &
> or some such to lock the pages after they are read?
That's basically what I'd propose "mopen" do... Hold it open 'til the user
does a "mclose". Of course, if you yank the disk out in the middle, all bets
are off.
-Nick
> Indeed, only 2.2 trusted the check media change information and left the
> cache valid on top of the floppy across close/open of the blkdev.
Which is not a bad thing IMHO, but it can cause problems with
some broken SCSI implementation where the drive doesn't send
UNIT_ATTENTION after a media change (like my MO drive when I
misconfigured the jumpers, damn :-((( ).
Bye.
On Fri, Oct 19, 2001 at 09:50:06AM +0200, Giuliano Pochini wrote:
>
> > Indeed, only 2.2 trusted the check media change information and left the
> > cache valid on top of the floppy across close/open of the blkdev.
>
> Which is not a bad thing IMHO, but it can cause problems with
> some broken SCSI implementation where the drive doesn't send
> UNIT_ATTENTION after a media change (like my MO drive when I
> misconfigured the jumpers, damn :-((( ).
Yes, I was aware of that. We'd need a kind of "media change enabler"
bitflag in each lowlevel driver, to implement a blacklist (or whitelist
if you feel safer) that will tell us if to trust the media change info
or not.
Ciao,
Andrea
In article <[email protected]>,
Giuliano Pochini <[email protected]> wrote:
>
>> Indeed, only 2.2 trusted the check media change information and left the
>> cache valid on top of the floppy across close/open of the blkdev.
>
>Which is not a bad thing IMHO, but it can cause problems with
>some broken SCSI implementation where the drive doesn't send
>UNIT_ATTENTION after a media change (like my MO drive when I
>misconfigured the jumpers, damn :-((( ).
Well, the original reason to not trust the media-change signal is that
some floppy drives simply do not implement the signal at all. Don't ask
me why. So a loong time ago Linux had the problem that when you changed
floppies you wouldn't see the new information - or you'd see _partially_
new and old information depending on what your access patterns were and
what the caches contained.
So it's pretty much across the board - broken SCSI, broken floppies,
just about any changeable media tends to have _some_ bad cases. And with
the floppy case, there was no way to notice at run-time whether the unit
was broken or not - the floppy drives have no ID's to blacklist etc. So
either you tell people to flush their caches by hand (which we did), or
you just always flush it between separate opens (which we later did).
Linus
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
> And with
> the floppy case, there was no way to notice at run-time whether the
> unit was broken or not - the floppy drives have no ID's to blacklist
> etc.
The standard trick is to start with media-change not supported, and
enable it if you get the first change signal.
There are really old floppies that don't support media-change signals,
and they _never_ send it. If you see a media-change signal, then you
know that the floppy is not broken.
Probably a timer (2 seconds) and a delayed cache flush should fix the
problem.
If the device supports media-change and media-lock, then it could
increase the timeout value.
--
Manfred
On Friday 19 October 2001 11:57, Linus Torvalds wrote:
> Well, the original reason to not trust the media-change signal is that
> some floppy drives simply do not implement the signal at all. Don't ask
> me why. So a loong time ago Linux had the problem that when you changed
> floppies you wouldn't see the new information - or you'd see _partially_
> new and old information depending on what your access patterns were and
> what the caches contained.
>
> So it's pretty much across the board - broken SCSI, broken floppies,
> just about any changeable media tends to have _some_ bad cases. And with
> the floppy case, there was no way to notice at run-time whether the unit
> was broken or not - the floppy drives have no ID's to blacklist etc. So
> either you tell people to flush their caches by hand (which we did), or
> you just always flush it between separate opens (which we later did).
>
> Linus
The original dos case was timeout based. They sat down and changed the disk
as fast as they could, and worked out it took something like two and a half
seconds to swap disks. So if subsequent accesses were within two and a half
seconds and got valid data on the first attempt, they decided it had to be
the same disk...
These days with the button it's probably more like a second and a half, but
the principle's the same.
Also, enough drives do it right (the vast majority), that a
"broken_disk_change" module/boot option seems more sensible as a non-default
thing for those that really are hosed...
Rob
Sorry for joining this discussion so late, but I only check
linux-kernel only about once or twice a week.
Rather than responding to each message individually, here's a small
summary about my take on the issues:
Andreas Dilger writes:
>In the case of a floppy drive, this is very important, as you don't
>want to be cacheing data from one floppy after you have inserted a new
>floppy.
Actually, the floppy drive is able to detect disk changes just fine,
and since 1.0's the disk change signal (DCL) has been used to trigger
cache flushing.
Ville Herva wrote:
>That's propably beacause it syncs the writes on close().
This is done for the benefit of those OS'es that don't automatically
sync() removeable media on last close. On Linux this is not necessary
(or at least, it wasn't when I last looked at the code, but that's a
long while ago...), but it is not harmful either: even though the sync
would hold up mtools for a while, the close would go through much
faster, as the work already will have done. And besides, as far as I
understood, Kamil was speaking about read performance, where the sync
would be a non-issue.
Nick LeRoy wrote:
>Perhaps there should be a pair of "mtools" added: mopen and mclose, that do
>basically this. That way it could be a "standard" item, documented in man
>pages, etc., not some secret that only the l-k users know. Thoughts?
A nice workaround. However, personnally I'd rather have this fixed in
the kernel, rather than having to resort to those kinds of
workarounds. Moreover, this is potentially dangerous: it could
actually mask a real disk change, if the user forgets the mclose. The
same holds for other techniques, such as sleep 3600 </dev/fd0 & Btw,
it works now (actual disk change detected, even while the sleep runs),
but who will guarantee it still will work in 2.6 ?
Nevertheless, I'm thinking about introducing this to mtools as a
stop-gap measure, but with ample warnings in the manpage...
Bill Davidsen wrote:
>That said, I have a few other thoughts. First, can't the kernel
>detect when a new floppy is inserted?
Yes, the kernel (floppy driver) can do this, and indeed it does.
>I can't remember if there is an
>interupt generated when the floppy seats or not.
Actually, it is not really an interrupt, but a bit that is set in the
FD_DIR register. It stays set until the floppy disk driver
acknowledges it by seeking the drive, or by selecting/unselecting it.
Before reading from the disk (or whenever it needs to know whether a
disk has been changed or not), the floppy driver reads this bit, and
if set notifies the VFS of the disk change. It then proceeds to seek,
in order to clear this flag (needed in order to detect further
changes)
>I
>seem to remember that not all drives provide the signal, at least back
>when I wrote my last floppy driver (DEC Rainbow, about 20 years
ago).
Yes, very old drives have problems supplying the signal. However, most
_recent_ cases of broken disk change line are due to ... badly
inserted cabled (d'oh). The problem is that the DCL is near the edge
(pin 34, according to http://www.moria.de/~michael/floppy/floppy.ps),
and if the cable is not inserted straight, the other wires may still
make contact, but not the DCL. Hence the drive "appears" to work at
first glance, except for the problem with the cached stale
data. Actually, in the last couple of years, I got less than a dozen
bug reports about a broken disk change line, and those could all be
traced to a badly inserted cable...
A compounding problem is that there are different ways to clear the
DCL signal. Many drives need a seek, but for some drives you can clear
the signal just by selecting/unselecting the drive, which makes it far
too easy to accidentally clear it. This was causing many problems in
the 1.0 days, but has been solved since long ago (by being extra
careful to check the DCL before unselecting a drive. This is also the
major reason for the convolutedness of that part of the code).
Alexander Viro wrote:
>We had been flushing the cache upon final close() for quite a while; recent
>changes come from something else and figuring out WTF had happened in 2.4.12
>would be a Good Thing(tm).
Andrea Arcangeli wrote:
>Indeed, only 2.2 trusted the check media change information and left the
>cache valid on top of the floppy across close/open of the blkdev.
You are right: indeed, 2.4.4 seems to have the same problem. If you do
mcopy a:somefile . in a row, the second one is as slow as the first
one!
Only mdir a: seems to behave differently between 2.4.4 (on my laptop)
and 2.4.12 (on my desktop): in 2.4.4 the second one is instantaneous,
whereas in 2.4.12 it is slow. Possible explanation: mdir reads so few
data that it fits in one track. And the floppy driver has an own
internal "track cache", which caches one track worth of data (needed
for some low-level optimizations, just like the on-board cache of hard
disks). Obviously, the floppy driver itself does not ignore the fact
that the disk has not been changed when deciding whether to flush its
own cache or not. Apparently, it's this particular feature which has
been broken in 2.4.10. I'll have to investigate further to find out
why exactly there is this difference. (I'll probably get around it
next weekend unfortunately, I'm somewhat busy right now).
>if mtools is doing something
>stupid that sorted out to work ok when partial reads were supported.
The problem can be easily demonstrated with doing 'cat /dev/fd0 >x'
twice in a row to simulate an mcopy. On both kernels that I tried, the
second one is as slow the first. So I don't think it's related to
mtools per se.
In order to simulate an mdir:
dd if=/dev/fd0 bs=1k count=18 of=/dev/null
Here, the second one is instantaneous on 2.4.4, but slow on
2.4.12. Changing the 18 into 19 makes it slow on both kernels (because
that's now two tracks, rather than one).
However, for the following one, it's fast the second time on both
kernels:
dd if=/dev/fd0 bs=1k count=2 of=/dev/null
Changing 2 into 3 makes it slow on 2.4.12
Maybe, it's some kind of read-ahead issue, that the kernel decides to
internally read much more sectors than fit in one track, which would
"push" the sectors of the first cylinder out of the floppy driver's
internal cache?
Andrea Arcangeli wrote:
>Yes, I was aware of that. We'd need a kind of "media change enabler"
>bitflag in each lowlevel driver, to implement a blacklist (or whitelist
>if you feel safer) that will tell us if to trust the media change info
>or not.
Exactly!
Linus Torvalds wrote:
>And with
>the floppy case, there was no way to notice at run-time whether the unit
>was broken or not - the floppy drives have no ID's to blacklist etc.
True. However, we could use an ioctl to instruct the floppy driver,
(which would then instruct the VFS layer) whether it can trust the
disk change line or not.
That way, users with known broken drives can pluck that into their
/etc/rc.d/boot.local
Actually, we do have such a setting right now, it's called BROKEN_DCL
Just to floppycontrol --broken_dcl /dev/fd0 to set it.
Its effect can be easily verified: after doing this, mdir a: is slow
on both kernels...
You can also set it on the boot (LILO) command line: append="floppy=broken_dcl"
The only thing is, right now this is handled locally in the floppy
driver, and not communicated to the VFS. What we would need is a way
to tell the VFS "yes, you _can_ trust check_media_change". Or is there
already such a way?
Rob Landley wrote:
>Also, enough drives do it right (the vast majority), that a
>"broken_disk_change" module/boot option seems more sensible as a non-default
>thing for those that really are hosed...
Surprise! There is such a setting (see above): broken_dcl and it works
as well for module (insmod floppy.o floppy=broken_dcl), as for boot
(see above), as for iotctl.
Only problem: although the floppy driver takes it into account, the
VFS layer seems to happily ignore it...
This happens in the following function, which is called
unconditionnally, whenever the numbers of "openers" (processes having
an open filehandle to the device) drops to zero:
(blkdev_put)
if (kind == BDEV_FILE)
__block_fsync(bd_inode);
else if (kind == BDEV_FS)
fsync_no_super(rdev);
if (!--bdev->bd_openers)
kill_bdev(bdev);
/* Kill _all_ buffers, dirty or not.. */
static void kill_bdev(struct block_device *bdev)
{
invalidate_bdev(bdev, 1);
truncate_inode_pages(bdev->bd_inode->i_mapping, 0);
}
And in the comment for invalidate_bdev, we find the following comment:
NOTE: In the case where the user removed a removable-media-disk even if
there's still dirty data not synced on disk (due a bug in the device driver
or due an error of the user), by not destroying the dirty buffers we could
generate corruption also on the next media inserted, thus a parameter is
necessary to handle this case in the most safe way possible (trying
to not corrupt also the new disk inserted with the data belonging to
the old now corrupted disk). Also for the ramdisk the natural thing
to do in order to release the ramdisk memory is to destroy dirty buffers.
So, if the goal is only to catch user and/or driver error, maybe the
same thing could be achieved by _only_ flushing the dirty buffers (and
loudly complaining if it finds any: indeed, because of the call to
__block_fsync(bd_inode); there really shouldn't be any dirty buffers
at this point of time...
Manfred Spraul wrote:
>The standard trick is to start with media-change not supported, and
>enable it if you get the first change signal.
>There are really old floppies that don't support media-change signals,
>and they _never_ send it. If you see a media-change signal, then you
>know that the floppy is not broken.
Good idea! I'll think about implementing something like that.
Regards,
Alain
On Sun, Oct 21, 2001 at 01:36:09PM +0200, Alain Knaff wrote:
> Actually, the floppy drive is able to detect disk changes just fine,
> and since 1.0's the disk change signal (DCL) has been used to trigger
> cache flushing.
I also experienced this correct media change behaviour (even if it's a
long time that I don't use a floppy).
Anyways the check media change is definitely the _right_ fix for the
dvd player stalling because of a backwards seek generated by the fact we
discarded readahead, after waiting synchronoulsy for I/O completion in
truncate_inode_pages in blkdev_close. Turning down the readahead size
can only hide the problem and hurt all the other users (and xine itself
since doing less and large DMA transactions is beneficial for the dvd
too). So I believe in the long run we must implement a whitelist that
tells us when to trust the media change detection, and always trim the
cache during blkdev_open or during rmmod as I also suggested during
2.3.x when blkdev_close was changed to do the unconditional
invalidate_buffers. Current design is a bit simpler but also simply not
smart enough to do the right thing. In the meantime dvd players would
better keep a dummy reference open, and yes, I totally agree it should
be the kernel that has to be fixed and that the userspace workaround can
go away eventually.
BTW, in the xine player stalls thread, Linus said raw (not rawio)
device, but somebody understood rawio, he meant /dev/hdc of course, not
/dev/raw1 with rawio, rawio cannot suffer from the stall generated by
backwards seek because we dropped the readahead. rawio doesn't use cache
or readahead at all. Infact it's interesting to know if rawio suffers
from such stall problem too, if this is the case than the stall have
nothing to do with readadhead or truncate_inode_pages, and it has to be
related to something else (I'd point my finger to the vfs in such a case
because there's not much else involved in the close/open cycle if only
using rawio).
Andrea
On Mon, 22 Oct 2001, Andrea Arcangeli wrote:
> too). So I believe in the long run we must implement a whitelist that
> tells us when to trust the media change detection, and always trim the
> cache during blkdev_open or during rmmod as I also suggested during
> 2.3.x when blkdev_close was changed to do the unconditional
I think that I have a correct fix for that, but I'll need to sort
some devfs-related unpleasantness first...
On Sunday 21 October 2001 06:36, Alain Knaff wrote:
(snip)
> Nick LeRoy wrote:
> >Perhaps there should be a pair of "mtools" added: mopen and mclose, that
> > do basically this. That way it could be a "standard" item, documented in
> > man pages, etc., not some secret that only the l-k users know. Thoughts?
>
> A nice workaround. However, personnally I'd rather have this fixed in
> the kernel, rather than having to resort to those kinds of
> workarounds. Moreover, this is potentially dangerous: it could
> actually mask a real disk change, if the user forgets the mclose. The
> same holds for other techniques, such as sleep 3600 </dev/fd0 & Btw,
> it works now (actual disk change detected, even while the sleep runs),
> but who will guarantee it still will work in 2.6 ?
>
> Nevertheless, I'm thinking about introducing this to mtools as a
> stop-gap measure, but with ample warnings in the manpage...
Absolutely... I *would not* advocate this as a real fix if we can, in
general, get the interrupt to work... I would advocate a *real* fix in the
kernel, and this using mopen/mclose for broken hardware & kernels that don't
handle it as gracefully.
-Nick
In article <[email protected]> [email protected] wrote:
| Sorry for joining this discussion so late, but I only check
| linux-kernel only about once or twice a week.
|
| Rather than responding to each message individually, here's a small
| summary about my take on the issues:
| Bill Davidsen wrote:
| >That said, I have a few other thoughts. First, can't the kernel
| >detect when a new floppy is inserted?
|
| Yes, the kernel (floppy driver) can do this, and indeed it does.
|
| >I can't remember if there is an
| >interupt generated when the floppy seats or not.
|
| Actually, it is not really an interrupt, but a bit that is set in the
| FD_DIR register. It stays set until the floppy disk driver
| acknowledges it by seeking the drive, or by selecting/unselecting it.
|
| Before reading from the disk (or whenever it needs to know whether a
| disk has been changed or not), the floppy driver reads this bit, and
| if set notifies the VFS of the disk change. It then proceeds to seek,
| in order to clear this flag (needed in order to detect further
| changes)
|
|
| >I
| >seem to remember that not all drives provide the signal, at least back
| >when I wrote my last floppy driver (DEC Rainbow, about 20 years
| ago).
|
| Yes, very old drives have problems supplying the signal. However, most
| _recent_ cases of broken disk change line are due to ... badly
| inserted cabled (d'oh). [ tech details ]
As was pointed out to me, old floppies don't generate the information
in all cases, nor do some drives for laptop use. But the suggestion was
made that once we get the status for media changed that we can (only)
then flag the status as trusted and not flush the cache on final close.
Unless you boot the system with the floppy inserted I would expect
inserting the first media to set this if the system is trustworthy.
Drives which don't always work, and cables which are so marginal that
they work or not depending on temperature or vibration could happily be
ignored IMHO, flakey hardware is not deserving of such care, as opposed
to hardware which misbehaves in a predictable way.
--
bill davidsen <[email protected]>
His first management concern is not solving the problem, but covering
his ass. If he lived in the middle ages he'd wear his codpiece backward.
Appended to this mail is a patch meant to fix the "non-cached floppy"
problem.
The block_device_operations structure now has a new member:
can_trust_media_change
If this returns 1, the VFS trusts the driver's check_media_change
function, and skips the call to kill_bdev in blkdev_put. If the driver
provides no such function, it defaults to "don't trust" (i.e. call
kill_bdev)
The floppy driver implements this new function in the following way:
1. if broken_dcl on a drive is set, can_trust_media_change always
returns false
2. if broken_dcl is not set, the function initially returns false,
until a disk change signal is seen. From then on, it always returns
true.
Other block devices could implement this by using black or whitelists,
or hardwiring a value.
HOWEVER, even after these changes, the caching problem still wasn't
solved (I confirmed with additional printk's that invalidate_bdev was
indeed no longer called).
Further investigation showed that the real culprit was actually
bdput. This function seems to de-allocate all structures attached to
the block device, including the cache... Ifdeffing it out did indeed
restore cache functionality. However, I understand that this would not
be a proper way to address the problem, as it would certainly cause a
memory leak.
Appended to this mail is the patch (without the debugging printk's,
and with bdput still intact); however in order to make it really work
bdput would need to be changed to not free up the cache, if that is
possible... Any ideas?
Regards,
Alain
diff -ur 2.4.13/linux/drivers/block/floppy.c linux/drivers/block/floppy.c
--- 2.4.13/linux/drivers/block/floppy.c Sat Oct 27 09:42:34 2001
+++ linux/drivers/block/floppy.c Sat Oct 27 16:17:33 2001
@@ -741,6 +741,7 @@
return UTESTF(FD_DISK_CHANGED);
if ((fd_inb(FD_DIR) ^ UDP->flags) & 0x80){
USETF(FD_VERIFY); /* verify write protection */
+ USETF(FD_DCL_SEEN); /* DCL for this drive has been seen */
if (UDRS->maxblock){
/* mark it changed */
USETF(FD_DISK_CHANGED);
@@ -3901,12 +3902,23 @@
return 0;
}
+/* determines whether we can trust media change
+ * answers true if we have seen disk change at least once, and not marked as
+ * broken */
+static int floppy_can_trust_media_change(kdev_t dev)
+{
+ int drive=DRIVE(dev);
+ return (UTESTF(FD_DCL_SEEN) &&
+ ((UDP->flags & FD_BROKEN_DCL) == 0));
+}
+
static struct block_device_operations floppy_fops = {
open: floppy_open,
release: floppy_release,
ioctl: fd_ioctl,
check_media_change: check_floppy_change,
revalidate: floppy_revalidate,
+ can_trust_media_change: floppy_can_trust_media_change
};
static void __init register_devfs_entries (int drive)
diff -ur 2.4.13/linux/fs/block_dev.c linux/fs/block_dev.c
--- 2.4.13/linux/fs/block_dev.c Sat Oct 27 09:43:27 2001
+++ linux/fs/block_dev.c Sat Oct 27 16:18:19 2001
@@ -601,8 +601,12 @@
__block_fsync(bd_inode);
else if (kind == BDEV_FS)
fsync_no_super(rdev);
- if (!--bdev->bd_openers)
- kill_bdev(bdev);
+ if (!--bdev->bd_openers) {
+ if(!bdev->bd_op->can_trust_media_change ||
+ !bdev->bd_op->can_trust_media_change(rdev)) {
+ kill_bdev(bdev);
+ }
+ }
if (bdev->bd_op->release)
ret = bdev->bd_op->release(bd_inode, NULL);
if (!bdev->bd_openers)
diff -ur 2.4.13/linux/include/linux/fd.h linux/include/linux/fd.h
--- 2.4.13/linux/include/linux/fd.h Fri Aug 13 21:16:16 1999
+++ linux/include/linux/fd.h Sat Oct 27 13:37:08 2001
@@ -177,7 +177,8 @@
* to clear media change status */
FD_UNUSED_BIT,
FD_DISK_CHANGED_BIT, /* disk has been changed since last i/o */
- FD_DISK_WRITABLE_BIT /* disk is writable */
+ FD_DISK_WRITABLE_BIT, /* disk is writable */
+ FD_DCL_SEEN_BIT /* have we seen DCL? */
};
#define FDSETDRVPRM _IOW(2, 0x90, struct floppy_drive_params)
@@ -196,6 +197,7 @@
#define FD_DISK_NEWCHANGE (1 << FD_DISK_NEWCHANGE_BIT)
#define FD_DISK_CHANGED (1 << FD_DISK_CHANGED_BIT)
#define FD_DISK_WRITABLE (1 << FD_DISK_WRITABLE_BIT)
+#define FD_DCL_SEEN (1 << FD_DCL_SEEN)
unsigned long spinup_date;
unsigned long select_date;
diff -ur 2.4.13/linux/include/linux/fs.h linux/include/linux/fs.h
--- 2.4.13/linux/include/linux/fs.h Sat Oct 27 11:53:28 2001
+++ linux/include/linux/fs.h Sat Oct 27 13:05:03 2001
@@ -793,6 +793,7 @@
int (*ioctl) (struct inode *, struct file *, unsigned, unsigned long);
int (*check_media_change) (kdev_t);
int (*revalidate) (kdev_t);
+ int (*can_trust_media_change) (kdev_t);
};
/*
On Sat, 27 Oct 2001, Alain Knaff wrote:
> Appended to this mail is a patch meant to fix the "non-cached floppy"
> problem.
a) you _still_ need to stop all pending IO (readaheads in progress)
before final ->release()
b) at which point do you flush the cache? It definitely shouldn't
survive rmmod. And no, unregister_blkdev() is not a solution, courtesy
of devfs with its insane devfs=only option.
There is a related problem which is much nastier than short-living caches:
code that does bdev->bd_op = <stuff from devfs>; blkdev_get(bdev, ...);
Think what happens if rmmod comes while blkdev_get() sleeps ob ->bd_sem.
Notice that it had been there since the moment when devfs went into the
tree. Sigh...
>
>
>On Sat, 27 Oct 2001, Alain Knaff wrote:
>
>> Appended to this mail is a patch meant to fix the "non-cached floppy"
>> problem.
>
>a) you _still_ need to stop all pending IO (readaheads in progress)
>before final ->release()
Ok. Didn't know about that. Maybe we should add a variant of
invalidate_bdev which stops IO in progress, but which wouldn't throw
away data that has already been read? This same variant could also
make sure that no dirty buffers were around, as a sanity check.
>b) at which point do you flush the cache? It definitely shouldn't
>survive rmmod.
In the case of the floppy driver, the following would happen: after
rmmod, followed by insmod, the floppy driver would assume the disk has
been changed an throw away the cache (by returning 1 from
check_media_change). This is actually a side-effect of the way how the
physical DCL status is cleared: after an insmod, the floppy driver has
no way of knowing whether DCL was set at some point in the past, but
then cleared by another driver (i.e. ftape), so it errs on the safe
side, and assumes the disk has been changed.
Other block device drivers may behave differently of course...
> And no, unregister_blkdev() is not a solution, courtesy
>of devfs with its insane devfs=only option.
Do you mean this:
int devfs_unregister_blkdev (unsigned int major, const char *name)
{
if (boot_options & OPTION_ONLY) return 0;
return unregister_blkdev (major, name);
} /* End Function devfs_unregister_blkdev */
It seems to me that the root of the problem here is that the normal
major->device mapping is no longer the only way to "find" a device.
Cursory examination of floppy.c (as an example of a block device
driver) showed that bdops are also registered using devfs_register and
register_disk (what's THAT for?!? Floppies don't have partitions...)
Apparently, devfs_register allows a direct mapping from the device's
name to its driver, without going through its major/minor number.
Thus, a possible solution would be to equip all possible paths leading
to the driver's block_device_operations with correct "teardown"
function. Thus, not only unregister_blkdev would dump the cache, but
also devfs_unregister (maybe near the place in unregister() where
de->u.fcb.ops = NULL is done?). Best make this call generic, such as
free_bd_ops(handle), so that other unregister actions could easily be
tucked on top of it later on, should the need arise.
All this begs of course the following question: what kind of
idnetifier does the buffer cache code actually use to refer to the
block device, if there is no longer a major?
Is a pointer to some kind of structure used as the identifier? If so,
what structure is this? The device driver's block_device_operations
structure? No, won't work, because there may be minors. Or is it
something which is allocated on the fly when the device is opened? If
it is indeed the latter, I think I understand why the cache is gone
after re-opening the device: maybe it just gets a different handle
(mapping)? The cache may still be there, but the VFS would no longer
"find" it because the identifier of the device has changed. Should we
maybe use a _combination_ of pointer to block_device_operations and
minor number as the identifier for the device? That should stay valid
all the time while the module is loaded.
>There is a related problem which is much nastier than short-living caches:
>code that does bdev->bd_op = <stuff from devfs>; blkdev_get(bdev, ...);
>Think what happens if rmmod comes while blkdev_get() sleeps ob ->bd_sem.
>Notice that it had been there since the moment when devfs went into the
>tree. Sigh...
I'm not sure I understand the full extent of the problem, but at a
quick glance, it seems to be that the only operation exposed would be
blkdev_open, as all other operations do need an open handle, which
would prevent rmmod because module usage count would not be zero (at
least, if the module is correctly implemented...).
We would thus need to protect bd_op between the time it has been
fetched via get_blkfops, and bd_op->open. After bd_op->open, we would
be safe because of module usage count.
We could either use bdev->bd_sem (awkward, as many drivers implement
multiple bdev's), or a new per-major device lock to protect that
section.
unregister_blkdev would need to acquire the same lock while zero-ing
blkdevs[major].bdops.
Right now, block_dev.c's do_open already holds the kernel lock, so
wouldn't it be enough to also lock_kernel() in unregister_blkdev ?
A symetric situation might occur if a bd_op was called after
bd_op->release. But as far as I could see, this is not the case. Even
my own patch added the call to bd_op->can_trust_media_change _before_
the bd_op->release.
Ok, so the above only describes the solution where a device is
accessed via its major. To transpose this to devfs' direct
name-to-device mappings, a similar lock would need to be held between
the time the bd_op pointer is established, and the time it is
opened. But this may not practical, because too many software
components would be involved in that critical section. So maybe
do_open could "revalidate" the bd_op pointer (i.e. somehow check
whether the block device driver has not yet been unregistered in the
meantime) just before using it, and have the critical section only
enclose the revalidate and bd_op->open operation.
Something like the following (do_open):
down(&bdev->bd_sem);
lock_kernel();
+ if(bdev->bd_op && !devfs_is_bdop_still_valid(inode,bdev)) {
+ unlock_kernel();
+ up(&bdev->bd_sem);
+ return -ENODEV;
+ }
if (!bdev->bd_op)
bdev->bd_op = get_blkfops(MAJOR(dev));
if (bdev->bd_op) {
ret = 0;
if (bdev->bd_op->open)
ret = bdev->bd_op->open(inode, file);
if (!ret) {
bdev->bd_openers++;
bdev->bd_inode->i_size = blkdev_size(dev);
bdev->bd_inode->i_blkbits = blksize_bits(block_size(dev));
} else if (!bdev->bd_openers)
bdev->bd_op = NULL;
}
unlock_kernel();
up(&bdev->bd_sem);
Regards,
Alain
On Sat, 27 Oct 2001, Alain Knaff wrote:
> Cursory examination of floppy.c (as an example of a block device
> driver) showed that bdops are also registered using devfs_register and
> register_disk (what's THAT for?!? Floppies don't have partitions...)
Actually, _that_ is Right Thing(tm) - it should allocate a structure that
would contain pointer to methods table and would be controlled by
driver. devfs_register() would get that + prefered name, etc. so that
we had a common object. Then driver would have a point where it could
tell the rest of kernel that disk is gone.
> Apparently, devfs_register allows a direct mapping from the device's
> name to its driver, without going through its major/minor number.
>
> Thus, a possible solution would be to equip all possible paths leading
> to the driver's block_device_operations with correct "teardown"
> function. Thus, not only unregister_blkdev would dump the cache, but
> also devfs_unregister (maybe near the place in unregister() where
> de->u.fcb.ops = NULL is done?). Best make this call generic, such as
No go. We can have situations where some of uses come from devfs and
some - from normal device nodes. struct block_device will be the same.
> All this begs of course the following question: what kind of
> idnetifier does the buffer cache code actually use to refer to the
> block device, if there is no longer a major?
Right now - major:minor, in 2.5 - struct block_device *.
> We could either use bdev->bd_sem (awkward, as many drivers implement
> multiple bdev's), or a new per-major device lock to protect that
> section.
I'd rather have refcount raised by get_blkfops(). Again, that code path
is not a problem. devfs_get_ops() is.
> unregister_blkdev would need to acquire the same lock while zero-ing
> blkdevs[major].bdops.
We could put bdev on per-major cyclic list and have it killed on
unregister_blkdev(). _That_ is easy. The trouble being, with devfs
we don't have a single removal point. Sometimes it's still
unregister_blkdev(), sometimes - crapload of devfs_unregister() for
each minor, sometimes - both. Worse yet, we have one more place that
holds pointer to block_device_operations - gendisk. Also used by
devfs (and nothing else) and logics is, to put it mildly, fuzzy.
Frankly, at that point I would prefer to remove the code in devfs that
tries to provide bdev methods by devfs entry. Rationale:
a) it's fucked up beyond any repair
b) it will be useless until we switch buffer cache to block_device *
c) we will need to change that logics anyway - as it is the thing is
inherently racy
d) right now it stands in the way of long-living cache stuff _and_
introduces an oopsable race between mount and rmmod.
>
>
>On Sat, 27 Oct 2001, Alain Knaff wrote:
>
>> Cursory examination of floppy.c (as an example of a block device
>> driver) showed that bdops are also registered using devfs_register and
>> register_disk (what's THAT for?!? Floppies don't have partitions...)
>
>Actually, _that_ is Right Thing(tm) - it should allocate a structure that
>would contain pointer to methods table and would be controlled by
>driver. devfs_register() would get that + prefered name, etc. so that
>we had a common object. Then driver would have a point where it could
>tell the rest of kernel that disk is gone.
Register_disk seems to be related to partitions... and is yet another
place where floppy_fops is handed out. And it doesn't seem to have a
corresponding unregister_disk function, so this worries me
somewhat. Who did that, and why didn't he contact me?
>> Apparently, devfs_register allows a direct mapping from the device's
>> name to its driver, without going through its major/minor number.
>>
>> Thus, a possible solution would be to equip all possible paths leading
>> to the driver's block_device_operations with correct "teardown"
>> function. Thus, not only unregister_blkdev would dump the cache, but
>> also devfs_unregister (maybe near the place in unregister() where
>> de->u.fcb.ops = NULL is done?). Best make this call generic, such as
>
>No go. We can have situations where some of uses come from devfs and
>some - from normal device nodes. struct block_device will be the same.
Ok. So maybe some kind of counter? When it drops to zero, dump the
cache?
>> All this begs of course the following question: what kind of
>> idnetifier does the buffer cache code actually use to refer to the
>> block device, if there is no longer a major?
>
>Right now - major:minor, in 2.5 - struct block_device *.
Good. But then, what's the point of devfs=only ? I assumed this was
intended for situations where we had a direct mapping from filename to
device.
Ok, so in 2.5 will be possible with struct block_device, and the
option will make sense.
So, in the interest of stability, shouldn't we (temporarily) disable
this devfs=only stuff in 2.4 ?
>> We could either use bdev->bd_sem (awkward, as many drivers implement
>> multiple bdev's), or a new per-major device lock to protect that
>> section.
>
>I'd rather have refcount raised by get_blkfops(). Again, that code path
>is not a problem. devfs_get_ops() is.
>
>> unregister_blkdev would need to acquire the same lock while zero-ing
>> blkdevs[major].bdops.
>
>We could put bdev on per-major cyclic list and have it killed on
>unregister_blkdev(). _That_ is easy. The trouble being, with devfs
>we don't have a single removal point. Sometimes it's still
>unregister_blkdev(), sometimes - crapload of devfs_unregister() for
>each minor, sometimes - both. Worse yet, we have one more place that
>holds pointer to block_device_operations - gendisk. Also used by
>devfs (and nothing else) and logics is, to put it mildly, fuzzy.
>
>Frankly, at that point I would prefer to remove the code in devfs that
>tries to provide bdev methods by devfs entry. Rationale:
> a) it's fucked up beyond any repair
> b) it will be useless until we switch buffer cache to block_device *
> c) we will need to change that logics anyway - as it is the thing is
>inherently racy
> d) right now it stands in the way of long-living cache stuff _and_
>introduces an oopsable race between mount and rmmod.
>
I agree. We could maybe just #ifdef those methods out, so that we
could easily add them back in 2.5 once struct block_device is in
place.
Regards,
Alain
On Sat, 27 Oct 2001, Alain Knaff wrote:
> Register_disk seems to be related to partitions... and is yet another
> place where floppy_fops is handed out. And it doesn't seem to have a
> corresponding unregister_disk function, so this worries me
> somewhat. Who did that, and why didn't he contact me?
In most of the cases (and that includes floppy.c) unregister_disk() would
be invoked by unregister_blkdev(). The only exception (back then, now
we probably have more) would be SCSI. That stuff got frozen in the middle
of a merge by devfs inclusion (2.3.46 or so). Right now register_disk()
is no-op for partitionless devices.
> Good. But then, what's the point of devfs=only ? I assumed this was
Ask Richard. Maybe you will be able to get a straight answer. I hadn't...
BTW, here's one more devfs rmmod race: check_disk_changed() in
fs/devfs/base.c. Calling ->check_media_change() with no protection
whatsoever. If rmmod happens at that point...
BTW, what the hell is that?
/*
* hwgraph_bdevsw_get - returns the fops of the given devfs entry.
*/
struct file_operations *
hwgraph_bdevsw_get(devfs_handle_t de)
{
return(devfs_get_ops(de));
}
It's arch/ia64/sn/io/hcl.c. The funny thing being, the thing
you will get from devfs_get_ops() will _not_ be struct file_operations *.
And that's aside of the fact that any use of that function is very
likely to be racy as hell. Sigh...
>> All this begs of course the following question: what kind of
>> idnetifier does the buffer cache code actually use to refer to the
>> block device, if there is no longer a major?
>
>Right now - major:minor, in 2.5 - struct block_device *.
Actually, are you sure about that?
In filemap.c in do_generic_file_read, you have this:
struct address_space *mapping = filp->f_dentry->d_inode->i_mapping;
...
/*
* Try to find the data in the page cache..
*/
hash = page_hash(mapping, index);
spin_lock(&pagecache_lock);
page = __find_page_nolock(mapping, index, *hash);
Looks as if pages are identified by (mapping, index) pairs, rather
than (major:minor, index) triplets.
And "mapping" itself seems to point to i_data of the device's inode
structure (not the device entry's inode, but the device's itself).
Which means that if the inode is put (all references to the block
device closed), and later the same major/minor is reopened, it may
very well be in a different place in memory... Meaning that the
formerly cached pages (if they are still there...) will no longer be
recognized as the same, as mapping is now a different address, even
though major/minor is the same.
The changing nature of mapping can be confirmed easily by inserting a
printk into the open routine of the device.
As long as at least one reference to the device is open, mapping stays
the same for further opens, but as soon as the last ref is closed, the
next open will return a different address...
For the following test, I've inserted the following debugging printk
in floppy_open:
printk("Mapping=%x data=%x\n", inode->i_mapping, &inode->i_data);
As you see, while the sleep is open, Mapping is always
c25a7a90. However, once the sleep is closed, Mapping changes to
c1ba0cd0 . This seems to explain it things.
> sleep 3600 </dev/fd0 &
[1] 1449
Mapping=c25a7a90 data=cb80dc50
VFS: Disk change detected on device fd(2,0)
> mdir a:
Mapping=c25a7a90 data=cb80dc50
Volume in drive A has no label
Volume Serial Number is 63F7-FED8
Directory for A:/
No files
1 457 664 bytes free
> mdir a:
Mapping=c25a7a90 data=cb80dc50
Volume in drive A has no label
Volume Serial Number is 63F7-FED8
Directory for A:/
No files
1 457 664 bytes free
> kill %1
[1] + 1449 terminated sleep 3600 < /dev/fd0
> mdir a:
Mapping=c1ba0cd0 data=cb80dc50
Volume in drive A has no label
Volume Serial Number is 63F7-FED8
Directory for A:/
No files
1 457 664 bytes free
Regards,
Alain
It just gets better and better...
in get_removable_partition()
if (strcmp (de->name, "disc") == 0) return check_disc_changed (de);
with
if ( name && (namelen < 1) ) namelen = strlen (name);
if ( ( new = kmalloc (sizeof *new + namelen, GFP_KERNEL) ) == NULL )
return NULL;
...
if (name) memcpy (new->name, name, namelen);
in create_entry().
IOW, ->name is not NUL-terminated.
On Sat, 27 Oct 2001, Alain Knaff wrote:
> And "mapping" itself seems to point to i_data of the device's inode
> structure (not the device entry's inode, but the device's itself).
> Which means that if the inode is put (all references to the block
> device closed), and later the same major/minor is reopened, it may
Stop here. bdev->bd_inode is destroyed only when bdev is destroyed.
If we make block_device long-living (i.e. they stay around until all
pages are evicted from cache _or_ device gets unregistered) ->bd_inode
will follow.
>
>
>On Sat, 27 Oct 2001, Alain Knaff wrote:
>
>> And "mapping" itself seems to point to i_data of the device's inode
>> structure (not the device entry's inode, but the device's itself).
>
>> Which means that if the inode is put (all references to the block
>> device closed), and later the same major/minor is reopened, it may
>
>Stop here. bdev->bd_inode is destroyed only when bdev is destroyed.
>If we make block_device long-living (i.e. they stay around until all
>pages are evicted from cache _or_ device gets unregistered) ->bd_inode
>will follow.
That would be ok with me. Long live block_device! ;-)
Alain
devfs_rmdir() checks that directory is empty. Then it calls
devfsd_notify_one(), which can block. Then it marks the entry
unregistered and reports success.
Guess what will happen if devfs_register() will happen at that
moment...
/me _seriously_ considers hostile takeover of the damn thing
I mean, when holes are found at that rate just by cursory look through
the code... And that stuff had been there for at least 2 years.
Richard, I hate to break it on you, but dealing with that crap is
generally considered a maintainer's job. And it is supposed to happen
slightly faster - 2 years would be bad even for MS. If you don't
have time for that work - say so and step down, nobody will blame
you for that. Repeating that everything will be fine RSN is getting
real old by now...
... and one more - devfs_unregister() on a directory happening when
mknod() in that directory sleeps in create_entry() (on kmalloc()).
Do you ever read your own code? It's not like this stuff was hard
to find - I'm just poking into random places and every damn one
contains a hole. Sigh...
Oh, BTW - here's another one: think what happens if tree-walking in
unregister() steps on the entry we are currently removing in
devfs_unlink().
On Sat, 27 Oct 2001, Alexander Viro wrote:
> devfs_rmdir() checks that directory is empty. Then it calls
> devfsd_notify_one(), which can block. Then it marks the entry
> unregistered and reports success.
>
> Guess what will happen if devfs_register() will happen at that
> moment...
Ugh... My apologies - race here is a bit different. Namely,
devfs_register() find a directory, starts creating a child,
blocks in kmalloc(), _then_ entire devfs_rmdir() happens and
devfs_register() merrily inserts a new child into dead directory.
On October 27, 2001 13:06, Alexander Viro wrote:
> Do you ever read your own code? It's not like this stuff was hard
> to find - I'm just poking into random places and every damn one
> contains a hole. Sigh...
It might be more productive to provide patches or at least generate
constructive ideas as how to fix these problems, as you are obviously quite
capable of doing so. Digging through the code and sending a new email to this
list for every few flaws you find is doing little good, and your personal
attacks on the maintainer are even less benefical. Cooperation will get you a
lot farther than conflict.
-Ryan
On Sat, 27 Oct 2001, Ryan Cumming wrote:
> It might be more productive to provide patches or at least generate
> constructive ideas as how to fix these problems, as you are obviously quite
> capable of doing so. Digging through the code and sending a new email to this
> list for every few flaws you find is doing little good, and your personal
> attacks on the maintainer are even less benefical. Cooperation will get you a
> lot farther than conflict.
Been there, tried that, had been told by Richard that he would rather fix
devfs bugs himself. Quite a few months ago. If you have better suggestions
they would be more than welcome.
As far as I can see, if maintainer doesn't fix the bugs himself and tells
that patches are not welcome there are only two things that can be done -
going into full-disclosure mode in hope that it will change the situation or
taking over the code in question.
By that point I'm sorely tempted to do the latter (i.e. full-blown fork,
maintained with no regard to Richard's preferences + sumbitting [very
massive] fixes directly to Linus), but I want to give a try to less
drastic approach first.
On Sat, 27 Oct 2001, Ryan Cumming wrote:
> It might be more productive to provide patches or at least generate
> constructive ideas as how to fix these problems, as you are obviously
> quite capable of doing so.
1) yes, Al Viro is very capable of sending in devfs fixes
and he has done so in the past (IIRC around 2 months ago)
2) Richard Gooch then told Al he'd just started working on
a patch to fix the problem and he'd rather fix things
himself ... as far as I can see this hasn't happened yet
regards,
Rik
--
DMCA, SSSCA, W3C? Who cares? http://thefreeworld.net/
http://www.surriel.com/ http://distro.conectiva.com/
Rik van Riel writes:
> On Sat, 27 Oct 2001, Ryan Cumming wrote:
>
> > It might be more productive to provide patches or at least generate
> > constructive ideas as how to fix these problems, as you are obviously
> > quite capable of doing so.
>
> 1) yes, Al Viro is very capable of sending in devfs fixes
> and he has done so in the past (IIRC around 2 months ago)
A truely horrible, busy-wait patch that was quickly superceeded by a
much cleaner patch that I wrote shortly thereafter. And was applied by
Linus in due course.
> 2) Richard Gooch then told Al he'd just started working on
> a patch to fix the problem and he'd rather fix things
> himself ... as far as I can see this hasn't happened yet
Complete fucking bullshit. Over the last several months, I've been
sending a steady stream of bugfix patches to Linus and the list, and
if you'd been paying attention, you'd notice that in time they've been
applied.
Furthermore, I've nearly finished the big rewrite of devfs which adds
proper locking and refcounting. That work was progressing nicely (but
it's a big job), although it's temporarily stalled because of some
important travel. Work on that will resume in the next couple of
weeks. There's no point sending in an incomplete version of the code.
It's beyond me why you state that there has been no progress by me
when my announcements of new devfs patches have been posted to the
list and even Linus' ChangeLog messages have shown stuff going in. If
you don't actually know what's going on, why do you bother posting on
this subject in the first place? How would you like it if I started
flaming about how long the VM code was taking to get working? Our VM
has sucked for *years*.
Regards,
Richard....
Permanent: [email protected]
Current: [email protected]
On Sun, 28 Oct 2001, Richard Gooch wrote:
> Complete fucking bullshit. Over the last several months, I've been
> sending a steady stream of bugfix patches to Linus and the list, and
> if you'd been paying attention, you'd notice that in time they've been
> applied.
OK, _now_ I'm really pissed off. As far as I can see there is only
one way to get you fix anything - posting to l-k. This "steady
stream" consists of what? Let's see:
0.118: buffer underrun in try_modload(). Source: some Al Viro had hit
the function in question in grep over tree and took a couple of minutes
to read it.
0.118: moving down_read() - yes, it had fixed the instance of deadlock
pointed to you by, damn, what a coincidence, same bastard. Come to think
of that, let me grep for down_read()... Aha.
static int devfs_follow_link (struct dentry *dentry, struct nameidata *nd)
{
int err;
struct devfs_entry *de;
de = get_devfs_entry_from_vfs_inode (dentry->d_inode, TRUE);
if (!de) return -ENODEV;
down_read (&symlink_rwsem);
err = de->registered ? vfs_follow_link (nd,
de->u.symlink.linkname) : -ENODEV;
up_read (&symlink_rwsem);
return err;
} /* End Function devfs_follow_link */
Umm... Hadn't we just been there? Recursive down_read(&symlink_rwsem)...
0.117: oh, wow - finally. devfs_link() is gone.
0.116: reverted previous broken patch, but result contained a deadlock instead
of race. Result of race scenario described on l-k by... damn, this asshole
again.
0.115: bogus fix for breakage introduced by blkdev-in-pagecache patch. Hadn't
got into Linus' tree, actually.
0.114: introduced broken refcounting for symlinks (see 0.116)
0.113: "quick and dirty hack" to protect symlink bodies. Broken, at that.
BTW, breakage in 0.113 and 0.114 hadn't stopped Mandrake from deciding that
it fixed readlink() race and shipping the thing. Funny, but race it was
supposed to fix had been described in private email several months before.
Then it was described on l-k. Then description had been forwarded to Mandrake
- after a question about potential breakage. _Then_ (and I assume that it
was a coincidence) said patches had appeared.
0.111, 0.112: not a fix by any stretch of imagination.
Oh, and before that we have a (finally, only after a year of mentioning
the crap in question, heavy-weight rant on l-k when I've finally ran out
of patience _and_ detailed discussion on the possible fixes) fix for
expand-entry-table races.
So far all I see is that beating you hard enough in public can make you
fix the bugs explicitly pointed to you. That's it. As far as I can
see you don't read your own code, judging by the fact that every damn
look at fs/devfs/base.c shows a new hole within a couple of minutes
_and_ said holes stay until posted on l-k. Private mail doesn't work.
You read it, reply and ignore. About hundred kilobytes of evidence
available at request.
At 15.50 27/10/01 -0400, Alexander Viro wrote:
>/me _seriously_ considers hostile takeover of the damn thing
>
>I mean, when holes are found at that rate just by cursory look through
>the code... And that stuff had been there for at least 2 years.
>Richard, I hate to break it on you, but dealing with that crap is
>generally considered a maintainer's job. And it is supposed to happen
>slightly faster - 2 years would be bad even for MS. If you don't
>have time for that work - say so and step down, nobody will blame
>you for that. Repeating that everything will be fine RSN is getting
>real old by now...
As matters stand I would suggest a radical approach:
What about to replace the current devfs by a clean and lightweight
ramfs based filesystem?
IMO we need a complete rewrite and I'm sure we can make it simpler
and cleaner. 2.5 stuff, but it's worthwhile.
Just my 0.02 Euro,
--
Lorenzo
Hi,
Richard Gooch wrote:
> Furthermore, I've nearly finished the big rewrite of devfs which adds
> proper locking and refcounting. That work was progressing nicely (but
> it's a big job), although it's temporarily stalled because of some
> important travel. Work on that will resume in the next couple of
> weeks. There's no point sending in an incomplete version of the code.
What about putting them somewhere in a CVS repository, so people can see
what's going on and maybe even can help out?
BTW you should really do something about your coding style, your code is
very confusing to read. I wouldn't care if it would be just some driver,
but devfs is supposed to be a very important part, so it would be nice
to use the same rules that apply to other important parts of the kernel.
bye, Roman
On Sun, 28 Oct 2001, Roman Zippel wrote:
> What about putting them somewhere in a CVS repository, so people can see
> what's going on and maybe even can help out?
Looks like I'll get around to creating a CVS repository starting at the
last known code in a couple of days anyway...
> BTW you should really do something about your coding style, your code is
> very confusing to read. I wouldn't care if it would be just some driver,
> but devfs is supposed to be a very important part, so it would be nice
> to use the same rules that apply to other important parts of the kernel.
Good luck.
BTW, Richard - the last one for tonight:
devfs_unregister() vs. get_vfs_inode(). The latter blocks, so devfs_lookup()
and devfs_d_revalidate() can give you a nasty surprise - entry gets
unregistered while we allocate the inode and there's no connection between
it and inode or dentry at that point. Then we merrily get dentry/inode
tied to unregistered devfs_entry. And that includes reference _to_ dentry.
Enjoy...
Appended to this mail is the "long live the struct block_device"
patch. It includes the stuff covered in the last patch as well. The
issue of stopping transfers in progress is not yet addressed.
How it works: rather than have bdput free up the struct block_device
unconditionnally as soon as the count drops to zero, forcefully
bd_forgetting any inodes that still reference it, it now checks the
list of inodes references it, and only frees it up if not referenced.
Moreover, bd_forget also calls __bdput to free up the struct
block_device when the number of referencers drops to zero later on.
Simple tests show this to make the cache workeable again, and an extra
printk in floppy_open allows to show that the struct block_device can
be freed up if needed (by creating high memory pressure).
diff -ur 2.4.13/linux/drivers/block/floppy.c linux/drivers/block/floppy.c
--- 2.4.13/linux/drivers/block/floppy.c Sat Oct 27 09:42:34 2001
+++ linux/drivers/block/floppy.c Sun Oct 28 21:21:46 2001
@@ -741,6 +741,7 @@
return UTESTF(FD_DISK_CHANGED);
if ((fd_inb(FD_DIR) ^ UDP->flags) & 0x80){
USETF(FD_VERIFY); /* verify write protection */
+ USETF(FD_DCL_SEEN); /* DCL for this drive has been seen */
if (UDRS->maxblock){
/* mark it changed */
USETF(FD_DISK_CHANGED);
@@ -3901,12 +3902,23 @@
return 0;
}
+/* determines whether we can trust media change
+ * answers true if we have seen disk change at least once, and not marked as
+ * broken */
+static int floppy_can_trust_media_change(kdev_t dev)
+{
+ int drive=DRIVE(dev);
+ return (UTESTF(FD_DCL_SEEN) &&
+ ((UDP->flags & FD_BROKEN_DCL) == 0));
+}
+
static struct block_device_operations floppy_fops = {
open: floppy_open,
release: floppy_release,
ioctl: fd_ioctl,
check_media_change: check_floppy_change,
revalidate: floppy_revalidate,
+ can_trust_media_change: floppy_can_trust_media_change
};
static void __init register_devfs_entries (int drive)
diff -ur 2.4.13/linux/fs/block_dev.c linux/fs/block_dev.c
--- 2.4.13/linux/fs/block_dev.c Sun Oct 28 09:52:53 2001
+++ linux/fs/block_dev.c Sun Oct 28 21:11:08 2001
@@ -346,19 +346,36 @@
inode->i_mapping = &inode->i_data;
}
+/**
+ * Physically delete bdev iff inode list empty AND counter zero. Must be
+ * called with bdev_lock held. bdev_lock will be released on exit.
+ * Should be called by all functions which can make either of the conditions
+ * true (decrease counter (bdput) or free inode (bdforget)
+ */
+static void __bdput(struct block_device *bdev) {
+ int shouldDrop =
+ (bdev->bd_inodes.next == &bdev->bd_inodes) &&
+ (atomic_read(&bdev->bd_count) == 0);
+ if(shouldDrop) {
+ if (bdev->bd_openers)
+ BUG();
+ list_del(&bdev->bd_hash);
+ }
+ spin_unlock(&bdev_lock);
+ if(shouldDrop) {
+ if (bdev->bd_openers)
+ BUG();
+ iput(bdev->bd_inode);
+ destroy_bdev(bdev);
+ }
+}
+
void bdput(struct block_device *bdev)
{
if (atomic_dec_and_lock(&bdev->bd_count, &bdev_lock)) {
- struct list_head *p;
if (bdev->bd_openers)
BUG();
- list_del(&bdev->bd_hash);
- while ( (p = bdev->bd_inodes.next) != &bdev->bd_inodes ) {
- __bd_forget(list_entry(p, struct inode, i_devices));
- }
- spin_unlock(&bdev_lock);
- iput(bdev->bd_inode);
- destroy_bdev(bdev);
+ __bdput(bdev);
}
}
@@ -390,9 +407,17 @@
void bd_forget(struct inode *inode)
{
+ struct block_device *bdev;
spin_lock(&bdev_lock);
- if (inode->i_bdev)
+ bdev = inode->i_bdev;
+ if (bdev) {
__bd_forget(inode);
+ if(inode != bdev->bd_inode) {
+ /* if not "root" inode, do the bdput */
+ __bdput(bdev);
+ return;
+ }
+ }
spin_unlock(&bdev_lock);
}
@@ -601,8 +626,12 @@
__block_fsync(bd_inode);
else if (kind == BDEV_FS)
fsync_no_super(rdev);
- if (!--bdev->bd_openers)
- kill_bdev(bdev);
+ if (!--bdev->bd_openers) {
+ if(!bdev->bd_op->can_trust_media_change ||
+ !bdev->bd_op->can_trust_media_change(rdev)) {
+ kill_bdev(bdev);
+ }
+ }
if (bdev->bd_op->release)
ret = bdev->bd_op->release(bd_inode, NULL);
if (!bdev->bd_openers)
diff -ur 2.4.13/linux/include/linux/fd.h linux/include/linux/fd.h
--- 2.4.13/linux/include/linux/fd.h Fri Aug 13 21:16:16 1999
+++ linux/include/linux/fd.h Sat Oct 27 13:37:08 2001
@@ -177,7 +177,8 @@
* to clear media change status */
FD_UNUSED_BIT,
FD_DISK_CHANGED_BIT, /* disk has been changed since last i/o */
- FD_DISK_WRITABLE_BIT /* disk is writable */
+ FD_DISK_WRITABLE_BIT, /* disk is writable */
+ FD_DCL_SEEN_BIT /* have we seen DCL? */
};
#define FDSETDRVPRM _IOW(2, 0x90, struct floppy_drive_params)
@@ -196,6 +197,7 @@
#define FD_DISK_NEWCHANGE (1 << FD_DISK_NEWCHANGE_BIT)
#define FD_DISK_CHANGED (1 << FD_DISK_CHANGED_BIT)
#define FD_DISK_WRITABLE (1 << FD_DISK_WRITABLE_BIT)
+#define FD_DCL_SEEN (1 << FD_DCL_SEEN)
unsigned long spinup_date;
unsigned long select_date;
diff -ur 2.4.13/linux/include/linux/fs.h linux/include/linux/fs.h
--- 2.4.13/linux/include/linux/fs.h Sat Oct 27 11:53:28 2001
+++ linux/include/linux/fs.h Sat Oct 27 13:05:03 2001
@@ -793,6 +793,7 @@
int (*ioctl) (struct inode *, struct file *, unsigned, unsigned long);
int (*check_media_change) (kdev_t);
int (*revalidate) (kdev_t);
+ int (*can_trust_media_change) (kdev_t);
};
/*
"A month of sundays ago Alain Knaff wrote:"
> Appended to this mail is the "long live the struct block_device"
> patch. It includes the stuff covered in the last patch as well. The
> issue of stopping transfers in progress is not yet addressed.
Errr .. haven't read the patches. But are you doing something so that
the semantics of the action taken after a media check fails can
be overridden? The current invalidate_inodes is too strong for me,
since I am proxying a remote device, and I don't want to kill
_all_ the local file descriptors when the remote media disappears.
I need to at least continue to send down local ioctls!
No, no suggestions of an extra control device, please. Simplicity.
> +
> static struct block_device_operations floppy_fops = {
> open: floppy_open,
> release: floppy_release,
> ioctl: fd_ioctl,
> check_media_change: check_floppy_change,
> revalidate: floppy_revalidate,
> + can_trust_media_change: floppy_can_trust_media_change
> };
and I'd like to have "invalidate" as a method too.
Peter
On Sun, 28 Oct 2001, Alain Knaff wrote:
> Appended to this mail is the "long live the struct block_device"
> patch. It includes the stuff covered in the last patch as well. The
> issue of stopping transfers in progress is not yet addressed.
Actually, both issues are not addressed. With your patch bdev will
happily live after rmmod. Please, wait a bit with that stuff. I'll
post a variant tonight - still need to verify that it deals correctly
with cases like SCSI (disk going away without unregister_blkdev()).
Alexander Viro writes:
>
>
> On Sun, 28 Oct 2001, Richard Gooch wrote:
>
> > Complete fucking bullshit. Over the last several months, I've been
> > sending a steady stream of bugfix patches to Linus and the list, and
> > if you'd been paying attention, you'd notice that in time they've been
> > applied.
>
> OK, _now_ I'm really pissed off. As far as I can see there is only
> one way to get you fix anything - posting to l-k. This "steady
> stream" consists of what? Let's see:
[...]
> Oh, and before that we have a (finally, only after a year of mentioning
> the crap in question, heavy-weight rant on l-k when I've finally ran out
> of patience _and_ detailed discussion on the possible fixes) fix for
> expand-entry-table races.
>
> So far all I see is that beating you hard enough in public can make
> you fix the bugs explicitly pointed to you. That's it. As far as I
> can see you don't read your own code, judging by the fact that every
> damn look at fs/devfs/base.c shows a new hole within a couple of
> minutes _and_ said holes stay until posted on l-k. Private mail
> doesn't work. You read it, reply and ignore. About hundred
> kilobytes of evidence available at request.
You don't get to see the bug reports or questions I respond to which
are sent to me privately or on the devfs list (I know you're not
subscribed:-). And you seem to have forgotten that I've responded to
questions or bug reports *from you* that you send privately to me,
sometimes Cc:ed to Linus. I've even responded to questions that you've
placed in the code. So it's simply not true that I only respond if
beat upon in public. Progress *is* being made, irrespective of your
"input".
As for the recent bug reports, yes, I've just seen them. I'll respond
(not because you've been flaming about it on the list) later this week
once I clear through my email backlog which accumulated while I was
off the 'net for a week. Yeah, it does take time to wade through all
the email, especially when I get greeted with a huge pile of flames.
Regards,
Richard....
Permanent: [email protected]
Current: [email protected]
On Mon, 29 Oct 2001, Richard Gooch wrote:
> Alexander Viro writes:
> > So far all I see is that beating you hard enough in public can make
> > you fix the bugs explicitly pointed to you. That's it. As far as I
> > can see you don't read your own code, judging by the fact that every
> > damn look at fs/devfs/base.c shows a new hole within a couple of
> > minutes _and_ said holes stay until posted on l-k. Private mail
> > doesn't work. You read it, reply and ignore. About hundred
^^^^^^^^^^^^^^^^^^^^^^^^^
> > kilobytes of evidence available at request.
>
> You don't get to see the bug reports or questions I respond to which
> are sent to me privately or on the devfs list (I know you're not
> subscribed:-). And you seem to have forgotten that I've responded to
> questions or bug reports *from you* that you send privately to me,
I see what made its way in your code and your changelog. As far as I
can see nothing contradicts description above.
1) You are maintainer of that code.
2) Couple of minutes of reading through it is enough to find a new hole.
3) There are dozens of such holes.
4) They had been there for years.
Conclusion:
You either can't see that stuff at all or you don't bother to spend even
minimal time looking for bugs.
See the problem with that situation?
>"A month of sundays ago Alain Knaff wrote:"
>> Appended to this mail is the "long live the struct block_device"
>> patch. It includes the stuff covered in the last patch as well. The
>> issue of stopping transfers in progress is not yet addressed.
>
>Errr .. haven't read the patches.
Please do ;-) Or maybe, rather browse the initial mails of this
thread, which discuss the problem being fixed here, and the reason why
that problem was there in the first place.
>But are you doing something so that
>the semantics of the action taken after a media check fails can
>be overridden? The current invalidate_inodes is too strong for me,
>since I am proxying a remote device, and I don't want to kill
>_all_ the local file descriptors when the remote media disappears.
>I need to at least continue to send down local ioctls!
>
>No, no suggestions of an extra control device, please. Simplicity.
Don't worry. This patch does not do anything like that. No extra
control device is introduced. No killing of local file descriptors in
case a remote media disappears. Please read the context of the post
(Subject: Poor floppy performance). The discussion has unfortunately
been stretched over more than a week, with long silences in between
(I've been busy with other stuff in between, and thus had to come back
later, sorry), so the initial messages might be somewhat hard to find,
but they're there in the archives:
http://www.uwsg.iu.edu/hypermail/linux/kernel/, in the week of Oct
16-23.
All the patch it does is address the following problem: if all file
descriptors pointing to a block device are closed, the cache is
closed, no matter whether the media has been changed or not... This
clearly leads to suboptimal performance, especially with slow devices
such as the floppy.
The reason for this behaviour was threefold
1. Apparently, for some media, check_media_change is not reliable, so
the VFS people decided not to trust it (don't ask...). Hence the
kill_bdev call, invoked after the last close, to throw away all
buffers
2. Kill_bdev is also needed to stop any read-aheads in progress after
last close (makes sense: if there is no open filedescriptors, those
read-aheads can no longer safely be served, because the device driver
may assume that nobody needs the device any longer and thus
de-allocate critical resources: IRQ, DMA, bounce buffers...)
3. Cached pages are identified by the pair (inode,index), and the
block device backend inode changes after the last close.
All the can_trust_media_change part does is skip the kill_bdev if the
device driver says "yes, you _can_ trust the answer of
check_media_change" (fix problem #1).
Problem number 2 is not yet addressed, Alex Viro is working on a patch
(probably by having a variant of invalidate_bdev that just stops
transfers in progress, loudly warn about dirty pages, but without
killing clean cached pages)
Number 3 is fixed by making struct block_device "long
lived". Formerly, it only existed as long as there were active open
descriptors using it; now it exists as long as there are frontend
inodes referencing it.
Another related issue (not yet addressed by my patch), is what happens
after rmmod. Right now, with the patch, the cache is so long-lived
that it hangs around even after an rmmod...
Personnally, I don't really like the "check_media_change not trusted"
semantics either... we better handle flaky media change indicators
internally in the driver, especially since the driver may have its own
limited internal caching (DMA bounce buffers, track buffers, ...). But
apparently other people have strong feelings about the issue, having
been burned badly by devices which occasionnally miss a media
change..., and who have chosen to ginore check_media_change
altogether, rather than fix those device drivers...
>> +
>> static struct block_device_operations floppy_fops = {
>> open: floppy_open,
>> release: floppy_release,
>> ioctl: fd_ioctl,
>> check_media_change: check_floppy_change,
>> revalidate: floppy_revalidate,
>> + can_trust_media_change: floppy_can_trust_media_change
>> };
>
>and I'd like to have "invalidate" as a method too.
>
>Peter
Regards,
Alain
On Mon, 29 Oct 2001, Alain Knaff wrote:
> Problem number 2 is not yet addressed, Alex Viro is working on a patch
> (probably by having a variant of invalidate_bdev that just stops
> transfers in progress, loudly warn about dirty pages, but without
> killing clean cached pages)
>
> Number 3 is fixed by making struct block_device "long
> lived". Formerly, it only existed as long as there were active open
> descriptors using it; now it exists as long as there are frontend
> inodes referencing it.
IMO that's bogus. Correct test is "do we have any data in page cache
for this guy?" (combined with "... and if driver says that it goes
away, we'd better drop them all"). See the patch I've just sent to you
and Linus - right now I consider it as too dangerous for public testing, so...
Notice that it _doesn't_ trust check_media_change() - it still has
both detach_metadata() (needed in all cases) and truncate_inode_pages()
(kills cache, needed only for drivers with b0rken ->check_media_change())
in blkdev_put(). You'll need to make the latter call conditional to
actually keep the cache around, but I'd really like to make sure that
infrastructure changes are sane before doing that step.
>
>
>On Mon, 29 Oct 2001, Alain Knaff wrote:
>
>> Problem number 2 is not yet addressed, Alex Viro is working on a patch
>> (probably by having a variant of invalidate_bdev that just stops
>> transfers in progress, loudly warn about dirty pages, but without
>> killing clean cached pages)
>>
>> Number 3 is fixed by making struct block_device "long
>> lived". Formerly, it only existed as long as there were active open
>> descriptors using it; now it exists as long as there are frontend
>> inodes referencing it.
>
>IMO that's bogus.
As far as I understood, the same logic is used for all other kinds of
inodes, isn't it? I agree that it is suboptimal (might stay around
longer than its pages), but at least it works in simple cases
(i.e. stays cached normally, and is evicted from cache under memory
pressure).
>Correct test is "do we have any data in page cache
>for this guy?" (combined with "... and if driver says that it goes
>away, we'd better drop them all"). See the patch I've just sent to you
>and Linus - right now I consider it as too dangerous for public testing, so...
Got the patch... but only after I sent that mail to Peter. I'll have a
look at it, and try to test it. But due to other activities, I'm
afraid I'll only be able to get around to it tomorrow evening.
>Notice that it _doesn't_ trust check_media_change() - it still has
>both detach_metadata() (needed in all cases) and truncate_inode_pages()
>(kills cache, needed only for drivers with b0rken ->check_media_change())
>in blkdev_put(). You'll need to make the latter call conditional
With your patch, will making that call conditional still correctly
stop any transfers which are already in progress? (a problem that has
existed with my patches)? Or does this problem need to be addressed
separately?
>to
>actually keep the cache around, but I'd really like to make sure that
>infrastructure changes are sane before doing that step.
Ok, I'll give it a spin tomorrow evening.
Btw, the more I think about the check_media_change issue, the more I
come to the conclusion that the cases of broken media change detection
are best dealt with from within the driver. Indeed, firstly why
encumber the vast majority of drivers, which are non-broken, or which
already deal with the situation, just in order to handle a few bad
apples? Secondly, the broken drivers might have internal caches (track
buffers, etc.), and just ignoring check_media_change _outside_ the
driver won't fix all of the problem, as the internal caches will not
be invalidated.
Thus, may I suggest the following course of action in case a missed
media change detection problem with a driver: make a good-faith effort
to contact its maintainer, and have the problem fixed _in_the_driver_
rather than to have media change ignored centrally. Same goes for
other issues as well: work _with_ the device drivers, rather than
_against_ them. I think, in the long run this will save all of us lots
of grief.
Yes, we can have can_trust_media_change type functions, but it just
makes all drivers more complex, and eventually it will lead to a
situation where bad device drivers might unconditionnally return true
to it anyways, and we'd be back to square one.
Alain
Alexander Viro writes:
> BTW, what the hell is that?
> /*
> * hwgraph_bdevsw_get - returns the fops of the given devfs entry.
> */
> struct file_operations *
> hwgraph_bdevsw_get(devfs_handle_t de)
> {
> return(devfs_get_ops(de));
> }
>
> It's arch/ia64/sn/io/hcl.c. The funny thing being, the thing
> you will get from devfs_get_ops() will _not_ be struct
> file_operations *. And that's aside of the fact that any use of
> that function is very likely to be racy as hell. Sigh...
Sigh indeed. I didn't write that code. Looks like someone is (ab)using
the devfs API. Contact the maintainer of that code for insight.
Regards,
Richard....
Permanent: [email protected]
Current: [email protected]
Alexander Viro writes:
> It just gets better and better...
Sorry, Al. You're wrong on this one...
> in get_removable_partition()
> if (strcmp (de->name, "disc") == 0) return check_disc_changed (de);
>
> with
> if ( name && (namelen < 1) ) namelen = strlen (name);
> if ( ( new = kmalloc (sizeof *new + namelen, GFP_KERNEL) ) == NULL )
> return NULL;
> ...
> if (name) memcpy (new->name, name, namelen);
>
> in create_entry().
>
> IOW, ->name is not NUL-terminated.
In fact, it is. Take a look at the declaration of struct devfs_entry
and you'll see that there is one byte already taken up for the name
string. So adding the length of the string yields storage for len+1
bytes. IOW, space for the NULL-terminator. And right after the call to
kmalloc(), I call:
memset (new, 0, sizeof *new + namelen);
which zeros everything, including the terminator.
If I missed something as basic as NULL-termination, the thing probably
wouldn't even boot!
Regards,
Richard....
Permanent: [email protected]
Current: [email protected]
Alexander Viro writes:
> devfs_rmdir() checks that directory is empty. Then it calls
> devfsd_notify_one(), which can block. Then it marks the entry
> unregistered and reports success.
>
> Guess what will happen if devfs_register() will happen at that
> moment...
Yeah, I fixed that one a couple of months ago in the new tree. I now
unhook from the directory structure before sending the notification.
Regards,
Richard....
Permanent: [email protected]
Current: [email protected]
Alexander Viro writes:
> ... and one more - devfs_unregister() on a directory happening when
> mknod() in that directory sleeps in create_entry() (on kmalloc()).
>
> Do you ever read your own code? It's not like this stuff was hard
> to find - I'm just poking into random places and every damn one
> contains a hole. Sigh...
>
> Oh, BTW - here's another one: think what happens if tree-walking in
> unregister() steps on the entry we are currently removing in
> devfs_unlink().
Yep, as I've long ago admitted, there are races in the old devfs
code, which couldn't be fixed without proper locking. And that's why
I've been wanting to add said locking for ages, and have been
frustrated at interruptions which delayed that work. And I'm very
happy to get the first cut of the new code released.
That said, try to understand (before getting emotional and launching
off a tirade such as the one last week) that different people have
different priorities, and mine was to provide functionality first, and
worry about hostile attacks/exploits later. This is not unreasonable
if you consider that the initial target machines for devfs were:
- my personal boxes (which are not public machines)
- big-iron machines sitting behind a firewall
- small university group sitting behind a firewall (and I know where
all the users live:-)
I know your favourite horror scenario is the public machines available
to the undergrads, but not everyone works in such a hostile
environment.
Anyway, I hope Linus wasn't bored by all the messages you Cc:ed to
him for your^H^H^H^Hhis benefit :-O
Regards,
Richard....
Permanent: [email protected]
Current: [email protected]
Alexander Viro writes:
> BTW, here's one more devfs rmmod race: check_disk_changed() in
> fs/devfs/base.c. Calling ->check_media_change() with no protection
> whatsoever. If rmmod happens at that point...
How is this different from a call to fs/block_dev.c:check_disk_change()
which also has no protection?
Regards,
Richard....
Permanent: [email protected]
Current: [email protected]
On Tue, 6 Nov 2001, Richard Gooch wrote:
> Alexander Viro writes:
> > BTW, here's one more devfs rmmod race: check_disk_changed() in
> > fs/devfs/base.c. Calling ->check_media_change() with no protection
> > whatsoever. If rmmod happens at that point...
>
> How is this different from a call to fs/block_dev.c:check_disk_change()
> which also has no protection?
It's called either from driver itself (and then whatever protects caller
protects it as well) or after we'd pinned the thing down by blkdev_get().
On Mon, 5 Nov 2001, Richard Gooch wrote:
> Yep, as I've long ago admitted, there are races in the old devfs
> code, which couldn't be fixed without proper locking. And that's why
> I've been wanting to add said locking for ages, and have been
> frustrated at interruptions which delayed that work. And I'm very
> happy to get the first cut of the new code released.
BTW, new code still has both aforementioned races - detaching entries
from the tree doesn't help with that.
> That said, try to understand (before getting emotional and launching
> off a tirade such as the one last week) that different people have
> different priorities, and mine was to provide functionality first, and
> worry about hostile attacks/exploits later. This is not unreasonable
> if you consider that the initial target machines for devfs were:
> - my personal boxes (which are not public machines)
> - big-iron machines sitting behind a firewall
> - small university group sitting behind a firewall (and I know where
> all the users live:-)
That's nice, but that had stopped being the case as soon as you've proposed
devfs for inclusion into the tree...
Alexander Viro writes:
>
>
> On Mon, 5 Nov 2001, Richard Gooch wrote:
>
> > Yep, as I've long ago admitted, there are races in the old devfs
> > code, which couldn't be fixed without proper locking. And that's why
> > I've been wanting to add said locking for ages, and have been
> > frustrated at interruptions which delayed that work. And I'm very
> > happy to get the first cut of the new code released.
>
> BTW, new code still has both aforementioned races - detaching
> entries from the tree doesn't help with that.
Which "both"? You sent quite a few messages, listing more than two
races. I'm still wading through the list.
> > That said, try to understand (before getting emotional and launching
> > off a tirade such as the one last week) that different people have
> > different priorities, and mine was to provide functionality first, and
> > worry about hostile attacks/exploits later. This is not unreasonable
> > if you consider that the initial target machines for devfs were:
> > - my personal boxes (which are not public machines)
> > - big-iron machines sitting behind a firewall
> > - small university group sitting behind a firewall (and I know where
> > all the users live:-)
>
> That's nice, but that had stopped being the case as soon as you've
> proposed devfs for inclusion into the tree...
Marked CONFIG_EXPERIMENTAL...
Regards,
Richard....
Permanent: [email protected]
Current: [email protected]
On Tue, 6 Nov 2001, Richard Gooch wrote:
> > BTW, new code still has both aforementioned races - detaching
^^^^^^^^^^^^^^
> > entries from the tree doesn't help with that.
>
> Which "both"? You sent quite a few messages, listing more than two
> races. I'm still wading through the list.
Sorry - trimmed more than I should have. Ones mentioned in the message you
were replying to - unregister() on parent vs. mknod() or unlink().
Alexander Viro writes:
> On Sat, 27 Oct 2001, Alain Knaff wrote:
> > Good. But then, what's the point of devfs=only ? I assumed this was
>
> Ask Richard. Maybe you will be able to get a straight answer. I
> hadn't...
IIRC that I've told you this already. Here it is again: devfs=only
serves as a way of enforcing that the devfs entry->driver ops
connection is the only way of accessing a driver. It deliberately
breaks the fallback to major-table-lookup.
And it actually works now. It doesn't require massive 2.5 changes.
When I boot with devfs=only (which is always), my system still works.
Regards,
Richard....
Permanent: [email protected]
Current: [email protected]
On Tue, 6 Nov 2001, Richard Gooch wrote:
> IIRC that I've told you this already. Here it is again: devfs=only
> serves as a way of enforcing that the devfs entry->driver ops
> connection is the only way of accessing a driver. It deliberately
> breaks the fallback to major-table-lookup.
>
> And it actually works now. It doesn't require massive 2.5 changes.
> When I boot with devfs=only (which is always), my system still works.
Try rmmod while blkdev_get() (from mount(2) or swapon(2)) sleeps on ->bd_sem.
And think what happens with value of ->bd_op.
Alexander Viro writes:
>
>
> On Tue, 6 Nov 2001, Richard Gooch wrote:
>
> > > BTW, new code still has both aforementioned races - detaching
> ^^^^^^^^^^^^^^
> > > entries from the tree doesn't help with that.
> >
> > Which "both"? You sent quite a few messages, listing more than two
> > races. I'm still wading through the list.
>
> Sorry - trimmed more than I should have. Ones mentioned in the
> message you were replying to - unregister() on parent vs. mknod() or
> unlink().
Ah, those. Next patch (just released) cleans that up a bit more. This
stuff is still a work in progress.
Regards,
Richard....
Permanent: [email protected]
Current: [email protected]
[Finally, after other things have stabilised, I'm ready to tackle this
issue]
Alexander Viro writes:
> BTW, here's one more devfs rmmod race: check_disk_changed() in
> fs/devfs/base.c. Calling ->check_media_change() with no protection
> whatsoever. If rmmod happens at that point...
How about if I do this sequence:
lock_kernel();
devfs checks;
if (bd_op->owner)
__MOD_INC_USE_COUNT(bd_op->owner);
revalidate();
if (bd_op->owner)
__MOD_DEC_USE_COUNT(bd_op->owner);
unlock_kernel();
Is there any reason why that won't work?
Regards,
Richard....
Permanent: [email protected]
Current: [email protected]