2007-08-27 10:31:06

by Al Viro

[permalink] [raw]
Subject: [RFC] block_device_operations prototype changes

It's time to sanitize prototypes of bdev ->open(), ->release()
and ->ioctl(). This stuff had sat in "need to fix" for a long time
and there is a bunch of bugs hard to fix without dealing with it.

1) ->open() gets inode * and file *. Almost all instances use only
inode->i_bdev
file->f_mode
file->f_flags - O_ACCMODE|O_NDELAY|O_EXCL
and nothing else. Most actually use only inode->i_bdev->bd_disk and
those are easy; the trouble begins when we use f_mode and/or f_flags
in non-trivial ways. One good example is cdrom.c - its behaviour
depends on whether we have O_NDELAY (doesn't lock the door, among
other things) and on whether we are opening for write. Unfortunately,
it's not just something we care about at open() time - we need different
cleanups at close(), and that doesn't work at all.

At the very least, we need to know whether we want to unlock the door.
Suppose somebody opens it with O_NDELAY (ioctl-only) when it's already
opened for data (or mounted, for that matter). Now we close it; how
do we know if we need to unlock? Even if we did get struct file * (and
we do not, but that's a separate story), we could not rely on file->f_flags.
Why? Because O_NDELAY can be flipped at will by fcntl() after the file
got opened. IOW, we can't rely on it.

Most of that crap can be solved by saving the relevant information from
f_flags into f_mode (e.g. fs/block_dev.c::do_open()), where they'd remain
safe. However, that's not quite enough - struct file we pass to ->open()
is often a fake one and it certainly won't live until ->release().

One thing that _can_ be solved that way, though, and it alone makes the
scheme worthwhile: floppy.c abuses can be finally killed. Just saving
->f_flags & 2 in ->f_mode is enough to eliminate the need for floppy_open()
to play games with permission(9) and modify *file in any way. Anything
close to current fdutils either passes O_RDWR |... or 3 | ... to open()
when it wants to do priveleged ioctls and then we get open_namei check
that we can write to file. So we can check that bit saved in ->f_mode
for "can we do that kind of ioctl"; no need to mess with anything.

That's a *big* win, since this is eliminates the only place that uses
more than the fields mentioned above and the only place that tries to
modify struct file and check for modifications in later method calls.

So having done that, we get to the situation where ->open() only uses
inode->i_bdev and file->f_mode (and only reads these). I.e. we will be
able to go for much saner
int (*open)(struct block_device *bdev, mode_t mode)

We even can kill that fake struct file. At that point we have sanitized
->open(), no regressions and a chance to solve the problems with mode-dependent
cleanups on close(). However, to use that chance we'll need to do something
with ->release().

2) ->release() gets struct inode * and struct file * too. It only uses
inode->i_bdev->bd_disk and, sometimes, file->f_mode. Tries to use
file->f_mode, that is. Even all way back in 2.2 when it was ->release()
of file_operations, we used to get NULL on operations like umount, etc.,
so users of file->f_mode/file->f_flags had to check for file != NULL
first. And pray that all such users will want some reasonable default
behaviour.

For one thing, that's not true anymore. E.g. pktcdvd.c does blkdev_get()
with O_NDELAY and blkdev_put() as matching close. IOW, blkdev_put() can't
expect that matching open had no O_NDELAY.

For another, blkdev_put() is used by blkdev_close() since 2.3, so we _always_
get NULL now. IOW, the cleanup-related logics doesn't work properly, period.

The obvious first step is to switch to
int (*release)(struct gendisk *disk, mode_t mode)
(there are good reasons why open wants bdev and release does not). At least
that way we can start passing the right value without fake struct file, etc.

Now, the only thing we need is a way to pass that mode_t to blkdev_put().
IOW, blkdev_put() needs an extra argument. And that's where the things
get really interesting. blkdev_close() should just pass file->f_mode to
it; that's easy. However, we have more callers and need case-by-case
analysis for those.

We don't have too many of those callers, fortunately, and almost all of them
are easy - we know what had matching blkdev_open() got and that's it.

Exceptions are interesting.
* we have reiserfs journal that could've been opened r/o or r/w.
BFD - we can store the mode_t just fine along with the reference to bdev.
* a bug (AFAICT) in md.c - we open raid components r/w and if it
fails, it fails. Good for our purposes, but... how about raid0 on read-only
devices? In any case, we have a ready place to store mode_t, so it's not a
problem for getting the right value to blkdev_put(). FWIW, I think that
allowing fallback to r/o (and making the entire array r/o, of course) would
be a good thing to have. Any comments from drivers/md folks?
* open_bdev_excl()/close_bdev_excl(). Needs an extra argument for
the latter. Two interesting callers:
* kill_block_super() - need to store relevant mode_t in superblock,
along with reference to bdev. Note that just looking at sb->s_flags is *not*
enough - some filesystems go read-only on errors and that changes ->s_flags.
Another side of that is explicit remount from r/o to r/w and there we have
a bug - we do _not_ tell the driver that something had happened. Proper
fix is simple enough - bdget() + blkdev_get() for write + blkdev_put() with
old mode_t (i.e. FMODE_READ) once we are done remounting (and similar for
explicit remount to r/o).
* mtd2block may do unbalanced close_bdev_excl() and it passes bogus
arguments to open_bdev_excl() - NULL holder. Easy to fix.

So with that having dealt with we'll get reliable mode_t passed to ->release(),
matching those passed on ->open(). Victory.

3) ->ioctl(). What a mess... First of all, we have 3 methods with different
calling conventions:
->ioctl(inode, file, cmd, arg)
->unlocked_ioctl(inode, file, cmd, arg)
->compat_ioctl(file, cmd, arg)
The first one returns int; other two return long, bugger if I know why.
ioctl(3) returns int. So any value that doesn't fit into native int
will be truncated anyway. ->compat_ioctl() is even more ridiculous,
since it's called from 32bit task doing ioctl(2) and returning native
long is particulary dumb.

Moreover, we don't have enough instances of ->ioctl() to justify a separate
methods for locked and unlocked; taking BKL into the former and merging
them is easy enough.

Now, as with ->open(), we care only about inode->i_bdev and file->f_mode
(assuming we'd dealt with ->f_flags -> ->f_mode transition above). Which
is what these suckers need to get. IOW, we get to
int (*ioctl)(bdev, mode, cmd, arg)
int (*compat_ioctl)(bdev, mode, cmd, arg)
both assuming that BKL had not been taken by caller.

That will take a series of preparatory patches changing the prototypes of
common helpers (scsi_cmd_ioctl() and its ilk). BTW, what the hell is
block/bsg.c doing with passing NULL bd_disk to scsi_cmd_ioctl()? Not a
problem for this patch series, but it looks fishy...

That's more or less it; resulting APIs will be a lot saner and entire thing
is reasonably easy to split into bisect-friendly chunks. It is an API breaker,
of course, but we don't have a lot of affected modules and anything not
converted will be (a) immediately caught by gcc and (b) trivial to convert.

I have the beginning of that series done and the rest mapped out in enough
details to implement it over this week. If somebody has objections,
questions or comments - yell.


2007-08-27 16:36:25

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC] block_device_operations prototype changes



On Mon, 27 Aug 2007, Al Viro wrote:
>
> I have the beginning of that series done and the rest mapped out in enough
> details to implement it over this week. If somebody has objections,
> questions or comments - yell.

>From your description, I have no objections - everything sounds good. My
only concern is how painful the patch ends up being (and a worry about
whether this will affect a metric truck-load of external modules? That
said, I can't really see us worrying about those)

Linus

2007-08-27 22:59:30

by Alasdair G Kergon

[permalink] [raw]
Subject: Re: [RFC] block_device_operations prototype changes

On Mon, Aug 27, 2007 at 11:30:53AM +0100, Al Viro wrote:
> 3) ->ioctl(). What a mess...

Yup.

See also:
Subject: [PATCH] dm: support ioctls on mapped devices: fix with fake file
http://uwsg.indiana.edu/hypermail/linux/kernel/0606.2/2979.html

and related threads.

> First of all, we have 3 methods with different
> calling conventions:
> ->ioctl(inode, file, cmd, arg)
> ->unlocked_ioctl(inode, file, cmd, arg)

When I last looked it was:
long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
with the lack of inode forcing dm to use ->ioctl (because file can be NULL when
only the block device is known) and immediately drop the pointless-for-us
lock!

Alasdair
--
[email protected]

2007-08-27 23:34:04

by NeilBrown

[permalink] [raw]
Subject: Re: [RFC] block_device_operations prototype changes

On Monday August 27, [email protected] wrote:
> * a bug (AFAICT) in md.c - we open raid components r/w and if it
> fails, it fails. Good for our purposes, but... how about raid0 on read-only
> devices? In any case, we have a ready place to store mode_t, so it's not a
> problem for getting the right value to blkdev_put(). FWIW, I think that
> allowing fallback to r/o (and making the entire array r/o, of course) would
> be a good thing to have. Any comments from drivers/md folks?

I've never heard any suggestion of anybody wanting to include a
readonly device in an md array - I guess one could imagine an array
of CDROMs, but I doubt anyone would actually create one.
However I agree that falling back to read-only (particular if the
error value indicated that this was the only problem) would be a
sensible thing to do in some circumstances.

More interestingly: one could argue that when the md array is
switched to read-only, each component device should be reopened as
read-only. So I would really like the interface to have a way to
switch a device between read-only and read-write without closing and
re-opening.

Though I guess opening and then closing as you suggest below would be OK.

> * open_bdev_excl()/close_bdev_excl(). Needs an extra argument for
> the latter. Two interesting callers:
> * kill_block_super() - need to store relevant mode_t in superblock,
> along with reference to bdev. Note that just looking at sb->s_flags is *not*
> enough - some filesystems go read-only on errors and that changes ->s_flags.
> Another side of that is explicit remount from r/o to r/w and there we have
> a bug - we do _not_ tell the driver that something had happened. Proper
> fix is simple enough - bdget() + blkdev_get() for write + blkdev_put() with
> old mode_t (i.e. FMODE_READ) once we are done remounting (and similar for
> explicit remount to r/o).


I would *really* like to see this. When the root filesystem gets
mounted read-only, I want md to know so that it can mark the array as
'clean' straight away. I really don't like having a reboot-notifier
to mark all arrays as 'clean' on shutdown.

So each device would be responsible for keeping a count of 'readonly'
and 'read-write' accesses? Or would that be in common code?

I look forward to this all getting cleaned up!

NeilBrown