2023-08-11 11:50:47

by Jan Kara

Subject: [PATCH v2 0/29] block: Make blkdev_get_by_*() return handle

Hello,

this is a v2 of the patch series which implements the idea of blkdev_get_by_*()
calls returning bdev_handle which is then passed to blkdev_put() [1]. This
makes the get and put calls for bdevs more obviously matching and allows us to
propagate context from get to put without having to modify all the users
(again!). In particular I need to propagate used open flags to blkdev_put() to
be able to count writeable opens and add support for blocking writes to mounted
block devices. I'll send that series separately.
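
As a rough illustration of the get/put pairing described above, here is a
small userspace model. It is only a sketch: every name in it (model_bdev,
model_handle, model_open(), model_release(), the MODEL_OPEN_* flags) is made
up for the example and is not the API added by the series. It shows how the
open mode can travel inside the handle so that the release side can keep a
count of writeable opens without any extra argument.

```c
#include <assert.h>
#include <stdlib.h>

/* Userspace model only; none of these types are the kernel's. */
enum { MODEL_OPEN_READ = 1, MODEL_OPEN_WRITE = 2 };

struct model_bdev {
	int openers;	/* all current opens */
	int writers;	/* opens that asked for write access */
};

/* What the open call returns: the device plus the context of this open. */
struct model_handle {
	struct model_bdev *bdev;
	void *holder;
	unsigned int mode;
};

static struct model_handle *model_open(struct model_bdev *bdev,
				       unsigned int mode, void *holder)
{
	struct model_handle *h = malloc(sizeof(*h));

	if (!h)
		return NULL;
	h->bdev = bdev;
	h->holder = holder;
	h->mode = mode;
	bdev->openers++;
	if (mode & MODEL_OPEN_WRITE)
		bdev->writers++;
	return h;
}

/* Release takes only the handle: the mode used at open time is inside it. */
static void model_release(struct model_handle *h)
{
	assert(h->bdev->openers > 0);
	if (h->mode & MODEL_OPEN_WRITE)
		h->bdev->writers--;
	h->bdev->openers--;
	free(h);
}
```

A caller in this model keeps the handle for as long as the device is open and
reaches the device through h->bdev, much like the converted users in this
series dereference their stored handle.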

The series is based on Christian's vfs tree as of yesterday as there is quite
a bit of overlap. Patches have passed some reasonable testing - I've tested block
changes, md, dm, bcache, xfs, btrfs, ext4, swap. This obviously doesn't cover
everything so I'd like to ask respective maintainers to review / test their
changes. Thanks! I've pushed out the full branch to:

git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs.git bdev_handle

to ease review / testing.

Changes since v1:
* Rebased on top of current vfs tree
* Renamed final functions to bdev_open_by_*() and bdev_release()
* Fixed detection of exclusive open in blkdev_ioctl() and blkdev_fallocate()
* Fixed swap conversion to properly reinitialize swap_info->bdev_handle
* Fixed xfs conversion to not oops with rtdev without logdev
* A couple of other minor fixups

Honza

[1] https://lore.kernel.org/all/[email protected]

CC: Alasdair Kergon <[email protected]>
CC: Andrew Morton <[email protected]>
CC: Anna Schumaker <[email protected]>
CC: Chao Yu <[email protected]>
CC: Christian Borntraeger <[email protected]>
CC: Coly Li <[email protected]>
CC: "Darrick J. Wong" <[email protected]>
CC: Dave Kleikamp <[email protected]>
CC: David Sterba <[email protected]>
CC: [email protected]
CC: [email protected]
CC: Gao Xiang <[email protected]>
CC: Jack Wang <[email protected]>
CC: Jaegeuk Kim <[email protected]>
CC: [email protected]
CC: Joern Engel <[email protected]>
CC: Joseph Qi <[email protected]>
CC: Kent Overstreet <[email protected]>
CC: [email protected]
CC: [email protected]
CC: [email protected]
CC: <[email protected]>
CC: [email protected]
CC: [email protected]
CC: [email protected]
CC: [email protected]
CC: [email protected]
CC: [email protected]
CC: [email protected]
CC: [email protected]
CC: [email protected]
CC: [email protected]
CC: [email protected]
CC: "Md. Haris Iqbal" <[email protected]>
CC: Mike Snitzer <[email protected]>
CC: Minchan Kim <[email protected]>
CC: [email protected]
CC: [email protected]
CC: Sergey Senozhatsky <[email protected]>
CC: Song Liu <[email protected]>
CC: Sven Schnelle <[email protected]>
CC: [email protected]
CC: Ted Tso <[email protected]>
CC: Trond Myklebust <[email protected]>
CC: [email protected]

Previous versions:
Link: http://lore.kernel.org/r/[email protected] # v1


2023-08-11 12:05:49

by Jan Kara

Subject: [PATCH 25/29] nfs/blocklayout: Convert to use bdev_open_by_dev/path()

Convert block device handling to use bdev_open_by_dev/path() and pass
the handle around.

CC: [email protected]
CC: Trond Myklebust <[email protected]>
CC: Anna Schumaker <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
---
fs/nfs/blocklayout/blocklayout.h | 2 +-
fs/nfs/blocklayout/dev.c | 76 ++++++++++++++++----------------
2 files changed, 38 insertions(+), 40 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.h b/fs/nfs/blocklayout/blocklayout.h
index 716bc75e9ed2..b4294a8aa2d4 100644
--- a/fs/nfs/blocklayout/blocklayout.h
+++ b/fs/nfs/blocklayout/blocklayout.h
@@ -108,7 +108,7 @@ struct pnfs_block_dev {
struct pnfs_block_dev *children;
u64 chunk_size;

- struct block_device *bdev;
+ struct bdev_handle *bdev_handle;
u64 disk_offset;

u64 pr_key;
diff --git a/fs/nfs/blocklayout/dev.c b/fs/nfs/blocklayout/dev.c
index 70f5563a8e81..0b57b8a4b7e2 100644
--- a/fs/nfs/blocklayout/dev.c
+++ b/fs/nfs/blocklayout/dev.c
@@ -25,17 +25,17 @@ bl_free_device(struct pnfs_block_dev *dev)
} else {
if (dev->pr_registered) {
const struct pr_ops *ops =
- dev->bdev->bd_disk->fops->pr_ops;
+ dev->bdev_handle->bdev->bd_disk->fops->pr_ops;
int error;

- error = ops->pr_register(dev->bdev, dev->pr_key, 0,
- false);
+ error = ops->pr_register(dev->bdev_handle->bdev,
+ dev->pr_key, 0, false);
if (error)
pr_err("failed to unregister PR key.\n");
}

- if (dev->bdev)
- blkdev_put(dev->bdev, NULL);
+ if (dev->bdev_handle)
+ bdev_release(dev->bdev_handle);
}
}

@@ -169,7 +169,7 @@ static bool bl_map_simple(struct pnfs_block_dev *dev, u64 offset,
map->start = dev->start;
map->len = dev->len;
map->disk_offset = dev->disk_offset;
- map->bdev = dev->bdev;
+ map->bdev = dev->bdev_handle->bdev;
return true;
}

@@ -236,28 +236,26 @@ bl_parse_simple(struct nfs_server *server, struct pnfs_block_dev *d,
struct pnfs_block_volume *volumes, int idx, gfp_t gfp_mask)
{
struct pnfs_block_volume *v = &volumes[idx];
- struct block_device *bdev;
+ struct bdev_handle *bdev_handle;
dev_t dev;

dev = bl_resolve_deviceid(server, v, gfp_mask);
if (!dev)
return -EIO;

- bdev = blkdev_get_by_dev(dev, BLK_OPEN_READ | BLK_OPEN_WRITE, NULL,
- NULL);
- if (IS_ERR(bdev)) {
+ bdev_handle = bdev_open_by_dev(dev, BLK_OPEN_READ | BLK_OPEN_WRITE,
+ NULL, NULL);
+ if (IS_ERR(bdev_handle)) {
printk(KERN_WARNING "pNFS: failed to open device %d:%d (%ld)\n",
- MAJOR(dev), MINOR(dev), PTR_ERR(bdev));
- return PTR_ERR(bdev);
+ MAJOR(dev), MINOR(dev), PTR_ERR(bdev_handle));
+ return PTR_ERR(bdev_handle);
}
- d->bdev = bdev;
-
-
- d->len = bdev_nr_bytes(d->bdev);
+ d->bdev_handle = bdev_handle;
+ d->len = bdev_nr_bytes(bdev_handle->bdev);
d->map = bl_map_simple;

printk(KERN_INFO "pNFS: using block device %s\n",
- d->bdev->bd_disk->disk_name);
+ bdev_handle->bdev->bd_disk->disk_name);
return 0;
}

@@ -302,10 +300,10 @@ bl_validate_designator(struct pnfs_block_volume *v)
}
}

-static struct block_device *
+static struct bdev_handle *
bl_open_path(struct pnfs_block_volume *v, const char *prefix)
{
- struct block_device *bdev;
+ struct bdev_handle *bdev_handle;
const char *devname;

devname = kasprintf(GFP_KERNEL, "/dev/disk/by-id/%s%*phN",
@@ -313,15 +311,15 @@ bl_open_path(struct pnfs_block_volume *v, const char *prefix)
if (!devname)
return ERR_PTR(-ENOMEM);

- bdev = blkdev_get_by_path(devname, BLK_OPEN_READ | BLK_OPEN_WRITE, NULL,
- NULL);
- if (IS_ERR(bdev)) {
+ bdev_handle = bdev_open_by_path(devname, BLK_OPEN_READ | BLK_OPEN_WRITE,
+ NULL, NULL);
+ if (IS_ERR(bdev_handle)) {
pr_warn("pNFS: failed to open device %s (%ld)\n",
- devname, PTR_ERR(bdev));
+ devname, PTR_ERR(bdev_handle));
}

kfree(devname);
- return bdev;
+ return bdev_handle;
}

static int
@@ -329,7 +327,7 @@ bl_parse_scsi(struct nfs_server *server, struct pnfs_block_dev *d,
struct pnfs_block_volume *volumes, int idx, gfp_t gfp_mask)
{
struct pnfs_block_volume *v = &volumes[idx];
- struct block_device *bdev;
+ struct bdev_handle *bdev_handle;
const struct pr_ops *ops;
int error;

@@ -342,32 +340,32 @@ bl_parse_scsi(struct nfs_server *server, struct pnfs_block_dev *d,
* On other distributions like Debian, the default SCSI by-id path will
* point to the dm-multipath device if one exists.
*/
- bdev = bl_open_path(v, "dm-uuid-mpath-0x");
- if (IS_ERR(bdev))
- bdev = bl_open_path(v, "wwn-0x");
- if (IS_ERR(bdev))
- return PTR_ERR(bdev);
- d->bdev = bdev;
-
- d->len = bdev_nr_bytes(d->bdev);
+ bdev_handle = bl_open_path(v, "dm-uuid-mpath-0x");
+ if (IS_ERR(bdev_handle))
+ bdev_handle = bl_open_path(v, "wwn-0x");
+ if (IS_ERR(bdev_handle))
+ return PTR_ERR(bdev_handle);
+ d->bdev_handle = bdev_handle;
+
+ d->len = bdev_nr_bytes(d->bdev_handle->bdev);
d->map = bl_map_simple;
d->pr_key = v->scsi.pr_key;

pr_info("pNFS: using block device %s (reservation key 0x%llx)\n",
- d->bdev->bd_disk->disk_name, d->pr_key);
+ d->bdev_handle->bdev->bd_disk->disk_name, d->pr_key);

- ops = d->bdev->bd_disk->fops->pr_ops;
+ ops = d->bdev_handle->bdev->bd_disk->fops->pr_ops;
if (!ops) {
pr_err("pNFS: block device %s does not support reservations.",
- d->bdev->bd_disk->disk_name);
+ d->bdev_handle->bdev->bd_disk->disk_name);
error = -EINVAL;
goto out_blkdev_put;
}

- error = ops->pr_register(d->bdev, 0, d->pr_key, true);
+ error = ops->pr_register(d->bdev_handle->bdev, 0, d->pr_key, true);
if (error) {
pr_err("pNFS: failed to register key for block device %s.",
- d->bdev->bd_disk->disk_name);
+ d->bdev_handle->bdev->bd_disk->disk_name);
goto out_blkdev_put;
}

@@ -375,7 +373,7 @@ bl_parse_scsi(struct nfs_server *server, struct pnfs_block_dev *d,
return 0;

out_blkdev_put:
- blkdev_put(d->bdev, NULL);
+ bdev_release(d->bdev_handle);
return error;
}

--
2.35.3


2023-08-11 13:05:29

by Christoph Hellwig

Subject: Re: [PATCH v2 0/29] block: Make blkdev_get_by_*() return handle

Except for a mostly cosmetic nitpick this looks good to me:

Acked-by: Christoph Hellwig <[email protected]>

That's not exactly the deep review I'd like to do, but as I'm about to
head out for vacation that's probably as good as it gets.

2023-08-26 07:41:32

by Al Viro

Subject: Re: [PATCH v2 0/29] block: Make blkdev_get_by_*() return handle

On Fri, Aug 25, 2023 at 03:47:56PM +0200, Jan Kara wrote:

> I can see the appeal of not having to introduce the new bdev_handle type
> and just using struct file which unifies in-kernel and userspace block
> device opens. But I can see downsides too - the last fput() happening from
> task work makes me a bit nervous whether it will not break something
> somewhere with exclusive bdev opens. Getting from struct file to bdev is
> somewhat harder but I guess a helper like F_BDEV() would solve that just
> fine.
>
> So besides my last fput() worry I think this could work and would be
> probably a bit nicer than what I have. But before going and redoing the whole
> series let me gather some more feedback so that we don't go back and forth.
> Christoph, Christian, Jens, any opinion?

Redoing is not an issue - it can be done on top of your series just
as well. Async behaviour of fput() might be, but... need to look
through the actual users; for a lot of them it's perfectly fine.

FWIW, from a cursory look there appears to be a missing primitive: take
an opened bdev (or bdev_handle, with your variant, or opened file if we
go that way eventually) and claim it.

I mean, look at claim_swapfile() for example:
	p->bdev = blkdev_get_by_dev(inode->i_rdev,
			   FMODE_READ | FMODE_WRITE | FMODE_EXCL, p);
	if (IS_ERR(p->bdev)) {
		error = PTR_ERR(p->bdev);
		p->bdev = NULL;
		return error;
	}
	p->old_block_size = block_size(p->bdev);
	error = set_blocksize(p->bdev, PAGE_SIZE);
	if (error < 0)
		return error;
we already have the file opened, and we keep it opened all the way until
the swapoff(2); here we have noticed that it's a block device and we
	* open the fucker again (by device number), this time claiming
it with our swap_info_struct as holder, to be closed at swapoff(2) time
(just before we close the file)
	* flip the block size to PAGE_SIZE, to be reverted at swapoff(2)
time
That really looks like it ought to be
	* take the opened file, see that it's a block device
	* try to claim it with that holder
	* on success, flip the block size
with close_filp() in the swapoff(2) (or failure exit path in swapon(2))
doing what it would've done for an O_EXCL opened block device.
The only difference from O_EXCL userland open is that here we would
end up with holder pointing not to struct file in question, but to our
swap_info_struct. It will do the right thing.

This extra open is entirely due to "well, we need to claim it and the
primitive that does that happens to be tied to opening"; feels rather
counter-intuitive.
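
To make the shape of that missing primitive concrete, here is a userspace
model of the "claim an already opened device" sequence. It is a sketch under
loud assumptions: model_bdev, model_swap, model_claim(), model_unclaim(),
model_swapon() and model_swapoff() are all invented for the illustration and
nothing like them exists in the tree under these names.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Userspace model with made-up names; not kernel code. */
struct model_bdev {
	void *holder;			/* NULL when nobody has claimed it */
	unsigned int block_size;
};

struct model_swap {
	struct model_bdev *bdev;
	unsigned int old_block_size;
};

/* Claim a device that is already open; no second open involved. */
static int model_claim(struct model_bdev *bdev, void *holder)
{
	assert(holder != NULL);
	if (bdev->holder)
		return -EBUSY;
	bdev->holder = holder;
	return 0;
}

static void model_unclaim(struct model_bdev *bdev, void *holder)
{
	assert(bdev->holder == holder);
	bdev->holder = NULL;
}

/* The swapon(2) side: claim with the swap struct as holder, flip size. */
static int model_swapon(struct model_swap *p, struct model_bdev *bdev,
			unsigned int page_size)
{
	int error = model_claim(bdev, p);

	if (error)
		return error;
	p->bdev = bdev;
	p->old_block_size = bdev->block_size;
	bdev->block_size = page_size;
	return 0;
}

/* The swapoff(2) side: revert the block size, then drop the claim. */
static void model_swapoff(struct model_swap *p)
{
	p->bdev->block_size = p->old_block_size;
	model_unclaim(p->bdev, p);
	p->bdev = NULL;
}
```

In this model a second model_swapon() on the same device fails with -EBUSY
until the first user has gone through model_swapoff(), which is the behaviour
the extra exclusive open is buying us today.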

For that matter, we could add an explicit "unclaim" primitive - might
be easier to follow. That would add another example where that could
be used - in blkdev_bszset() we have an opened block device (it's an
ioctl, after all), we want to change block size and we *really* don't
want to have that happen under a mounted filesystem. So if it's not
opened exclusive, we do a temporary exclusive open of our own and act on
that instead. Might as well go for a temporary claim...

BTW, what happens if two threads call ioctl(fd, BLKBSZSET, &n)
for the same descriptor that happens to have been opened O_EXCL?
Without O_EXCL they would've been unable to claim the sucker at the same
time - the holder we are using is the address of a function argument,
i.e. something that points to kernel stack of the caller. Those would
conflict and we either get set_blocksize() calls fully serialized, or
one of the callers would eat -EBUSY. Not so in "opened with O_EXCL"
case - they can very well overlap and IIRC set_blocksize() does *not*
expect that kind of crap... It's all under CAP_SYS_ADMIN, so it's not
as if it was a meaningful security hole anyway, but it does look fishy.
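
The difference between the two cases can be modelled in userspace as follows.
This is a sketch of the reasoning above, not of the real ioctl code: the
names model_bdev, model_claim(), model_unclaim() and model_bszset_enter() are
hypothetical, and the "threads" are simply two calls with different stack
addresses as holders.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Userspace model with made-up names; not the kernel implementation. */
struct model_bdev {
	const void *holder;	/* current exclusive holder, NULL if none */
	int holders;		/* number of claims taken by that holder */
};

static int model_claim(struct model_bdev *b, const void *holder)
{
	assert(holder != NULL);
	if (b->holder && b->holder != holder)
		return -EBUSY;	/* somebody else holds it */
	b->holder = holder;
	b->holders++;
	return 0;
}

static void model_unclaim(struct model_bdev *b, const void *holder)
{
	assert(b->holder == holder && b->holders > 0);
	if (--b->holders == 0)
		b->holder = NULL;
}

/*
 * Entry into the block size change. Returns 0 if the caller may go on
 * to set the block size; *claimed says whether a temporary claim was
 * taken and has to be dropped afterwards.
 */
static int model_bszset_enter(struct model_bdev *b, int opened_excl,
			      const void *stack_holder, int *claimed)
{
	*claimed = 0;
	if (opened_excl)
		return 0;	/* nothing here keeps a second caller out */
	if (model_claim(b, stack_holder))
		return -EBUSY;
	*claimed = 1;
	return 0;
}
```

With opened_excl == 0 the second caller gets -EBUSY while the first still
holds its temporary claim; with opened_excl == 1 both callers get 0 and
nothing in the model stops them from changing the block size concurrently.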