LinuxLists.cc - RFC: Ioctl v2

2022-05-23 08:06:52

Subject: RFC: Ioctl v2

At LSF we had our annual talk about all the ways ioctls suck. It got me thinking
about a proposal for a simple lightweight replacement.

Problems with ioctls:

* There's no namespacing, and ioctl numbers clash

Ioctl numbers clashing isn't a _huge_ issue in practice because you'll only have
so many chunks of code handling ioctls for the same FD (VFS, filesystem or
driver) and because ioctl struct size is also dispatched on, but it is pretty
gross - there's nothing preventing different drivers from picking the same ioctl
numbers. And since we've got one byte for the "namespace" and another byte for
the ioctl number, and according to my grep 7k different ioctls, betcha this
happens somewhere - I think Luis had an example at LSF.

Where the lack of real namespacing bites us more is when ioctls get promoted
from filesystem or driver on up. Ted had a good example of an ext2 ioctl getting
promoted to the VFS when it really shouldn't have, because it was exposing ext2
specific data structures.

But because this is as simple as changing a #define EXT2_IOC to #define FS_IOC,
it's really easy to do without adequate review - you don't have to change the
ioctl number and break userspace, so why would you?

Introducing real namespacing would mean that promoting an ioctl to the VFS level
would really have to be a new ioctl, and it'll get people to think more about
what the new ioctl would be.

* The calling convention sucks

With ioctls, you have to define a struct for your parameters, and struct members
might be used for inputs, or outputs, or both.

The problem is, these structs really need to be fully portable the same way
structs defined for on disk formats have to be, and we've got no way of checking
for this. This is a real minefield: if you need to pass a pointer type, you
can't pass a pointer because sizeof(void *) is different (and kernel space might
be 64 bit, with userspace 32 bit or 64 bit) - and you can't pass a ulong either,
it has to be a u64.

The whole "define a struct for your parameters" was a hack and a bad idea.
Ioctls are just function calls - they're driver-private syscalls - and they
should work like function calls.

IOCTL v2 proposal:

* Namespacing

To solve the namespacing issue, I want to steal an approach I've seen from
OpenGL, where extensions are namespaced: an extension will be referenced by name
where the name is vendor.foo, and when an extension becomes standard it gets a
new name (arb.foo instead of nvidia.foo, I think? it's been awhile).

To do this we'll need to define ioctls by name, not by hardcoded number, and
likewise userspace will have to call ioctls by name, not by number. To avoid a
string lookup on every ioctl call, I propose a new syscall

int sys_get_ioctl_nr(char *name)

And then userspace will just call this once for every ioctl it uses, either at
program startup or lazily when an ioctl is first called. This can all be nicely
hidden in a little wrapper library.

We'll want to randomize ioctl numbers in kernel space, to ensure userspace
_can't_ hard code them.

Also, another thing that came up at LSF was introspection, it's hard for
strace() et al to handle ioctls. Implementing this name -> nr mapping will give
us a registry of ioctls supported on a given kernel which we can make available
in /proc; and while we're at it, why not include the prototype too?

* Better calling convention

ioctls are just private syscalls. Syscalls look like normal function calls, why
can't ioctls? Some ioctls do complicated things that require defining structs
with all the tricky layout rules that we kernel devs have all had beaten into
our brains - but most probably would not, if we could do normal-looking function
calls.

Well, syscalls do require arch specific code to handle calling conventions, and
we don't want that. What I propose doing is having the underlying syscall be

#define IOCTL_MAXARGS 8

struct ioctl_args {
__u64 args[IOCTL_MAXARGS];
};

int sys_ioctl_v2(int fd, int ioctl_nr, struct ioctl_args __user *args)

Userspace won't call this directly. Userspace will call normal looking
functions, like:

int bcachefs_ioctl_disk_add(int fd, unsigned flags, char __user *disk_path);

Which will be a wrapper that casts the function arguments to u64s (or s64s for
signed integers, so that we don't have surprises when kernel space and user
space disagree about sizeof(long)) and then does the actual syscall.

------------------

I want to circulate this and get some comments and feedback, and if no one
raises any serious objections - I'd love to get collaborators to work on this
with me. Flame away!

2022-05-26 03:04:41

by Andrew Lunn

[permalink] [raw]

Subject: Re: RFC: Ioctl v2

> Actually, I have one in bcachefs that might fit better into the netlink bucket -
> maybe while I've got your attention you could tell me what this is like in
> netlink land.
>
> In bcachefs, we have "data jobs", where userspace asks us to do something that
> requires walking data and performing some operation on them - this is used for
> manual rebalance, evacuating data off a device, scrub (when that gets
> implemented), etc.
>
> The way I did this was with an ioctl that takes as a parameter the job to
> perform, then it kicks off a kernel thread to do the work and returns a file
> descriptor, which userspace reads from to find out the current status of the job
> (which it uses to implement a progress indicator). We kill off the kthread if
> the file descriptor is closed, meaning ctrl-c works as expected.
>
> I really like how this turned out, it's not much code and super slick - I was
> considering abstracting it out as generic functionality. But this definitely
> sounds like what netlink is targeted at - thoughts?

What is tricky with networking, is that it has a Big Lock, the
RTNL. All ioctl and netlink operations are performed while holding
this lock. So you cannot do an operation which takes a while.

But i implemented something similar to what you want a couple of years
ago. Ethernet cable testing. It is split into a couple of netlink
messages. There is one to initiate the cable test, and you can pass
some parameters. If the Ethernet PHY supports it, you get back an
immediate ACK, or an error messages, with probably -EOPNOTSUP, or
-EINVAL. This is all done with the RTNL held, the lock being released
after the reply.

The PHY then actually starts doing the cable test. I can take from a
couple of seconds, to 10-20 seconds, depending on exactly how it is
implemented, how fast the PHY is etc.

Once the PHY has finished, it broadcasts a report of the cable test to
userspace. Any process can receive this. So the invoking ethtool
--cable-test eth42 process waits around for it and dumps the test
results.

broadcasting messages is a big part of netlink. 'ip monitor' will
receive all these broadcasts and decode them. So you get to see routes
added/remove, ARP resolutions, interfaces going up and down. etc.

Your ctrl-c handling does not exist, as far as i know. With cable
testing, it runs to completion and makes the report. It could be there
is nobody listening. At least for some PHYs you cannot abort a cable
test once started, so ctrl-c might not even make sense.

There is a video of my LPC talk online somewhere. But 1/2 of it is
physics, how cable testing actually works. There is some details of
the netlink part and how the PHY state machine works.

Andrew

2022-05-26 18:41:36

by Kent Overstreet

[permalink] [raw]

Subject: Re: RFC: Ioctl v2

On Fri, May 20, 2022 at 07:45:39PM -0400, Theodore Ts'o wrote:
> On Fri, May 20, 2022 at 12:16:52PM -0400, Kent Overstreet wrote:
> >
> > Where the lack of real namespacing bites us more is when ioctls get promoted
> > from filesystem or driver on up. Ted had a good example of an ext2 ioctl getting
> > promoted to the VFS when it really shouldn't have, because it was exposing ext2
> > specific data structures.
> >
> > But because this is as simple as changing a #define EXT2_IOC to #define FS_IOC,
> > it's really easy to do without adequate review - you don't have to change the
> > ioctl number and break userspace, so why would you?
> >
> > Introducing real namespacing would mean that promoting an ioctl to the VFS level
> > would really have to be a new ioctl, and it'll get people to think more about
> > what the new ioctl would be.
>
> It's not clear that making harder not to break userspace is a
> *feature*. If existing programs are using a particular ioctl
> namespace, being able to have other file systems adopt it has
> historically been considered a *feature* not a *bug*.
>
> At the time, we had working utilities, chattr and lsattr, which were
> deployed on all Linux distributions, and newer file systems, such as
> xfs, reiserfs, btrfs, etc., decided they wanted to piggy-back on those
> existing utilities. Forcing folks to deploy new utilities just
> because it's the best way to force "adequate review" might be swinging
> the pendulum too far in the straight-jacket direction.

But back in those days, users updating those core utilities was a much bigger
hassle. That's changing, we don't really have users compiling from source
anymore, and we're slowly but steadily moving towards the rolling releases
world: slackware -> debian -> nixos.

And on the developer side of things, historically developers have worked on a
few packages where they were comfortable with the process, and packages tended
more towards giant monorepos, but the younger generation is more used to working
across multiple packages/repositories as necessary (this is where github has
been emphatically a good thing, despite being proprietary; it's standardized a
lot of the friction-y "how do I submit to this repo" stuff).

So I understand where you're coming from but I think this is worth rethinking.

Additionally, bikeshedding gets really painful when people are trying to plan
for all eventualities and think too far ahead. Having multiple stages of review
is _helpful_ with this. If an ioctl is filesystem-private, it's perfectly fine
for it embed filesystem specific data structures - if we can ensure that that
won't get lifted to the VFS layer without anyone noticing!

> ...
>
> In the case of extended attributes, we had perfectly working userspace
> tools that would have ***broken*** if we adhered to a doctrinaire,
> when you promote an interface, we break the userspace interface Just
> Because it's the Good Computer Science Thing To Do.

Not broken, though - it just would've needed updating to support additional
filesystems, and when the ioctls don't need changing the patches would be
trivial:

ret = ioctl_xfs_goingdown(..);
becomes
ret = ioctl_fs_goingdown(...) ?:
ioctl_xfs_goingdown(...);

(I'm the only one I know who does that chaining syntax in C, but I like it :)

> So this approach requires that someone has to actually implement the
> wrapper library. Who will that be? The answer could be, "let libc do
> it", but then we need to worry about all the C libraries out there
> actually adopting the new ioctl, which takes a long time, and
> historically, some C library maintainers have had.... opinionated
> points of view about the sort of "value add that should be done at the
> C Library level".

Not libc, and we definitely don't want to have to update that library for every
new ioctl - I'm imagining that library just being responsible for the "query
kernel for ioctl numbers" part, the ioctl definitions themselves will still come
from kernel headers.

> For example, I have an ext2fs library function
> ext2fs_get_device_size2(), which wraps not only the BLKGETSIZE and
> BLKGETSIZE64 ioctls, but also the equivalents for Apple Darwin
> (DKIOCGETBLOCKCOUNT), FreeBSD/NetBSD (DIOCGDINFO and later
> DIOCGMEDIASIZE), and the Window's DeviceIoControl()'s
> IOCTL_DISK_GET_DRIVE_GEOMETRY call. The point is that wrapper
> functions are very much orthogonal to the ioctl interface; we're all
> going to have wrapper functions, and we'll create them where they are
> needed.

This seems unrelated to the ioctl v2 discussion, but - it would be _great_ if we
could get that in a separate repository where others could make use of it :)

2022-06-09 22:30:27

by Jan Engelhardt

[permalink] [raw]

Subject: Re: RFC: Ioctl v2

On Friday 2022-05-20 18:16, Kent Overstreet wrote:
>Problems with ioctls:
>
>* There's no namespacing, and ioctl numbers clash
>
>IOCTL v2 proposal:
>
>* Namespacing
>
>To solve the namespacing issue, I want to steal an approach I've seen from
>OpenGL, where extensions are namespaced: an extension will be referenced by name
>where the name is vendor.foo, and when an extension becomes standard it gets a
>new name (arb.foo instead of nvidia.foo, I think? it's been awhile).
>To do this we'll need to define ioctls by name, not by hardcoded number, [...]

https://www.khronos.org/registry/vulkan/specs/1.3-extensions/man/html/VK_NV_glsl_shader.html
Right there on the front matter: "Registered number #13".

https://www.khronos.org/registry/vulkan/specs/1.3/styleguide.html
"All extensions must be registered with Khronos."
"Each extension is assigned a range of values that can be used to create globally-unique enum
values"