When I posted the last round of ipath driver code for review, people
objected to the number of ioctls we had. I'd like to get feedback on
what would be acceptable replacements.
We have four kinds of ioctl right now:
* Interfacing with userspace
* Infiniband subnet management
* Flash/EEPROM management
* Diagnostics
There are currently 36 ioctls in total. I think that I can reduce this
number dramatically, but we're having some contentious internal debate
about whether and how some of the ioctls should be replaced. I'd like
to see what's most likely to get accepted. Obviously, we'd prefer the
number to be zero, but I don't think we can do that without submitting a
driver that isn't very useful.
Unless I indicate otherwise, I cannot think of clean replacements for
the ioctls listed below, and would appreciate suggestions.
For user access:
Opening the /dev/ipath special file assigns an appropriate free
unit (chip) and port (context on a chip) to a user process.
Think of it as similar to /dev/ptmx for ttys, except there isn't
a devpts-like filesystem behind it. Once a process has
opened /dev/ipath, it needs to find out which unit and port it
has opened, so that it can access other attributes in /sys. To
do this, we provide a GETPORT ioctl.
USERINIT and BASEINFO work with mmap to set up direct access to
the hardware for user processes. We intend to turn these into a
single ioctl, USERINIT. This copies a substantial amount of
information to and from userspace.
RCVCTRL enables/disables receipt of packets.
SET_PKEY sets a partition key, essentially telling hardware
which packets are interesting to userspace.
UPDM_TID and FREE_TID are used for RDMA context management.
WAIT waits for incoming packets, and can clearly be replaced by
file_ops->poll.
GETCOUNTERS, GETUNITCOUNTERS and GETSTATS can all be replaced by
files in sysfs.
For subnet management:
GETLID, SET_LID, SET_MTU, SET_GUID, SET_MLID, GET_MLID,
GET_DEVSTATUS, GET_PORTINFO and GET_NODEINFO can all be replaced
by files in sysfs.
SET_LINKSTATE changes the link state.
SEND_SMA_PKT and RCV_SMA_PKT send and receive subnet management
packets. I *think* they could be replaced by read and write
methods on a new special file, although the semantics aren't a
super-clean match.
For EEPROM/flash management:
READ_EEPROM reads the flash. WRITE_EEPROM writes it. I don't
see a standard way of doing this in the kernel; many drivers
provide their own private ioctls, some on dedicated special
files. I think that using read and write instead would be okay
(with a small qualm about semantics), but this idea makes an
influential coworker barf violently. I can't see how we could
use the ethtool flash interface: the low-level driver doesn't
look like a regular net device, and we support partial updates
of the flash.
For diagnostics:
DIAGENTER and DIAGLEAVE put the driver into and out of diag
mode. These could be replaced by open/close of a special file.
DIAGREAD and DIAGWRITE perform direct accesses to the device's
PCI memory space. I think these could be replaced by read and
write, but they are again subject to the make-coworker-barf
problem.
HTREAD and HTWRITE perform direct accesses to the device's PCI
config space. Same disagreement problem as DIAGREAD and
DIAGWRITE.
SEND_DIAG_PKT can be replaced with whatever sends and receives
subnet management packets, as above.
DIAG_RD_I2C is synonymous with READ_EEPROM, and will go away.
Depending on how you look at it, we can slim our list of ioctls down to
somewhere between 6 and 10. This isn't zero, but it's not 36, either.
What do people think?
From: Bryan O'Sullivan <[email protected]>
Date: Wed, 18 Jan 2006 16:43:31 -0800
> Obviously, we'd prefer the number to be zero, but I don't think we
> can do that without submitting a driver that isn't very useful.
You can use an interface such as netlink for device configuration.
It can do better type checking, can be used by generic tools, and
some day soon will be transferable over the wire so that one can
perform remote configuration changes.
Let's let ioctl()'s go the way of the cave man. It's one of the
worst-designed interfaces under UNIX :)
On Wed, Jan 18, 2006 at 04:43:31PM -0800, Bryan O'Sullivan wrote:
> For EEPROM/flash management:
>
> READ_EEPROM reads the flash. WRITE_EEPROM writes it. I don't
> see a standard way of doing this in the kernel; many drivers
> provide their own private ioctls, some on dedicated special
> files. I think that using read and write instead would be okay
> (with a small qualm about semantics), but this idea makes an
> influential coworker barf violently. I can't see how we could
> use the ethtool flash interface: the low-level driver doesn't
> look like a regular net device, and we support partial updates
> of the flash.
Use the firmware subsystem for this. It uses sysfs, so no ioctl is
needed at all.
thanks,
greg k-h
On Wed, 2006-01-18 at 16:48 -0800, David S. Miller wrote:
> You can use an interface such as netlink for device configuration.
> It can do better type checking, can be used by generic tools, and
> some day soon will be transferable over the wire so that one can
> perform remote configuration changes.
That looks doable, but to my eyes, the netlink interface looks both more
cumbersome and less reliable than ioctl. At least it apparently lets us
do arbitrarily peculiar things :-)
On Wed, 2006-01-18 at 16:53 -0800, Greg KH wrote:
> Use the firmware subsystem for this. It uses sysfs, so no ioctl is
> needed at all.
OK. Would I be correct in thinking that drivers/firmware/dcdbas.c is a
reasonable model implementation to follow?
From: Bryan O'Sullivan <[email protected]>
Date: Wed, 18 Jan 2006 17:14:16 -0800
> That looks doable, but to my eyes, the netlink interface looks both
> more cumbersome and less reliable than ioctl. At least it
> apparently lets us do arbitrarily peculiar things :-)
It's going to give you strict typing, and extensible attributes for
the configuration attributes you define. So if you determine later
"oh we need to add this knob for changing X" you can do that without
breaking the existing interface. With ioctl() that is usually
impossible or unreasonably hard to accomplish.
Try not to get discouraged, give it a shot :)
On Wed, Jan 18, 2006 at 04:43:31PM -0800, Bryan O'Sullivan wrote:
> Opening the /dev/ipath special file assigns an appropriate free
> unit (chip) and port (context on a chip) to a user process.
Shouldn't you just open the proper chip device and port device itself?
That drops one ioctl.
> Think of it as similar to /dev/ptmx for ttys, except there isn't
> a devpts-like filesystem behind it. Once a process has
> opened /dev/ipath, it needs to find out which unit and port it
> has opened, so that it can access other attributes in /sys. To
> do this, we provide a GETPORT ioctl.
> USERINIT and BASEINFO work with mmap to set up direct access to
> the hardware for user processes. We intend to turn these into a
> single ioctl, USERINIT. This copies a substantial amount of
> information to and from userspace.
Why not just use mmap? What's the special needs?
> RCVCTRL enables/disables receipt of packets.
sysfs file.
> SET_PKEY sets a partition key, essentially telling hardware
> which packets are interesting to userspace.
sysfs file.
> UPDM_TID and FREE_TID are used for RDMA context management.
sysfs files.
> WAIT waits for incoming packets, and can clearly be replaced by
> file_ops->poll.
Use poll.
> GETCOUNTERS, GETUNITCOUNTERS and GETSTATS can all be replaced by
> files in sysfs.
good.
> For subnet management:
>
> GETLID, SET_LID, SET_MTU, SET_GUID, SET_MLID, GET_MLID,
> GET_DEVSTATUS, GET_PORTINFO and GET_NODEINFO can all be replaced
> by files in sysfs.
>
> SET_LINKSTATE changes the link state.
>
> SEND_SMA_PKT and RCV_SMA_PKT send and receive subnet management
> packets. I *think* they could be replaced by read and write
> methods on a new special file, although the semantics aren't a
> super-clean match.
Use netlink for subnet stuff.
> For diagnostics:
>
> DIAGENTER and DIAGLEAVE put the driver into and out of diag
> mode. These could be replaced by open/close of a special file.
Use debugfs.
> DIAGREAD and DIAGWRITE perform direct accesses to the device's
> PCI memory space. I think these could be replaced by read and
> write, but they are again subject to the make-coworker-barf
> problem.
Use debugfs.
> HTREAD and HTWRITE perform direct accesses to the device's PCI
> config space. Same disagreement problem as DIAGREAD and
> DIAGWRITE.
Use the pci sysfs config files, don't duplicate existing functionality.
> SEND_DIAG_PKT can be replaced with whatever sends and receives
> subnet management packets, as above.
netlink or debugfs.
Hope this helps,
greg k-h
On Wed, Jan 18, 2006 at 05:17:20PM -0800, Bryan O'Sullivan wrote:
> On Wed, 2006-01-18 at 16:53 -0800, Greg KH wrote:
>
> > Use the firmware subsystem for this. It uses sysfs, so no ioctl is
> > needed at all.
>
> OK. Would I be correct in thinking that drivers/firmware/dcdbas.c is a
> reasonable model implementation to follow?
No. Pick a driver that has a backing device, like the wireless drivers
that use it. That Dell bios driver has had more looney extensions than
I can shake a stick at...
thanks,
greg k-h
Greg KH <[email protected]> wrote:
>
Sorry for sticking my head in a beehive, but stand back and look at it:
> Shouldn't you just open the proper chip device and port device itself?
> Why not just use mmap? What's the special needs?
> sysfs file.
> Use poll.
> Use netlink for subnet stuff.
> Use debugfs.
> Use the pci sysfs config files, don't duplicate existing functionality.
> netlink or debugfs.
For a driver-bodging interface design, this is simply nutty.
And it makes the driver developer learn a pile of extra stuff and it
introduces lots of linkages everywhere and heaven knows what the driver's
userspace interface description ends up looking like.
ioctl() would have to be pretty darn bad to be worse than all this random
stuff.
Just saying...
On Wed, Jan 18, 2006 at 07:49:11PM -0800, Andrew Morton wrote:
> Greg KH <[email protected]> wrote:
> >
>
> Sorry for sticking my head in a beehive, but stand back and look at it:
>
> > Shouldn't you just open the proper chip device and port device itself?
> > Why not just use mmap? What's the special needs?
> > sysfs file.
> > Use poll.
> > Use netlink for subnet stuff.
> > Use debugfs.
> > Use the pci sysfs config files, don't duplicate existing functionality.
> > netlink or debugfs.
>
> For a driver-bodging interface design, this is simply nutty.
One can rightfully argue that they are doing some huge messy things, and
deserve the extra mess if they persist in trying to do it.
> And it makes the driver developer learn a pile of extra stuff and it
> introduces lots of linkages everywhere and heaven knows what the driver's
> userspace interface description ends up looking like.
>
> ioctl() would have to be pretty darn bad to be worse than all this random
> stuff.
It is. It's giving every driver writer the ability to create pretty
much as many different, new, and incompatible system calls as they like
directly into the kernel, making their driver "just a little different"
from every other type of driver. Do you really feel confident in
allowing this?
I sure do not.
But if they use the interfaces that are present in the kernel (sysfs,
debugfs, netlink, firmware interface), their driver will automatically
work with the already-written userspace tools and their driver will
usually not contain nasty bugs that show up on 64->32bit issues, and
security problems where every user can mess with things they should not
(like lots of ioctls have been known to have in the past.)
We are trying very hard here to make it easier on both the users and the
driver writers (that's why we wrote that infrastructure in the first
place.)
thanks,
greg k-h
On Wed, 2006-01-18 at 18:57 -0800, Greg KH wrote:
> Shouldn't you just open the proper chip device and port device itself?
> That drops one ioctl.
There isn't usually a "right" chip device and port. On a NUMA system,
you want to open the chip that is topologically closest to you, but
failing that, you want to open something that will at least work. You
may *also* want to be able to open a specific unit/port pair, but that
would not be the normal mode of operation.
The reason for doing this through a single open syscall, instead of
making userland try each appropriate device in turn, is the same as
why /dev/ptmx exists: it guarantees that userland can't do something
stupid or racy. The driver checks all units and ports under a single
mutex, so it doesn't have to retry to see if something got closed behind
its back, for example.
> Why not just use mmap? What's the special needs?
mmap just maps the hardware MMIO area into user memory. The ioctl (or
netlink message, or whatever it's going to be) does quite a lot more,
such as tell the chip where user buffers are.
> > RCVCTRL enables/disables receipt of packets.
>
> sysfs file.
>
> > SET_PKEY sets a partition key, essentially telling hardware
> > which packets are interesting to userspace.
>
> sysfs file.
>
> > UPDM_TID and FREE_TID are used for RDMA context management.
>
> sysfs files.
Really? Not netlink messages for these? Rightly, only the process
that has a unit/port open should be able to modify these; can I
enforce that through sysfs without jumping through too many hoops?
> Use netlink for subnet stuff.
OK.
> > For diagnostics:
> Use debugfs.
Ah, yes.
> Use the pci sysfs config files, don't duplicate existing functionality.
OK.
> Hope this helps,
Yes, it does. There's such a profusion of disconnected interfaces in
2.6 for driver authors to get their heads around, it is a big help to
get some directions through the thicket.
Thanks,
--
Bryan O'Sullivan <[email protected]>
On Wed, 2006-01-18 at 17:17 -0800, David S. Miller wrote:
> It's going to give you strict typing, and extensible attributes for
> the configuration attributes you define. So if you determine later
> "oh we need to add this knob for changing X" you can do that without
> breaking the existing interface.
Wow. OK, that is not immediately obvious from reading the code. The
only modules in drivers/ that seem to use netlink are iscsi, connector,
and w1. It's more extensive in net/, I see.
> Try not to get discouraged, give it a shot :)
It's not obvious what chunk of the tree is a good example to follow.
Just look what happened when I suggested to Greg that I use the Dell
firmware loader as an example :-)
The closest approximation I can find to documentation is something Neil
Horman wrote over a year ago:
http://people.redhat.com/nhorman/papers/netlink.pdf
And a "this module does a particularly natty job that all coders would
do well to emulate" pointer would be most welcome.
I notice that libnetlink appears to have disappeared without a trace,
along with Alexey.
--
Bryan O'Sullivan <[email protected]>
On Wed, Jan 18, 2006 at 09:02:37PM -0800, Bryan O'Sullivan wrote:
> On Wed, 2006-01-18 at 18:57 -0800, Greg KH wrote:
>
> > Shouldn't you just open the proper chip device and port device itself?
> > That drops one ioctl.
>
> There isn't usually a "right" chip device and port. On a NUMA system,
> you want to open the chip that is topologically closest to you, but
> failing that, you want to open something that will at least work. You
> may *also* want to be able to open a specific unit/port pair, but that
> would not be the normal mode of operation.
>
> The reason for doing this through a single open syscall, instead of
> making userland try each appropriate device in turn, is the same as
> why /dev/ptmx exists: it guarantees that userland can't do something
> stupid or racy. The driver checks all units and ports under a single
> mutex, so it doesn't have to retry to see if something got closed behind
> its back, for example.
Ok, that's fair enough. But if you want to do something like ptys, then
why not just have your own filesystem for this driver?
> > Why not just use mmap? What's the special needs?
>
> mmap just maps the hardware MMIO area into user memory. The ioctl (or
> netlink message, or whatever it's going to be) does quite a lot more,
> such as tell the chip where user buffers are.
Ok.
> > > UPDM_TID and FREE_TID are used for RDMA context management.
> >
> > sysfs files.
>
> Really? Not netlink messages for these? Rightly, only the process
> that has a unit/port open should be able to modify these; can I
> enforce that through sysfs without jumping through too many hoops?
I really don't know your application enough to be sure. If you want to
use netlink, that's fine too.
> Yes, it does. There's such a profusion of disconnected interfaces in
> 2.6 for driver authors to get their heads around, it is a big help to
> get some directions through the thicket.
Well, for 99% of the drivers, there is no problem, as there is already a
specified and documented way to interact (like network, tty, block,
etc.) You are just making your own type of special interface up as you
go, so the complexity is also there (this complexity would normally be
in some core code, which I am hoping that your code will turn into for
other devices of the same type, right?)
thanks,
greg k-h
On Wed, Jan 18, 2006 at 09:17:01PM -0800, Bryan O'Sullivan wrote:
> On Wed, 2006-01-18 at 17:17 -0800, David S. Miller wrote:
>
> > It's going to give you strict typing, and extensible attributes for
> > the configuration attributes you define. So if you determine later
> > "oh we need to add this knob for changing X" you can do that without
> > breaking the existing interface.
>
> Wow. OK, that is not immediately obvious from reading the code. The
> only modules in drivers/ that seem to use netlink are iscsi, connector,
> and w1. It's more extensive in net/, I see.
The attribute stuff is pretty new, and I do not think any code in
drivers/ uses it yet. But it is well documented in
include/net/netlink.h, have you looked at that?
> > Try not to get discouraged, give it a shot :)
>
> It's not obvious what chunk of the tree is a good example to follow.
> Just look what happened when I suggested to Greg that I use the Dell
> firmware loader as an example :-)
Well, it is good that you asked, far too many people do not. And others
wonder why we are so insistent on everyone doing things properly in all
parts of the kernel, it's because of this reason.
Which reminds me to go back and look at that dell driver again...
thanks,
greg k-h
On Wed, 2006-01-18 at 21:39 -0800, Greg KH wrote:
> Ok, that's fair enough. But if you want to do something like ptys, then
> why not just have your own filesystem for this driver?
If you think it's appropriate to implement a new filesystem to replace a
single ioctl that returns two integers, we can probably do that, but
more realistically, the GETPORT ioctl can probably live a long and
untroubled life as another netlink message.
> You are just making your own type of special interface up as you
> go, so the complexity is also there (this complexity would normally be
> in some core code, which I am hoping that your code will turn into for
> other devices of the same type, right?)
The most important chunk of likely common code I can see at the moment
is the stuff for bodging user page mappings that we got hammered over
already. The drivers/infiniband/ tree already has code that does
something like this, and a few other not-yet-in-tree network drivers
that support RDMA have similar needs, too.
"Bryan O'Sullivan" <[email protected]> writes:
> When I posted the last round of ipath driver code for review, people
> objected to the number of ioctls we had. I'd like to get feedback on
> what would be acceptable replacements.
Roland, you know the RDMA model best: are things so tied to the current
crop of infiniband protocols that what the ipath code wants to
do is not covered?
They clearly need subsystem support and what they are trying to do
either isn't covered or they don't see how to use what is there.
Do the infiniband verbs not allow dealing with an unreliable datagram
protocol?
> We have four kinds of ioctl right now:
>
> * Interfacing with userspace
> * Infiniband subnet management
> * Flash/EEPROM management
> * Diagnostics
>
> There are currently 36 ioctls in total. I think that I can reduce this
> number dramatically, but we're having some contentious internal debate
> about whether and how some of the ioctls should be replaced. I'd like
> to see what's most likely to get accepted. Obviously, we'd prefer the
> number to be zero, but I don't think we can do that without submitting a
> driver that isn't very useful.
>
> Unless I indicate otherwise, I cannot think of clean replacements for
> the ioctls listed below, and would appreciate suggestions.
>
> For user access:
>
> Opening the /dev/ipath special file assigns an appropriate free
> unit (chip) and port (context on a chip) to a user process.
> Think of it as similar to /dev/ptmx for ttys, except there isn't
> a devpts-like filesystem behind it. Once a process has
> opened /dev/ipath, it needs to find out which unit and port it
> has opened, so that it can access other attributes in /sys. To
> do this, we provide a GETPORT ioctl.
We need some generic subsystem support to do this. If the kernel
ib/rdma support is not enough to do this we need to build something.
Dealing with NUMA affinity should not be something drivers need to
invent.
> USERINIT and BASEINFO work with mmap to set up direct access to
> the hardware for user processes. We intend to turn these into a
> single ioctl, USERINIT. This copies a substantial amount of
> information to and from userspace.
I'm not certain but the concept sounds generic even if the information
is not. This sounds like a job for the ib/rdma/kernel-bypass networking
subsystem.
> RCVCTRL enables/disables receipt of packets.
Again this is a generic problem, and the generic interfaces are broken
if you can't do this. I know the linux network stack already provides
this.
> SET_PKEY sets a partition key, essentially telling hardware
> which packets are interesting to userspace.
I'm pretty certain this should be something that should be set
at open time.
> UPDM_TID and FREE_TID are used for RDMA context management.
>
> WAIT waits for incoming packets, and can clearly be replaced by
> file_ops->poll.
>
> GETCOUNTERS, GETUNITCOUNTERS and GETSTATS can all be replaced by
> files in sysfs.
This whole section just cries out for a network/rdma/ib/kernel-bypass
layer that any interesting network driver can use.
A device driver should not need to invent the interfaces for this
kind of functionality.
> For subnet management:
>
> GETLID, SET_LID, SET_MTU, SET_GUID, SET_MLID, GET_MLID,
> GET_DEVSTATUS, GET_PORTINFO and GET_NODEINFO can all be replaced
> by files in sysfs.
>
> SET_LINKSTATE changes the link state.
>
> SEND_SMA_PKT and RCV_SMA_PKT send and receive subnet management
> packets. I *think* they could be replaced by read and write
> methods on a new special file, although the semantics aren't a
> super-clean match.
Infiniband stack: it's there, use it.
If the Infiniband stack is too ugly to use or it is missing features
then we need to fix it. So please explain why you are having
a hard time using the in-kernel infiniband stack for this.
> For EEPROM/flash management:
>
> READ_EEPROM reads the flash. WRITE_EEPROM writes it. I don't
> see a standard way of doing this in the kernel; many drivers
> provide their own private ioctls, some on dedicated special
> files. I think that using read and write instead would be okay
> (with a small qualm about semantics), but this idea makes an
> influential coworker barf violently. I can't see how we could
> use the ethtool flash interface: the low-level driver doesn't
> look like a regular net device, and we support partial updates
> of the flash.
There are a couple of choices here, off the top of my head.
Having your driver support an i2c device, having it export an
mtd device, and ethtool are the most standard. Partly it depends
on what you are trying to do.
Partial updates are not a problem. Just keep a cached copy and only
write to those bytes that have changed.
> For diagnostics:
>
> DIAGENTER and DIAGLEAVE put the driver into and out of diag
> mode. These could be replaced by open/close of a special file.
This one does sound global to a device and a trivial parameter.
sysfs does sound like the proper interface here. That makes it
script controllable etc.
> DIAGREAD and DIAGWRITE perform direct accesses to the device's
> PCI memory space. I think these could be replaced by read and
> write, but they are again subject to the make-coworker-barf
> problem.
mmap(/dev/mem)
There is also an interface in /proc or /sys (I forget which)
that lets you select the individual BAR for a pci device.
You don't need to do anything in your driver to support this.
> HTREAD and HTWRITE perform direct accesses to the device's PCI
> config space. Same disagreement problem as DIAGREAD and
> DIAGWRITE.
Again. This is generic functionality already provided by the kernel,
no need to implement anything. lspci/setpci already handle this quite
well.
> SEND_DIAG_PKT can be replaced with whatever sends and receives
> subnet management packets, as above.
>
> DIAG_RD_I2C is synonymous with READ_EEPROM, and will go away.
>
> Depending on how you look at it, we can slim our list of ioctls down to
> somewhere between 6 and 10. This isn't zero, but it's not 36, either.
> What do people think?
It's getting there. :)
Eric
From: [email protected] (Eric W. Biederman)
Date: Thu, 19 Jan 2006 01:25:39 -0700
> mmap(/dev/mem)
> There is also an interface in /proc or /sys (I forget which)
> that lets you select the individual BAR for a pci device.
> You don't need to do anything in your driver to support this.
Yes, please use /proc/bus/pci/* device file mmap()s, or even
better, the PCI ones under /sys work too.
I think libpci even has some help for this.
On Thu, 2006-01-19 at 01:25 -0700, Eric W. Biederman wrote:
> Do the infiniband verbs not allow dealing with an unreliable datagram
> protocol?
Eric, I think you are misunderstanding what we are actually trying to
do. We already implement IB verbs and the various IB networking
protocols in our drivers, at a layer that is not at all related to the
one that is currently festooned with ioctls.
The ioctl discussion pertains to lower-level direct user access to the
hardware, for a protocol that bypasses the entire IB stack and just
happens to send UD-compliant datagrams over the wire.
I'm actually pretty satisfied with the feedback I've already gotten from
Greg K-H and davem.
> We need some generic subsystem support to do this.
I am more than happy to put together generic support, provided I see
other drivers that could take advantage of it being considered for
submission. Right now, I do not - in general - see this happening.
I know that some other drivers need to do user page pinning, and I'm
happy to try to find a generic solution that is common to IB and drivers
unrelated to IB.
> > RCVCTRL enables/disables receipt of packets.
>
> Again this is a generic problem, and the generic interfaces are broken
> if you can't do this.
The SIOCSIFFLAGS ioctl, which I assume is the generic interface you
refer to (it's the one used by iproute, at any rate), has poor overlap
with what we need (it supports a pile of stuff that we don't care about,
and we require a pile of stuff it doesn't support), and I don't feel
inclined to try using it in any case.
> > SET_PKEY sets a partition key, essentially telling hardware
> > which packets are interesting to userspace.
>
> I'm pretty certain this should be something that should be set
> at open time.
It might be possible to make it fit into whatever replaces USERINIT, or
else we can use a netlink message of its own.
> > UPDM_TID and FREE_TID are used for RDMA context management.
> >
> > WAIT waits for incoming packets, and can clearly be replaced by
> > file_ops->poll.
> >
> > GETCOUNTERS, GETUNITCOUNTERS and GETSTATS can all be replaced by
> > files in sysfs.
>
> This whole section just cries out for a network/rdma/ib/kernel-bypass
> layer that any interesting network driver can use.
No, it doesn't. Our chip's approach to remote memory access doesn't
even slightly resemble that of other comparable chips. In addition, our
counters are entirely device-specific, and I'm already planning to move
them to sysfs. The sysfs move gets them out of ioctl-land, and there's
no point in trying to do anything beyond that.
> Infiniband stack: it's there, use it.
No. If you're running a full IB stack, we provide the usual IB subnet
management facilities, and you can run OpenSM to manage your subnet. If
you're *not*, which is the case I'm concerned with here, it makes no
sense to replicate the byzantine IB management interfaces in order to do
a handful of simple things that aren't even tied to the higher-level IB
protocols.
> There are a couple of choices here.
Yes, we'll use the firmware interface, as Greg suggested.
> There is also an interface in /proc or /sys I forget which
> that let's you select the individual bar for a pci device.
Yes, we'll use that.
Thanks for your comments.
"Bryan O'Sullivan" <[email protected]> writes:
> On Thu, 2006-01-19 at 01:25 -0700, Eric W. Biederman wrote:
>
>> Do the infiniband verbs not allow dealing with an unreliable datagram
>> protocol?
>
> Eric, I think you are misunderstanding what we are actually trying to
> do. We already implement IB verbs and the various IB networking
> protocols in our drivers, at a layer that is not at all related to the
> one that is currently festooned with ioctls.
>
> The ioctl discussion pertains to lower-level direct user access to the
> hardware, for a protocol that bypasses the entire IB stack and just
> happens to send UD-compliant datagrams over the wire.
I'm surprised. I didn't think your native datagrams were compliant
above the link level with any of the IB protocols in the kernel.
In any case that is not what I am saying. I am saying that I think
that if the IB/rdma/networking layer does not do a good job of
supporting you it is a failure there. Your driver looks ugly because
there is not a sufficiently good helper layer. For high performance
non-IP targeted networking cards you aren't doing anything terribly
exotic. Could you please detail why the IB/rdma helper layer
is insufficient to do what you need?
If it is byzantine and heavy weight that concern needs to be
addressed. I agree the normal software stack is pretty tall.
> I'm actually pretty satisfied with the feedback I've already gotten from
> Greg K-H and davem.
>
>> We need some generic subsystem support to do this.
>
> I am more than happy to put together generic support, provided I see
> other drivers that could take advantage of it being considered for
> submission. Right now, I do not - in general - see this happening.
Right now it largely seems to be a chicken and the egg problem.
There is a large portion of the HPC community that doesn't believe
they are interesting to the rest of the world or that the rest of
the world is interesting to them, so they do their own thing, leading
to support problems.
There are other drivers for Linux right now whose vendors are not
too concerned about keeping their code closed source. I can think
of at least 3 other networking fabrics out there. Heck, the kernel
already has a myrinet driver in it.
I also know there is another infiniband adapter that only provides
raw packet access like yours does.
I'm sick and tired of drivers having to invent all of the user space
glue elements for HPC.
> I know that some other drivers need to do user page pinning, and I'm
> happy to try to find a generic solution that is common to IB and drivers
> unrelated to IB.
Which is the RDMA thing. And looking at the code, I don't see how
>> > RCVCTRL enables/disables receipt of packets.
>>
>> Again this is a generic problem, and the generic interfaces are broken
>> if you can't do this.
>
> The SIOCSIFFLAGS ioctl, which I assume is the generic interface you
> refer to (it's the one used by iproute, at any rate), has poor overlap
> with what we need (it supports a pile of stuff that we don't care about,
> and we require a pile of stuff it doesn't support), and I don't feel
> inclined to try using it in any case.
But SIOCSIFFLAGS is not implemented by a driver. It is implemented
by the networking subsystem. It requires a network device to make
sense in any case.
>> > SET_PKEY sets a partition key, essentially telling hardware
>> > which packets are interesting to userspace.
>>
>> I'm pretty certain this should be something that should be set
>> at open time.
>
> It might be possible to make it fit into whatever replaces USERINIT, or
> else we can use a netlink message of its own.
>
>> > UPDM_TID and FREE_TID are used for RDMA context management.
>> >
>> > WAIT waits for incoming packets, and can clearly be replaced by
>> > file_ops->poll.
>> >
>> > GETCOUNTERS, GETUNITCOUNTERS and GETSTATS can all be replaced by
>> > files in sysfs.
>>
>> This whole section just cries out for a network/rdma/ib/kernel-by-pass
>> layer that is that any interesting network driver can use.
>
> No, it doesn't. Our chip's approach to remote memory access doesn't
> even slightly resemble that of other comparable chips. In addition, our
> counters are entirely device-specific, and I'm already planning to move
> them to sysfs. The sysfs move gets them out of ioctl-land, and there's
> no point in trying to do anything beyond that.
Agreed, counters and sysfs are a good match. But the generic
networking layer already has support for counters that are different
for every device. That helper really needs to export those counters
to sysfs as well as ethtool but the support already exists for more
typical networking.
The problem actually gets pretty simple when you need to design an
interface to support generic kernel-bypass using arbitrary
protocols. There are so few things in common that the things which
are in common stick out.
>> Infiniband stack, it's there use it.
>
> No. If you're running a full IB stack, we provide the usual IB subnet
> management facilities, and you can run OpenSM to manage your subnet. If
> you're *not*, which is the case I'm concerned with here, it makes no
> sense to replicate the byzantine IB management interfaces in order to do
> a handful of simple things that aren't even tied to the higher-level IB
> protocols.
Is it the stack that is byzantine? Or the interface to it?
What I'm thinking ultimately is that there should be something about as
simple as af_packet in the kernel (but at the IB/rdma layer) that
gives you the help you need.
>> There are a couple of choices here.
>
> Yes, we'll use the firmware interface, as Greg suggested.
I will have to look. That one doesn't sound familiar... Do
we really have 4 wheels in the kernel?
>> There is also an interface in /proc or /sys I forget which
>> that let's you select the individual bar for a pci device.
>
> Yes, we'll use that.
>
> Thanks for your comments.
Welcome, and thanks for your patience with this process.
Eric
On Thu, 2006-01-19 at 11:20 -0700, Eric W. Biederman wrote:
> For high performance
> non-IP targeted networking cards you aren't doing anything terribly
> exotic.
True.
> Could you please detail why the IB/rdma helper layer is
> insufficient to do what you need?
There really isn't an RDMA helper layer. The fact that the IB headers
live in include/rdma is, as best as I can tell, an artefact of Roland
being accommodating to someone's suggestion when he was going through
the same process with the IB tree as we are now with our driver.
> Right now it largely seems to be a chicken-and-egg problem.
> There is a large portion of the HPC community that doesn't believe
> they are interesting to the rest of the world, or that the rest of
> the world is interesting to them, so they do their own thing, leading
> to support problems.
I can't solve that problem. If other vendors don't want to pony up
their driver source and take the same kinds of slings and arrows I'm
doing, I'm not going to do the work to provide them with a generic set
of abstractions to use in their out-of-tree or proprietary drivers.
> Which is the RDMA thing. And looking at the code and I don't see how
Your sentence ends in the middle.
> >> Again this is a generic problem, and the generic interfaces are broken
> >> if you can't do this.
> But SIOCSIFFLAGS is not implemented by a driver.
I can't square these two statements. Can you indicate what you might
have been talking about, if not SIOCSIFFLAGS?
> That helper really needs to export those counters
> to sysfs as well as ethtool but the support already exists for more
> typical networking.
I know about the ethtool interfaces, but we implement only a tiny
fraction of the stuff that is relevant to ethtool at this level of
abstraction.
> Is it the stack that is byzantine? Or the interface to it?
Both.
Eric W. Biederman wrote:
>>No. If you're running a full IB stack, we provide the usual IB subnet
>>management facilities, and you can run OpenSM to manage your subnet. If
>>you're *not*, which is the case I'm concerned with here, it makes no
>>sense to replicate the byzantine IB management interfaces in order to do
>>a handful of simple things that aren't even tied to the higher-level IB
>>protocols.
>
> Is it the stack that is byzantine? Or the interface to it?
> What I'm thinking ultimately is that there should be something about as
> simple as af_packet in the kernel (but at the IB/rdma layer) that
> gives you the help you need.
I'm not familiar with the driver, but would the lower level verbs interfaces
work for this? Could you just post whatever datagrams that you want directly to
your management QPs?
- Sean
On Thu, 2006-01-19 at 10:50 -0800, Sean Hefty wrote:
> I'm not familiar with the driver, but would the lower level verbs interfaces
> work for this? Could you just post whatever datagrams that you want directly to
> your management QPs?
Our lowest-level driver works in the absence of any IB support being
compiled into the kernel, so in that situation, there are no QPs or any
other management infrastructure present at all. All of that stuff lives
in a higher layer, in which situation the cut-down subnet management
agent doesn't get used, and something like OpenSM is more appropriate.
"Bryan O'Sullivan" <[email protected]> writes:
> On Thu, 2006-01-19 at 11:20 -0700, Eric W. Biederman wrote:
>
>> For high performance
>> non-IP targeted networking cards you aren't doing anything terribly
>> exotic.
>
> True.
>
>> Could you please detail why the IB/rdma helper layer is
>> insufficient to do what you need?
>
> There really isn't an RDMA helper layer. The fact that the IB headers
> live in include/rdma is, as best as I can tell, an artefact of Roland
> being accommodating to someone's suggestion when he was going through
> the same process with the IB tree as we are now with our driver.
The fact that this didn't go farther is part of my complaint,
and part of what needs to be refactored.
>> Right now it largely seems to be a chicken-and-egg problem.
>> There is a large portion of the HPC community that doesn't believe
>> they are interesting to the rest of the world, or that the rest of
>> the world is interesting to them, so they do their own thing, leading
>> to support problems.
>
> I can't solve that problem. If other vendors don't want to pony up
> their driver source and take the same kinds of slings and arrows I'm
> doing, I'm not going to do the work to provide them with a generic set
> of abstractions to use in their out-of-tree or proprietary drivers.
Agreed. Part of the problem is that the IB layer is insufficient, or
at least you perceive it that way. If you can express your problems
at that level, we can get the IB layer fixed.
As for other drivers I know I can get modifiable source, and I know I
can get user pressure to hook it into a standard interface if there
is one.
Most of this should have been sorted out with getting a solid infiniband
layer into the kernel. Since it wasn't, I at least want to get the
interface right for next time.
>> Which is the RDMA thing. And looking at the code and I don't see how
>
> Your sentence ends in the middle.
Sorry. I was in the middle of noticing how incomplete the RDMA layer
was. I guess that just makes my sentence
>> >> Again this is a generic problem, and the generic interfaces are broken
>> >> if you can't do this.
>
>> But SIOCSIFFLAGS is not implemented by a driver.
>
> I can't square these two statements. Can you indicate what you might
> have been talking about, if not SIOCSIFFLAGS?
I was saying that this functionality sounds like something that should
be part of a generic layer. The IFF_UP bit from SIOCSIFFLAGS seems to
behave exactly how you want, but that maps to the network driver
methods open and close. No driver implements SIOCSIFFLAGS itself.
Basically my point was that the helper layers appear insufficient
to your needs.
>> Is it the stack that is byzantine? Or the interface to it?
>
> Both.
This is my other point. Your driver puts packets on infiniband.
Your hardware potentially supports more IB protocols than the driver
for Mellanox's hardware, yet the IB stack does not serve you well.
Except for not being a member of the IB verbs camp, there is nothing
your hardware does that is exotic enough for the IB layer to
fall down. All of the kernel-bypass is used for protocols other
than IB.
So right now it looks like there are 2 things going on.
1) The IB stack poorly supports your driver.
   - IB stack problem. If you could help point out what
     is wrong with the IB stack, that would be great.
2) ipath doesn't seem to want to use the IB stack as a helper
   layer for its fast path protocol.
My sympathies are with you about the IB stack; it integrates
rather badly with the networking layer, so it likely has
other issues.
But you at least have to be willing to budge a little or
these hard problems can't be fixed.
For those who need the buzz words to understand what is going
on: the ipath hardware largely does stateless offload for IB, while
the Mellanox hardware does whole-protocol offload. That would
mean, if this were a normal network driver, ipath good, Mellanox bad.
So something is broken if our generic layers don't support the
kind of hardware the Linux kernel developers profess to prefer.
Eric
"Bryan O'Sullivan" <[email protected]> writes:
> On Thu, 2006-01-19 at 10:50 -0800, Sean Hefty wrote:
>
>> I'm not familiar with the driver, but would the lower level verbs interfaces
>> work for this? Could you just post whatever datagrams that you want directly
> to
>> your management QPs?
>
> Our lowest-level driver works in the absence of any IB support being
> compiled into the kernel, so in that situation, there are no QPs or any
> other management infrastructure present at all. All of that stuff lives
> in a higher layer, in which situation the cut-down subnet management
> agent doesn't get used, and something like OpenSM is more appropriate.
Ok, this is one piece of the puzzle. At your lowest level your hardware
does not have QPs, but it does have something similar to isolate a
userspace process, correct?
It sounds like one problem with the IB layer is that it assumes QPs
instead of a slight abstraction of that concept.
Eric
> For those who need the buzz words to understand what is going
> on: the ipath hardware largely does stateless offload for IB, while
> the Mellanox hardware does whole-protocol offload. That would
> mean, if this were a normal network driver, ipath good, Mellanox bad.
>
Are you sure about this? I would think that if ipath does IB RC service
in hardware, it's nowhere near stateless offload. I don't think this is
a fair comparison.
Steve.
Bryan O'Sullivan wrote:
> Our lowest-level driver works in the absence of any IB support being
> compiled into the kernel, so in that situation, there are no QPs or any
> other management infrastructure present at all. All of that stuff lives
> in a higher layer, in which situation the cut-down subnet management
> agent doesn't get used, and something like OpenSM is more appropriate.
I'm struggling to understand what your card does then. From this, it sounds
like a standard network card that just happens to use IB physicals. Do you just
send raw packets? How is the LRH formatted by your card? I.e. what's setting
up the dlid, slid, vl, etc.? Can your card interoperate with other IB devices
on the network when running in this mode?
- Sean
On Thu, 2006-01-19 at 13:08 -0800, Sean Hefty wrote:
> I'm struggling to understand what your card does then. From this, it sounds
> like a standard network card that just happens to use IB physicals.
It has typical features of a standard network card, while also
supporting direct user access to the hardware. We eschew the
offload-as-much-as-possible approach that other vendors take.
> Do you just send raw packets?
We certainly can do that. The hardware doesn't need to do much more, in
fact.
> How is the LRH formatted by your card? I.e. what's setting
> up the dlid, slid, vl, etc.?
This is all done in software. The low-level driver and hardware fill
out enough of the IB UD protocol headers to put packets on the wire that
an IB switch will route. The higher-level layer is responsible for the
full IB protocol suite and the driver-side interfaces to the various
OpenIB userspace APIs.
> Can your card interoperate with other IB devices
> on the network when running in this mode?
Yes. It can do both the low-level wonkery and regular IB at the same
time.
On Thu, 2006-01-19 at 13:31 -0700, Eric W. Biederman wrote:
> Ok this is one piece of the puzzle. At your lowest level your hardware
> does not have QP's but it does have something similar to isolate a userspace
> process correct?
Right. We implement almost none of the IB protocols in hardware.
On Thu, 2006-01-19 at 13:29 -0700, Eric W. Biederman wrote:
> Agreed. Part of the problem is the IB layer is insufficient, or
> at least you perceive it that way. At that level if you can express
> your problems we can get the IB layer fixed.
Our low-level driver is not IB, doesn't implement IB, and doesn't care
about IB. Our upper-level driver implements IB, and interfaces to the
existing IB tree.
> Except not being a member of the IB verbs camp there is nothing
> your hardware does that is exotic enough for the IB layer to
> fall down.
We implement IB verbs just fine, both in the kernel and userspace.
> 1) The IB stack poorly supports your driver.
> - IB stack problem. If you could help point out what
> is wrong with the IB stack that would be great.
I have no issue with it. We already act as a provider to it, in our
higher-layer driver code.
We have some user page pinning code that is clearly similar in purpose,
and that I want to refactor in a helpful way.
We have UD and RC protocol engines that could profitably be moved out of
our driver and into the IB layer at some future point in time, should
some other device ever come along that could use them.
> For those who need the buzz words to understand what is going
> on: the ipath hardware largely does stateless offload for IB, while
> the Mellanox hardware does whole-protocol offload.
Our hardware actually does no offload whatsoever. That's why we are (a)
fast (b) flexible and (c) somewhat big and unusual compared to other IB
drivers.
On Wed, Jan 18, 2006 at 09:53:08PM -0800, Bryan O'Sullivan wrote:
> On Wed, 2006-01-18 at 21:39 -0800, Greg KH wrote:
>
> > Ok, that's fair enough. But if you want to do something like ptys, then
> > why not just have your own filesystem for this driver?
>
> If you think it's appropriate to implement a new filesystem to replace a
> single ioctl that returns two integers, we can probably do that, but
> more realistically, the GETPORT ioctl can probably live a long and
> untroubled life as another netlink message.
Well it only takes about 250 lines to make a new fs these days, but a
single netlink message would probably be smaller :)
> > You are just making your own type of special interface up as you
> > go, so the complexity is also there (this complexity would normally be
> > in some core code, which I am hoping that your code will turn into for
> > other devices of the same type, right?)
>
> The most important chunk of likely common code I can see at the moment
> is the stuff for bodging user page mappings that we got hammered over
> already. The drivers/infiniband/ tree already has code that does
> something like this, and a few other not-yet-in-tree network drivers
> that support RDMA have similar needs, too.
The RDMA-loving people need to get together and hammer out a proposal
that the network people can laugh at and shoot down all at once :)
Ok, maybe not shoot down, but they do need to get together and come up
with some kind of solution; ad-hoc implementations in a bunch of
different drivers, in a bunch of different ways, are not the proper thing
to do, no matter _how_ differently the hardware works at the lower levels.
thanks,
greg k-h
On Thu, 2006-01-19 at 14:57 -0800, Greg KH wrote:
> The RDMA-loving people need to get together and hammer out a proposal
> that the network people can laugh at and shoot down all at once :)
We are not really in the RDMA camp. Our facility looks more like "when
this kind of message comes in, be sure that it shows up at this point in
my address space", which does not match RDMA semantics.
Also, RDMA's mother smells of elderberries, in my personal opinion.
Bryan O'Sullivan wrote:
> We are not really in the RDMA camp. Our facility looks more like "when
> this kind of message comes in, be sure that it shows up at this point in
> my address space", which does not match RDMA semantics.
A lot of people mean QP-like semantics when they talk about "RDMA", rather than
the RDMA operation itself. I.e. pre-posted receive buffers associated with a
particular user-space process.
That aside, conceptually, I see little difference between RDMA semantics versus
the facility that you describe. The main difference is the complexity of the
header and the checks done against it.
- Sean
Eric> Roland you know the RDMA model best, are things so tied to
Eric> the current crop of infiniband protocols that what the ipath
Eric> code wants to do is not covered?
Eric> They clearly need subsystem support and what they are trying
Eric> to do either isn't covered or they don't see how to use what
Eric> is there. Do the infiniband verbs not allow dealing with a
Eric> unreliable datagram protocol?
I think this has been answered already but the issue is really that
the PathScale hardware does not implement RDMA or even any of the
other connection-oriented abstractions that the RDMA layer is designed
for. The hardware has only much lower level capabilities, which
basically can send and receive packets on an IB link.
With those capabilities it is possible to implement IB transports in
software -- so for example RDMA read operations are simulated by
having the CPU on the receiver copy data to send the response.
However that implementation is not going to make good use of the IB
midlayer, which really operates at the abstraction level above the IB
transport.
It's also possible to use the PathScale hardware to directly implement
MPI on top of a protocol optimized specifically for MPI, without using
IB verbs semantics or an IB transport on the wire. But clearly the
userspace interface needed for doing this is not going to match up
very well with a userspace interface for IB verbs (which is at a
different abstraction level).
- R.
I've been flailing away at the ioctls in our driver, with a good degree
of success. However, one in particular is proving tricky:
> Opening the /dev/ipath special file assigns an appropriate free
> unit (chip) and port (context on a chip) to a user process.
> Think of it as similar to /dev/ptmx for ttys, except there isn't
> a devpts-like filesystem behind it. Once a process has
> opened /dev/ipath, it needs to find out which unit and port it
> has opened, so that it can access other attributes in /sys. To
> do this, we provide a GETPORT ioctl.
I still don't see how to replace this with anything else without
performing unnatural acts.
We use struct file's private_data to keep a pointer to the device in
use, which works fine for ioctl.
However, if I'm coming into the kernel over a netlink socket, I have no
obvious way of going from my table of devices to the processes that have
each one open, and I see no evidence that any other device driver tries
to do anything like this either.
Short of keeping a reference to the task_struct in the device, or
walking the sending process's file table if we receive a netlink message
(both of which are disgusting), I see no way to make this ioctl go away.
Am I missing something?
On Wed, Jan 25, 2006 at 02:32:41PM -0800, Bryan O'Sullivan wrote:
> I've been flailing away at the ioctls in our driver, with a good degree
> of success. However, one in particular is proving tricky:
>
> > Opening the /dev/ipath special file assigns an appropriate free
> > unit (chip) and port (context on a chip) to a user process.
> > Think of it as similar to /dev/ptmx for ttys, except there isn't
> > a devpts-like filesystem behind it. Once a process has
> > opened /dev/ipath, it needs to find out which unit and port it
> > has opened, so that it can access other attributes in /sys. To
> > do this, we provide a GETPORT ioctl.
>
> I still don't see how to replace this with anything else without
> performing unnatural acts.
If this is all it does, why not keep it as a device file, where open()
assigns the resources, read() returns them, and close() frees them? no
ioctl necessary.
Cheers,
Muli
--
Muli Ben-Yehuda
http://www.mulix.org | http://mulix.livejournal.com/
On Thu, 2006-01-26 at 00:43 +0200, Muli Ben-Yehuda wrote:
> If this is all it does, why not keep it as a device file, where open()
> assigns the resources, read() returns them, and close() frees them? no
> ioctl necessary.
Since the char special file doesn't currently implement a read() method,
I can go that way, but the result will either end up being a function
that does a copy_to_user of two bytes, or (if we ever find we need
another ioctl-like thing) it will become an ioctl in all but name.
This is the position the current infiniband code is in. There are
special files with read methods defined that are exactly and precisely
ioctl and nothing else, as far as I can tell, presumably because the
resistance to using ioctl was so high. I'd rather call a spade a spade.