Hi all-
Supporting iopl() in the Linux kernel is becoming a maintainability
problem. As far as I know, DPDK is the only major modern user of
iopl().
After doing some research, I found that DPDK uses direct I/O port
access for only a single purpose: accessing legacy virtio
configuration structures. These structures are mapped in I/O space in
BAR 0 on legacy virtio devices.
There are at least three ways you could avoid using iopl(). Here they
are in rough order of quality in my opinion:
1. Change pci_uio_ioport_read() and pci_uio_ioport_write() to use
read() and write() on resource0 in sysfs (see the sketch just after
this list).
2. Use the alternative access mechanism in the virtio legacy spec:
there is a way to access all of these structures via configuration
space.
3. Use ioperm() instead of iopl().
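For illustration, here is a minimal sketch of what option 1 could look
like. The sysfs path and the register offset are placeholders and error
handling is omitted; for an I/O-port BAR, the kernel performs the actual
inw()/outw() on behalf of the caller:

#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

static int bar0_fd = -1;

/* e.g. "/sys/bus/pci/devices/0000:00:04.0/resource0" (an I/O-port BAR) */
static int legacy_bar0_open(const char *sysfs_path)
{
    bar0_fd = open(sysfs_path, O_RDWR);
    return bar0_fd < 0 ? -1 : 0;
}

static uint16_t legacy_read16(off_t offset)
{
    uint16_t val = 0;

    /* a 2-byte read() at <offset> becomes an inw() done by the kernel */
    pread(bar0_fd, &val, sizeof(val), offset);
    return val;
}

static void legacy_write16(off_t offset, uint16_t val)
{
    /* a 2-byte write() at <offset> becomes an outw() done by the kernel */
    pwrite(bar0_fd, &val, sizeof(val), offset);
}

No iopl() or ioperm() is involved; access control is just file
permissions on the resource0 node of that one device.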
We are considering changes to the kernel that will potentially harm
the performance of any program that uses iopl(3) -- in particular,
context switches will become more expensive, and the scheduler might
need to explicitly penalize such programs to ensure fairness. Using
ioperm() already hurts performance, and the proposed changes to iopl()
will make it even worse. Alternatively, the kernel could drop iopl()
support entirely. I will certainly make a change to allow
distributions to remove iopl() support entirely from their kernels,
and I expect that distributions will do this.
Please fix DPDK.
Thanks,
Andy
Hi Andy,
On Thu, Oct 24, 2019 at 09:45:56PM -0700, Andy Lutomirski wrote:
> Hi all-
>
> Supporting iopl() in the Linux kernel is becoming a maintainability
> problem. As far as I know, DPDK is the only major modern user of
> iopl().
>
> After doing some research, DPDK uses direct io port access for only a
> single purpose: accessing legacy virtio configuration structures.
> These structures are mapped in IO space in BAR 0 on legacy virtio
> devices.
>
> There are at least three ways you could avoid using iopl(). Here they
> are in rough order of quality in my opinion:
(...)
I'm just wondering, why wouldn't we introduce a sys_ioport() syscall
to perform I/Os in the kernel without having to play at all with iopl()/
ioperm() ? That would alleviate the need for these large port maps.
Applications that use outb/inb() usually don't need extreme speeds.
Each time I had to use them, it was to access a watchdog, a sensor, a
fan, control a front panel LED, or read/write to NVRAM. Some userland
drivers possibly don't need much more, and very likely run with
privileges turned on all the time, so replacing their inb()/outb() calls
would mostly be a matter of redefining them using a macro to use the
syscall instead.
I'd see an API more or less like this :
int ioport(int op, u16 port, long val, long *ret);
<op> would take values such as INB,INW,INL to fill *<ret>; OUTB,OUTW,OUTL
to write the value passed in <val>; possibly ORB,ORW,ORL to read, OR with
<val>, write back and return the previous value in *<ret>; ANDB/W/L and
XORB/W/L to do the same with AND/XOR; and maybe a TEST operation to just
validate support at start time and replace ioperm/iopl, so that subsequent
calls do not need to check for errors. Applications could then replace :
ioperm() with ioport(TEST,port,0,0)
iopl() with ioport(TEST,0,0,0)
outb() with ioport(OUTB,port,val,0)
inb() with ({ long val; ioport(INB,port,0,&val); (unsigned char)val; })
... and so on.
And then ioperm/iopl can easily be dropped.
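(Purely hypothetical, since no such syscall exists today: the userland
glue would be roughly the following, where __NR_ioport and the IOPORT_*
opcodes are invented for illustration.)

#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_ioport
#define __NR_ioport 1000  /* placeholder, would be assigned by the kernel */
#endif

enum { IOPORT_TEST, IOPORT_INB, IOPORT_INW, IOPORT_INL,
       IOPORT_OUTB, IOPORT_OUTW, IOPORT_OUTL };

static inline long ioport(int op, uint16_t port, long val, long *ret)
{
    return syscall(__NR_ioport, op, port, val, ret);
}

/* drop-in replacements for the usual macros */
#define outb(val, port) ((void)ioport(IOPORT_OUTB, (port), (val), NULL))
#define inb(port) \
    ({ long __v = 0; ioport(IOPORT_INB, (port), 0, &__v); (uint8_t)__v; })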
Maybe I'm overlooking something ?
Willy
Hello Andy,
On Fri, Oct 25, 2019 at 6:46 AM Andy Lutomirski <[email protected]> wrote:
> Supporting iopl() in the Linux kernel is becoming a maintainability
> problem. As far as I know, DPDK is the only major modern user of
> iopl().
Thanks for reaching out.
Copying our virtio maintainers (Maxime and Tiwei), since they are the
first impacted by such a change.
> After doing some research, DPDK uses direct io port access for only a
> single purpose: accessing legacy virtio configuration structures.
> These structures are mapped in IO space in BAR 0 on legacy virtio
> devices.
>
> There are at least three ways you could avoid using iopl(). Here they
> are in rough order of quality in my opinion:
>
> 1. Change pci_uio_ioport_read() and pci_uio_ioport_write() to use
> read() and write() on resource0 in sysfs.
>
> 2. Use the alternative access mechanism in the virtio legacy spec:
> there is a way to access all of these structures via configuration
> space.
>
> 3. Use ioperm() instead of iopl().
And you came with potential solutions, thanks :-)
We need to look at them and evaluate what is best from our point of view,
and see how it impacts our ABI too (we decided on a freeze until 20.11).
> We are considering changes to the kernel that will potentially harm
> the performance of any program that uses iopl(3) -- in particular,
> context switches will become more expensive, and the scheduler might
> need to explicitly penalize such programs to ensure fairness. Using
> ioperm() already hurts performance, and the proposed changes to iopl()
> will make it even worse. Alternatively, the kernel could drop iopl()
> support entirely. I will certainly make a change to allow
> distributions to remove iopl() support entirely from their kernels,
> and I expect that distributions will do this.
>
> Please fix DPDK.
Unfortunately, we are currently closing our rc1 for the 19.11 release.
Not sure who is available, but I suppose we can work on this subject
in the 20.02 release timeframe.
Thanks.
--
David Marchand
On Thu, Oct 24, 2019 at 11:42 PM Willy Tarreau <[email protected]> wrote:
>
> Hi Andy,
>
> On Thu, Oct 24, 2019 at 09:45:56PM -0700, Andy Lutomirski wrote:
> > Hi all-
> >
> > Supporting iopl() in the Linux kernel is becoming a maintainability
> > problem. As far as I know, DPDK is the only major modern user of
> > iopl().
> >
> > After doing some research, DPDK uses direct io port access for only a
> > single purpose: accessing legacy virtio configuration structures.
> > These structures are mapped in IO space in BAR 0 on legacy virtio
> > devices.
> >
> > There are at least three ways you could avoid using iopl(). Here they
> > are in rough order of quality in my opinion:
> (...)
>
> I'm just wondering, why wouldn't we introduce a sys_ioport() syscall
> to perform I/Os in the kernel without having to play at all with iopl()/
> ioperm() ? That would alleviate the need for these large port maps.
> Applications that use outb/inb() usually don't need extreme speeds.
> Each time I had to use them, it was to access a watchdog, a sensor, a
> fan, control a front panel LED, or read/write to NVRAM. Some userland
> drivers possibly don't need much more, and very likely run with
> privileges turned on all the time, so replacing their inb()/outb() calls
> would mostly be a matter of redefining them using a macro to use the
> syscall instead.
>
> I'd see an API more or less like this :
>
> int ioport(int op, u16 port, long val, long *ret);
Hmm. I have some memory of a /dev/ioport or similar, but now I can't
find it. It does seem quite reasonable.
But, for uses like DPDK, /sys/.../resource0 seems like a *far* better
API, since it actually uses the kernel's concept of which io range
corresponds to which device instead of hoping that the mappings don't
change out from under user code. And it has the added benefit that
it's restricted to a single device.
--Andy
On Fri, Oct 25, 2019 at 07:45:47AM -0700, Andy Lutomirski wrote:
> But, for uses like DPDK, /sys/.../resource0 seems like a *far* better
> API, since it actually uses the kernel's concept of which io range
> corresponds to which device instead of hoping that the mappings don't
> change out from under user code. And it has the added benefit that
> it's restricted to a single device.
For certain such uses with real device management, very likely yes.
It's just that in a number of programs using hard-coded ports to
access stupid devices with no driver (and often even no name), such
an approach could be overkill, and these are typically the annoyingly
itchy ones which could require your config entry to remain enabled.
I'll add to my todo list to have a look at this as time permits.
Cheers,
Willy
On Thu, 24 Oct 2019 21:45:56 -0700
Andy Lutomirski <[email protected]> wrote:
> Hi all-
>
> Supporting iopl() in the Linux kernel is becoming a maintainability
> problem. As far as I know, DPDK is the only major modern user of
> iopl().
>
> After doing some research, DPDK uses direct io port access for only a
> single purpose: accessing legacy virtio configuration structures.
> These structures are mapped in IO space in BAR 0 on legacy virtio
> devices.
Yes. Legacy virtio seems to have been designed without consideration
of how to use it in userspace. Xen, VMware and Hyper-V all use memory
as the doorbell mechanism, which is easier to use from userspace.
> There are at least three ways you could avoid using iopl(). Here they
> are in rough order of quality in my opinion:
>
> 1. Change pci_uio_ioport_read() and pci_uio_ioport_write() to use
> read() and write() on resource0 in sysfs.
The cost of entering the kernel for each doorbell write is too
high and would kill performance.
> 2. Use the alternative access mechanism in the virtio legacy spec:
> there is a way to access all of these structures via configuration
> space.
There is no way to use a memory doorbell on older versions of virtio.
Users want to run DPDK on old stuff like RHEL6 and even older
kernel forks. There are even use cases where virtio is used with
a non-Linux host, such as GCP.
> 3. Use ioperm() instead of iopl().
ioperm() has the wrong thread semantics. All DPDK applications have
multiple threads, and the initialization logic needs to work even
if a thread is started later; threads can also be started by
the user application.
iopl() applies to the whole process, so this is not an issue.
>
>
> We are considering changes to the kernel that will potentially harm
> the performance of any program that uses iopl(3) -- in particular,
> context switches will become more expensive, and the scheduler might
> need to explicitly penalize such programs to ensure fairness. Using
> ioperm() already hurts performance, and the proposed changes to iopl()
> will make it even worse. Alternatively, the kernel could drop iopl()
> support entirely. I will certainly make a change to allow
> distributions to remove iopl() support entirely from their kernels,
> and I expect that distributions will do this.
>
> Please fix DPDK.
Please fix virtio.
On Fri, 25 Oct 2019, Stephen Hemminger wrote:
> On Thu, 24 Oct 2019 21:45:56 -0700
> Andy Lutomirski <[email protected]> wrote:
> > 3. Use ioperm() instead of iopl().
>
> Ioperm has the wrong thread semantics. All DPDK applications have
> multiple threads and the initialization logic needs to work even
> if the thread is started later; threads can also be started by
> the user application.
>
> Iopl applies to whole process so this is not an issue.
No. iopl is also per thread and not per process. It has been that way
forever. The man page is blatantly wrong.
Both iopl and ioperm are inherited on fork.
Thanks,
tglx
> On Oct 25, 2019, at 9:13 AM, Stephen Hemminger <[email protected]> wrote:
>
> On Thu, 24 Oct 2019 21:45:56 -0700
> Andy Lutomirski <[email protected]> wrote:
>
>> Hi all-
>>
>> Supporting iopl() in the Linux kernel is becoming a maintainability
>> problem. As far as I know, DPDK is the only major modern user of
>> iopl().
>>
>> After doing some research, DPDK uses direct io port access for only a
>> single purpose: accessing legacy virtio configuration structures.
>> These structures are mapped in IO space in BAR 0 on legacy virtio
>> devices.
>
> Yes. Legacy virtio seems to have been designed without consideration
> of how to use it in userspace. Xen, Vmware and Hyper-V all use memory
> as a doorbell mechanism which is easier to use from userspace.
>
>
>> There are at least three ways you could avoid using iopl(). Here they
>> are in rough order of quality in my opinion:
>>
>> 1. Change pci_uio_ioport_read() and pci_uio_ioport_write() to use
>> read() and write() on resource0 in sysfs.
>
> The cost of entering the kernel for a doorbell mechanism is too
> expensive and would kill performance.
>
>
>> 2. Use the alternative access mechanism in the virtio legacy spec:
>> there is a way to access all of these structures via configuration
>> space.
>
> There is no way to use memory doorbell on older versions of virtio.
> Users want to run DPDK on old stuff like RHEL6 and even older
> kernel forks. There are even use cases where virtio is used for
> a non-Linux host; such as GCP.
>
>
>> 3. Use ioperm() instead of iopl().
>
> Ioperm has the wrong thread semantics. All DPDK applications have
> multiple threads and the initialization logic needs to work even
> if the thread is started later; threads can also be started by
> the user application.
>
> Iopl applies to whole process so this is not an issue.
This is not true. ioperm() and iopl() have identical thread semantics.
I think what you’re seeing is that you can set iopl(3) early without
knowing which port range to request. You could alternatively set
ioperm() early and ask for a very wide range. In principle, we could
make ioperm() be per thread, but I’m not sure we should add that kind
of complexity to support a mostly obsolete use case like this.
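As a sketch only (assuming a kernel recent enough to accept the full
64K range, and CAP_SYS_RAWIO), the wide early ioperm() call would look
like this:

#include <stdio.h>
#include <sys/io.h>

int main(void)
{
    /* Ask for the whole port range once, before any threads are created,
     * so later threads inherit the bitmap.  Needs CAP_SYS_RAWIO; very old
     * kernels cap the range at 0x400 ports. */
    if (ioperm(0, 65536, 1) != 0) {
        perror("ioperm");
        return 1;
    }
    /* ... spawn workers, use inb()/outb() from any of them ... */
    return 0;
}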
There's actually an argument to be made that per-mm ioperm would be
easier to handle in the kernel than per-task due to the vagaries of
KPTI.
All this being said, what are the actual performance implications of
write() to /sys/.../resource0? Off the top of my head, I would guess
that the OUTB or OUTL instruction itself is incredibly slow (it is
trapped and emulated), that virtio-legacy hypervisors aren't
particularly fast to begin with, and that, as a result, the write()
might not actually matter that much.
>
>>
>>
>> We are considering changes to the kernel that will potentially harm
>> the performance of any program that uses iopl(3) -- in particular,
>> context switches will become more expensive, and the scheduler might
>> need to explicitly penalize such programs to ensure fairness. Using
>> ioperm() already hurts performance, and the proposed changes to iopl()
>> will make it even worse. Alternatively, the kernel could drop iopl()
>> support entirely. I will certainly make a change to allow
>> distributions to remove iopl() support entirely from their kernels,
>> and I expect that distributions will do this.
>>
>> Please fix DPDK.
>
> Please fix virtio.
Done, with the new version of virtio :)
On Fri, 25 Oct 2019, Andy Lutomirski wrote:
> > I'd see an API more or less like this :
> >
> > int ioport(int op, u16 port, long val, long *ret);
>
> Hmm. I have some memory of a /dev/ioport or similar, but now I can't
> find it. It does seem quite reasonable.
crw-r----- 1 root kmem 1, 4 Sep 9 13:58 /dev/port
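A byte offset into /dev/port corresponds to that port number, so a
sketch of poking a port through it (no iopl()/ioperm(), just opening
the device node) would be:

#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

int main(void)
{
    /* opening needs CAP_SYS_RAWIO on top of the file permissions */
    int fd = open("/dev/port", O_RDWR);
    uint8_t v = 0;

    pread(fd, &v, 1, 0x80);   /* inb(0x80) -- POST/debug port, example only */
    pwrite(fd, &v, 1, 0x80);  /* outb(v, 0x80) */
    close(fd);
    return 0;
}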
Maciej
On Fri, 25 Oct 2019 08:42:25 +0200
Willy Tarreau <[email protected]> wrote:
> Hi Andy,
>
> On Thu, Oct 24, 2019 at 09:45:56PM -0700, Andy Lutomirski wrote:
> > Hi all-
> >
> > Supporting iopl() in the Linux kernel is becoming a maintainability
> > problem. As far as I know, DPDK is the only major modern user of
> > iopl().
> >
> > After doing some research, DPDK uses direct io port access for only a
> > single purpose: accessing legacy virtio configuration structures.
> > These structures are mapped in IO space in BAR 0 on legacy virtio
> > devices.
> >
> > There are at least three ways you could avoid using iopl(). Here they
> > are in rough order of quality in my opinion:
> (...)
>
> I'm just wondering, why wouldn't we introduce a sys_ioport() syscall
> to perform I/Os in the kernel without having to play at all with iopl()/
> ioperm() ? That would alleviate the need for these large port maps.
> Applications that use outb/inb() usually don't need extreme speeds.
> Each time I had to use them, it was to access a watchdog, a sensor, a
> fan, control a front panel LED, or read/write to NVRAM. Some userland
> drivers possibly don't need much more, and very likely run with
> privileges turned on all the time, so replacing their inb()/outb() calls
> would mostly be a matter of redefining them using a macro to use the
> syscall instead.
>
> I'd see an API more or less like this :
>
> int ioport(int op, u16 port, long val, long *ret);
>
> <op> would take values such as INB,INW,INL to fill *<ret>, OUTB,OUTW,OUTL
> to read from <val>, possibly ORB,ORW,ORL to read, or with <val>, write
> back and return previous value to <ret>, ANDB/W/L, XORB/W/L to do the
> same with and/xor, and maybe a TEST operation to just validate support
> at start time and replace ioperm/iopl so that subsequent calls do not
> need to check for errors. Applications could then replace :
>
> ioperm() with ioport(TEST,port,0,0)
> iopl() with ioport(TEST,0,0,0)
> outb() with ioport(OUTB,port,val,0)
> inb() with ({ char val;ioport(INB,port,0,&val);val;})
>
> ... and so on.
>
> And then ioperm/iopl can easily be dropped.
>
> Maybe I'm overlooking something ?
> Willy
DPDK does not want to use system calls; they kill performance.
With pure user-mode access it can reach > 10 million packets/sec;
with a system call per packet, that drops to 1 million packets/sec.
Also, adding new system calls might help in the long term,
but users are often on kernels that are at least 5 years behind
upstream.
> On Oct 28, 2019, at 10:43 AM, Stephen Hemminger <[email protected]> wrote:
>
> On Fri, 25 Oct 2019 08:42:25 +0200
> Willy Tarreau <[email protected]> wrote:
>
>> Hi Andy,
>>
>>> On Thu, Oct 24, 2019 at 09:45:56PM -0700, Andy Lutomirski wrote:
>>> Hi all-
>>>
>>> Supporting iopl() in the Linux kernel is becoming a maintainability
>>> problem. As far as I know, DPDK is the only major modern user of
>>> iopl().
>>>
>>> After doing some research, DPDK uses direct io port access for only a
>>> single purpose: accessing legacy virtio configuration structures.
>>> These structures are mapped in IO space in BAR 0 on legacy virtio
>>> devices.
>>>
>>> There are at least three ways you could avoid using iopl(). Here they
>>> are in rough order of quality in my opinion:
>> (...)
>>
>> I'm just wondering, why wouldn't we introduce a sys_ioport() syscall
>> to perform I/Os in the kernel without having to play at all with iopl()/
>> ioperm() ? That would alleviate the need for these large port maps.
>> Applications that use outb/inb() usually don't need extreme speeds.
>> Each time I had to use them, it was to access a watchdog, a sensor, a
>> fan, control a front panel LED, or read/write to NVRAM. Some userland
>> drivers possibly don't need much more, and very likely run with
>> privileges turned on all the time, so replacing their inb()/outb() calls
>> would mostly be a matter of redefining them using a macro to use the
>> syscall instead.
>>
>> I'd see an API more or less like this :
>>
>> int ioport(int op, u16 port, long val, long *ret);
>>
>> <op> would take values such as INB,INW,INL to fill *<ret>, OUTB,OUTW,OUTL
>> to read from <val>, possibly ORB,ORW,ORL to read, or with <val>, write
>> back and return previous value to <ret>, ANDB/W/L, XORB/W/L to do the
>> same with and/xor, and maybe a TEST operation to just validate support
>> at start time and replace ioperm/iopl so that subsequent calls do not
>> need to check for errors. Applications could then replace :
>>
>> ioperm() with ioport(TEST,port,0,0)
>> iopl() with ioport(TEST,0,0,0)
>> outb() with ioport(OUTB,port,val,0)
>> inb() with ({ char val;ioport(INB,port,0,&val);val;})
>>
>> ... and so on.
>>
>> And then ioperm/iopl can easily be dropped.
>>
>> Maybe I'm overlooking something ?
>> Willy
>
> DPDK does not want to use system calls; they kill performance.
> With pure user-mode access it can reach > 10 million packets/sec;
> with a system call per packet, that drops to 1 million packets/sec.
If you are getting 10 MPPS with an OUT per packet, I’ll buy you a
whole case of beer.
I’m suggesting that, on virtio-legacy, you benchmark the performance
hit of using a syscall to ring the doorbell. Right now, you're doing
an OUT instruction that traps to the hypervisor, probably gets
emulated, and goes out to whatever host-side driver is running. The
cost of doing that is going to be quite high, especially on older
machines. I'm guessing that adding a syscall to the mix won't make
much difference.
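If it helps, this is roughly the comparison I have in mind -- a sketch
only, where the sysfs path, the doorbell port/offset and the iteration
count are placeholders, and where hammering a real doorbell register
like this obviously has side effects:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/io.h>
#include <time.h>
#include <unistd.h>

#define DOORBELL_PORT 0xc050  /* placeholder: queue notify register as a raw port */
#define DOORBELL_OFF  0x10    /* placeholder: same register as an offset into BAR0 */
#define ITERS 1000000

static double elapsed(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int main(void)
{
    /* placeholder path; must be an I/O-port BAR */
    int fd = open("/sys/bus/pci/devices/0000:00:04.0/resource0", O_RDWR);
    struct timespec t0, t1;
    uint16_t q = 0;
    int i;

    /* variant 1: raw OUTW from userspace, needs iopl(3)/ioperm() and root */
    iopl(3);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < ITERS; i++)
        outw(q, DOORBELL_PORT);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("outw:   %.0f ns/op\n", elapsed(t0, t1) * 1e9 / ITERS);

    /* variant 2: the same doorbell rung via a write() syscall on resource0 */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < ITERS; i++)
        pwrite(fd, &q, sizeof(q), DOORBELL_OFF);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("pwrite: %.0f ns/op\n", elapsed(t0, t1) * 1e9 / ITERS);

    close(fd);
    return 0;
}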
--Andy
Hi Stephen,
On Mon, Oct 28, 2019 at 09:42:53AM -0700, Stephen Hemminger wrote:
(...)
> > I'd see an API more or less like this :
> >
> > int ioport(int op, u16 port, long val, long *ret);
> >
> > <op> would take values such as INB,INW,INL to fill *<ret>, OUTB,OUTW,OUTL
> > to read from <val>, possibly ORB,ORW,ORL to read, or with <val>, write
> > back and return previous value to <ret>, ANDB/W/L, XORB/W/L to do the
> > same with and/xor, and maybe a TEST operation to just validate support
> > at start time and replace ioperm/iopl so that subsequent calls do not
> > need to check for errors. Applications could then replace :
> >
> > ioperm() with ioport(TEST,port,0,0)
> > iopl() with ioport(TEST,0,0,0)
> > outb() with ioport(OUTB,port,val,0)
> > inb() with ({ char val;ioport(INB,port,0,&val);val;})
> >
> > ... and so on.
> >
> > And then ioperm/iopl can easily be dropped.
> >
> > Maybe I'm overlooking something ?
> > Willy
>
> DPDK does not want to use system calls; they kill performance.
> With pure user-mode access it can reach > 10 million packets/sec;
> with a system call per packet, that drops to 1 million packets/sec.
I know that it would cause this on the data path, but are you *really*
sure that in/out calls are performed there ? These are terribly slow
already. I'd suspect that instead it's relying on reads/writes of
memory-mapped registers and descriptors. I really suspect that I/Os
are only used for configuration purposes, which is why I proposed the
stuff above (otherwise I obviously agree that syscalls in the data
path are performance killers).
> Also, adding new system calls might help in the long term,
> but users are often kernels that are at least 5 years behind
> upstream.
Sure, but that has never really been an issue; what matters is that
backwards compatibility lasts long enough to let old features smoothly
fade away. Some people make fun of me because I still care a bit
about kernel 2.4 and openssl 0.9.7 compatibility for haproxy, so
yes, I am careful about backwards compatibility and smooth upgrades ;-)
Willy