2024-03-21 17:55:19

by Jonathan Cameron

[permalink] [raw]
Subject: RFC: Restricting userspace interfaces for CXL fabric management

Hi All,

This is has come up in a number of discussions both on list and in private,
so I wanted to lay out a potential set of rules when deciding whether or not
to provide a user space interface for a particular feature of CXL Fabric
Management. The intent is to drive discussion, not to simply tell people
a set of rules. I've brought this to the public lists as it's a Linux kernel
policy discussion, not a standards one.

Whilst I'm writing the RFC this my attempt to summarize a possible
position rather than necessarily being my personal view.

It's a straw man - shoot at it!

Not everyone in this discussion is familiar with relevant kernel or CXL concepts
so I've provided more info than I normally would.

First some background:
======================

CXL has two different types of Fabric. The comments here refer to both, but
for now the kernel stack is focused on the simpler VCS fabric, not the more
recent Port Based Routing (PBR) Fabrics. A typical example for 2 hosts
connected to a common switch looks something like:

________________ _______________
| | | | Hosts - each sees
| HOST A | | HOST B | a PCIe style tree
| | | | but from a fabric config
| |Root Port| | | |Root Port| | point of view it's more
-------|-------- -------|------- complex.
| |
| |
_______|______________________________|________
| USP (SW-CCI) USP | Switch can have lots of
| | | | Upstream Ports. Each one
| ____|________ _______|______ | has a virtual hierarchy.
| | | | | |
| vPPB vPPB vPPB vPPB| There are virtual
| x | | | | "downstream ports."(vPPBs)
| \ / / | That can be bound to real
| \ / / | downstream ports.
| \ / / |
| \ / / | Multi Logical Devices are
| DSP0 DSP1 DSP 2 | support more than one vPPB
------------------------------------------------ bound to a single physical
| | | DSP (transactions are tagged
| | | with an LD-ID)
SLD0 MLD0 SLD1

Some typical fabric management activities:
1) Bind/Unbind vPPB to physical DSP (Results in hotplug / unplug events)
2) Access config space or BAR space of End Points below the switch.
3) Tunneling messages through to devices downstream (e.g Dynamic Capacity
Forced Remove that will blow away some memory even if a host is using it).
4) Non destructive stuff like status read back.

Given the hosts may be using the Type 3 hosted memory (either Single Logical
Device - SLD, or an LD on a Multi logical Device - MLD) as normal memory,
unbinding a device in use can result in the memory access from a
different host being removed. The 'blast radius' is perhaps a rack of
servers. This discussion applies equally to FM-API commands sent to Multi
Head Devices (see CXL r3.1).

The Fabric Management actions are done using the CXL spec defined Fabric
Management API, (FM-API) which is transported over various means including
OoB MCTP over your favourite transport (I2C, PCIe-VDM...) or via normal
PCIe read/write to a Switch-CCI. A Switch-CCI is mailbox in PCI BAR
space on a function found alongside one of the switch upstream ports;
this mailbox is very similar to the MMPT definition found in PCIe r6.2.

In many cases this switch CCI / MCTP connection is used by a BMC rather
than a normal host, but there have been some questions raised about whether
a general purpose server OS would have a valid reason to use this interface
(beyond debug and testing) to configure the switch or an MHD.

If people have a use case for this, please reply to this thread to give
more details.

The most recently posted CXL Switch-CCI support only provided the RAW CXL
command IOCTL interface that is already available for Type 3 memory devices.
That allows for unfettered control of the switch but, because it is
extremely easy to shoot yourself in the foot and cause unsolvable bug reports,
it taints the kernel. There have been several requests to provide this interface
without the taint for these switch configuration mailboxes.

Last posted series:
https://lore.kernel.org/all/[email protected]/
Note there are unrelated reasons why that code hasn't been updated since v6.6 time,
but I am planning to get back to it shortly.

Similar issues will occur for other uses of PCIe MMPT (new mailbox in PCI that
sometimes is used for similarly destructive activity such as PLDM based
firmware update).


On to the proposed rules:

1) Kernel space use of the various mailboxes, or filtered controls from user space.
==================================================================================

Absolutely fine - no one worries about this, but the mediated traffic will
be filtered for potentially destructive side effects. E.g. it will reject
attempts to change anything routing related if the kernel either knows a host is
using memory that will be blown away, or has no way to know (so affecting
routing to another host). This includes blocking 'all' vendor defined
messages as we have no idea what the do. Note this means the kernel has
an allow list and new commands are not initially allowed.

This isn't currently enabled for Switch CCIs because they are only really
interesting if the potentially destructive stuff is available (an earlier
version did enable query commands, but it wasn't particularly useful to
know what your switch could do but not be allowed to do any of it).
If you take a MMPT usecase of PLDM firmware update, the filtering would
check that the device was in a state where a firmware update won't rip
memory out from under a host, which would be messy if that host is
doing the update.

2) Unfiltered userspace use of mailbox for Fabric Management - BMC kernels
==========================================================================

(This would just be a kernel option that we'd advise normal server
distributions not to turn on. Would be enabled by openBMC etc)

This is fine - there is some work to do, but the switch-cci PCI driver
will hopefully be ready for upstream merge soon. There is no filtering of
accesses. Think of this as similar to all the damage you can do via
MCTP from a BMC. Similarly it is likely that much of the complexity
of the actual commands will be left to user space tooling:
https://gitlab.com/jic23/cxl-fmapi-tests has some test examples.

Whether Kconfig help text is strong enough to ensure this only gets
enabled for BMC targeted distros is an open question we can address
alongside an updated patch set.

(On to the one that the "debate" is about)

3) Unfiltered user space use of mailbox for Fabric Management - Distro kernels
=============================================================================
(General purpose Linux Server Distro (Redhat, Suse etc))

This is equivalent of RAW command support on CXL Type 3 memory devices.
You can enable those in a distro kernel build despite the scary config
help text, but if you use it the kernel is tainted. The result
of the taint is to add a flag to bug reports and print a big message to say
that you've used a feature that might result in you shooting yourself
in the foot.

The taint is there because software is not at first written to deal with
everything that can happen smoothly (e.g. surprise removal) It's hard
to survive some of these events, so is never on the initial feature list
for any bus, so this flag is just to indicate we have entered a world
where almost all bets are off wrt to stability. We might not know what
a command does so we can't assess the impact (and no one trusts vendor
commands to report affects right in the Command Effects Log - which
in theory tells you if a command can result problems).

A concern was raised about GAE/FAST/LDST tables for CXL Fabrics
(a r3.1 feature) but, as I understand it, these are intended for a
host to configure and should not have side effects on other hosts?
My working assumption is that the kernel driver stack will handle
these (once we catch up with the current feature backlog!) Currently
we have no visibility of what the OS driver stack for a fabrics will
actually look like - the spec is just the starting point for that.
(patches welcome ;)

The various CXL upstream developers and maintainers may have
differing views of course, but my current understanding is we want
to support 1 and 2, but are very resistant to 3!

General Notes
=============

One side aspect of why we really don't like unfiltered userspace access to any
of these devices is that people start building non standard hacks in and we
lose the ecosystem advantages. Forcing a considered discussion + patches
to let a particular command be supported, drives standardization.

https://lore.kernel.org/linux-cxl/CAPcyv4gDShAYih5iWabKg_eTHhuHm54vEAei8ZkcmHnPp3B0cw@mail.gmail.com/
provides some history on vendor specific extensions and why in general we
won't support them upstream.

To address another question raised in an earlier discussion:
Putting these Fabric Management interfaces behind guard rails of some type
(e.g. CONFIG_IM_A_BMC_AND_CAN_MAKE_A_MESS) does not encourage the risk
of non standard interfaces, because we will be even less likely to accept
those upstream!

If anyone needs more details on any aspect of this please ask.
There are a lot of things involved and I've only tried to give a fairly
minimal illustration to drive the discussion. I may well have missed
something crucial.

Jonathan



2024-03-21 21:41:45

by Sreenivas Bagalkote

[permalink] [raw]
Subject: Re: RFC: Restricting userspace interfaces for CXL fabric management

Thank you for kicking off this discussion, Jonathan.

We need guidance from the community.

1. Datacenter customers must be able to manage PCIe switches in-band.
2. Management of switches includes getting health, performance, and error
telemetry.
3. These telemetry functions are not yet part of the CXL standard
4. We built the CCI mailboxes into our PCIe switches per CXL spec and
developed our management scheme around them.

If the Linux community does not allow a CXL spec-compliant switch to be
managed via the CXL spec-defined CCI mailbox, then please guide us on
the right approach. Please tell us how you propose we manage our switches
in-band.

Thank you
Sreeni

On Thu, Mar 21, 2024 at 10:44 AM Jonathan Cameron <
[email protected]> wrote:

> Hi All,
>
> This is has come up in a number of discussions both on list and in private,
> so I wanted to lay out a potential set of rules when deciding whether or
> not
> to provide a user space interface for a particular feature of CXL Fabric
> Management. The intent is to drive discussion, not to simply tell people
> a set of rules. I've brought this to the public lists as it's a Linux
> kernel
> policy discussion, not a standards one.
>
> Whilst I'm writing the RFC this my attempt to summarize a possible
> position rather than necessarily being my personal view.
>
> It's a straw man - shoot at it!
>
> Not everyone in this discussion is familiar with relevant kernel or CXL
> concepts
> so I've provided more info than I normally would.
>
> First some background:
> ======================
>
> CXL has two different types of Fabric. The comments here refer to both, but
> for now the kernel stack is focused on the simpler VCS fabric, not the more
> recent Port Based Routing (PBR) Fabrics. A typical example for 2 hosts
> connected to a common switch looks something like:
>
> ________________ _______________
> | | | | Hosts - each sees
> | HOST A | | HOST B | a PCIe style tree
> | | | | but from a fabric
> config
> | |Root Port| | | |Root Port| | point of view it's more
> -------|-------- -------|------- complex.
> | |
> | |
> _______|______________________________|________
> | USP (SW-CCI) USP | Switch can have lots of
> | | | | Upstream Ports. Each one
> | ____|________ _______|______ | has a virtual hierarchy.
> | | | | | |
> | vPPB vPPB vPPB vPPB| There are virtual
> | x | | | | "downstream
> ports."(vPPBs)
> | \ / / | That can be bound to real
> | \ / / | downstream ports.
> | \ / / |
> | \ / / | Multi Logical Devices are
> | DSP0 DSP1 DSP 2 | support more than one
> vPPB
> ------------------------------------------------ bound to a single
> physical
> | | | DSP (transactions are
> tagged
> | | | with an LD-ID)
> SLD0 MLD0 SLD1
>
> Some typical fabric management activities:
> 1) Bind/Unbind vPPB to physical DSP (Results in hotplug / unplug events)
> 2) Access config space or BAR space of End Points below the switch.
> 3) Tunneling messages through to devices downstream (e.g Dynamic Capacity
> Forced Remove that will blow away some memory even if a host is using
> it).
> 4) Non destructive stuff like status read back.
>
> Given the hosts may be using the Type 3 hosted memory (either Single
> Logical
> Device - SLD, or an LD on a Multi logical Device - MLD) as normal memory,
> unbinding a device in use can result in the memory access from a
> different host being removed. The 'blast radius' is perhaps a rack of
> servers. This discussion applies equally to FM-API commands sent to Multi
> Head Devices (see CXL r3.1).
>
> The Fabric Management actions are done using the CXL spec defined Fabric
> Management API, (FM-API) which is transported over various means including
> OoB MCTP over your favourite transport (I2C, PCIe-VDM...) or via normal
> PCIe read/write to a Switch-CCI. A Switch-CCI is mailbox in PCI BAR
> space on a function found alongside one of the switch upstream ports;
> this mailbox is very similar to the MMPT definition found in PCIe r6.2.
>
> In many cases this switch CCI / MCTP connection is used by a BMC rather
> than a normal host, but there have been some questions raised about whether
> a general purpose server OS would have a valid reason to use this interface
> (beyond debug and testing) to configure the switch or an MHD.
>
> If people have a use case for this, please reply to this thread to give
> more details.
>
> The most recently posted CXL Switch-CCI support only provided the RAW CXL
> command IOCTL interface that is already available for Type 3 memory
> devices.
> That allows for unfettered control of the switch but, because it is
> extremely easy to shoot yourself in the foot and cause unsolvable bug
> reports,
> it taints the kernel. There have been several requests to provide this
> interface
> without the taint for these switch configuration mailboxes.
>
> Last posted series:
>
> https://lore.kernel.org/all/[email protected]/
> Note there are unrelated reasons why that code hasn't been updated since
> v6.6 time,
> but I am planning to get back to it shortly.
>
> Similar issues will occur for other uses of PCIe MMPT (new mailbox in PCI
> that
> sometimes is used for similarly destructive activity such as PLDM based
> firmware update).
>
>
> On to the proposed rules:
>
> 1) Kernel space use of the various mailboxes, or filtered controls from
> user space.
>
> ==================================================================================
>
> Absolutely fine - no one worries about this, but the mediated traffic will
> be filtered for potentially destructive side effects. E.g. it will reject
> attempts to change anything routing related if the kernel either knows a
> host is
> using memory that will be blown away, or has no way to know (so affecting
> routing to another host). This includes blocking 'all' vendor defined
> messages as we have no idea what the do. Note this means the kernel has
> an allow list and new commands are not initially allowed.
>
> This isn't currently enabled for Switch CCIs because they are only really
> interesting if the potentially destructive stuff is available (an earlier
> version did enable query commands, but it wasn't particularly useful to
> know what your switch could do but not be allowed to do any of it).
> If you take a MMPT usecase of PLDM firmware update, the filtering would
> check that the device was in a state where a firmware update won't rip
> memory out from under a host, which would be messy if that host is
> doing the update.
>
> 2) Unfiltered userspace use of mailbox for Fabric Management - BMC kernels
> ==========================================================================
>
> (This would just be a kernel option that we'd advise normal server
> distributions not to turn on. Would be enabled by openBMC etc)
>
> This is fine - there is some work to do, but the switch-cci PCI driver
> will hopefully be ready for upstream merge soon. There is no filtering of
> accesses. Think of this as similar to all the damage you can do via
> MCTP from a BMC. Similarly it is likely that much of the complexity
> of the actual commands will be left to user space tooling:
> https://gitlab.com/jic23/cxl-fmapi-tests has some test examples.
>
> Whether Kconfig help text is strong enough to ensure this only gets
> enabled for BMC targeted distros is an open question we can address
> alongside an updated patch set.
>
> (On to the one that the "debate" is about)
>
> 3) Unfiltered user space use of mailbox for Fabric Management - Distro
> kernels
>
> =============================================================================
> (General purpose Linux Server Distro (Redhat, Suse etc))
>
> This is equivalent of RAW command support on CXL Type 3 memory devices.
> You can enable those in a distro kernel build despite the scary config
> help text, but if you use it the kernel is tainted. The result
> of the taint is to add a flag to bug reports and print a big message to say
> that you've used a feature that might result in you shooting yourself
> in the foot.
>
> The taint is there because software is not at first written to deal with
> everything that can happen smoothly (e.g. surprise removal) It's hard
> to survive some of these events, so is never on the initial feature list
> for any bus, so this flag is just to indicate we have entered a world
> where almost all bets are off wrt to stability. We might not know what
> a command does so we can't assess the impact (and no one trusts vendor
> commands to report affects right in the Command Effects Log - which
> in theory tells you if a command can result problems).
>
> A concern was raised about GAE/FAST/LDST tables for CXL Fabrics
> (a r3.1 feature) but, as I understand it, these are intended for a
> host to configure and should not have side effects on other hosts?
> My working assumption is that the kernel driver stack will handle
> these (once we catch up with the current feature backlog!) Currently
> we have no visibility of what the OS driver stack for a fabrics will
> actually look like - the spec is just the starting point for that.
> (patches welcome ;)
>
> The various CXL upstream developers and maintainers may have
> differing views of course, but my current understanding is we want
> to support 1 and 2, but are very resistant to 3!
>
> General Notes
> =============
>
> One side aspect of why we really don't like unfiltered userspace access to
> any
> of these devices is that people start building non standard hacks in and we
> lose the ecosystem advantages. Forcing a considered discussion + patches
> to let a particular command be supported, drives standardization.
>
>
> https://lore.kernel.org/linux-cxl/CAPcyv4gDShAYih5iWabKg_eTHhuHm54vEAei8ZkcmHnPp3B0cw@mail.gmail.com/
> provides some history on vendor specific extensions and why in general we
> won't support them upstream.
>
> To address another question raised in an earlier discussion:
> Putting these Fabric Management interfaces behind guard rails of some type
> (e.g. CONFIG_IM_A_BMC_AND_CAN_MAKE_A_MESS) does not encourage the risk
> of non standard interfaces, because we will be even less likely to accept
> those upstream!
>
> If anyone needs more details on any aspect of this please ask.
> There are a lot of things involved and I've only tried to give a fairly
> minimal illustration to drive the discussion. I may well have missed
> something crucial.
>
> Jonathan
>
>

--
This electronic communication and the information and any files transmitted
with it, or attached to it, are confidential and are intended solely for
the use of the individual or entity to whom it is addressed and may contain
information that is confidential, legally privileged, protected by privacy
laws, or otherwise restricted from disclosure to anyone else. If you are
not the intended recipient or the person responsible for delivering the
e-mail to the intended recipient, you are hereby notified that any use,
copying, distributing, dissemination, forwarding, printing, or copying of
this e-mail is strictly prohibited. If you received this e-mail in error,
please return the e-mail to the sender, delete it from your computer, and
destroy any printed copy of it.


Attachments:
smime.p7s (4.13 kB)
S/MIME Cryptographic Signature

2024-03-22 09:38:53

by Jonathan Cameron

[permalink] [raw]
Subject: Re: RFC: Restricting userspace interfaces for CXL fabric management

On Thu, 21 Mar 2024 14:41:00 -0700
Sreenivas Bagalkote <[email protected]> wrote:

> Thank you for kicking off this discussion, Jonathan.

Hi Sreenivas,

>
> We need guidance from the community.
>
> 1. Datacenter customers must be able to manage PCIe switches in-band.

What is the use case? My understanding so far is that clouds and
similar sometimes use an in band path but it would be from a management
only host, not a general purpose host running other software. Sure
that control host just connects to a different upstream port so, from
a switch point of view, it's the same as any other host. From a host
software point of view it's not running general cloud workloads or
(at least in most cases) a general purpose OS distribution.

This is the key question behind this discussion.

> 2. Management of switches includes getting health, performance, and error
> telemetry.

For telemetry(subject to any odd corners like commands that might lock
the interface up for a long time, which we've seen with commands in the
Spec!) I don't see any problem supporting those on all host software.
They should be non destructive to other hosts etc.

> 3. These telemetry functions are not yet part of the CXL standard

Ok, so this we should try to pin down the boundaries around this.
The thread linked below lays out the reasoning behind a general rule
of not accepting vendor defined commands, but perhaps there are routes
to answer some of those concerns.

'Maybe' if you were to publish a specification for those particular
vendor defined commands, it might be fine to add them to the allow list
for the switch-cci. Key here is that Broadcom would be committing to not
using those particular opcodes from the vendor space for anything else
in the future (so we could match on VID + opcode). This is similar to
some DVSEC usage in PCIe (and why DVSEC is different from VSEC).

Effectively you'd be publishing an additional specification building on CXL.
Those are expected to surface anyway from various standards orgs - should
we treat a company published one differently? I don't see why.
Exactly how this would work might take some figuring out (in main code,
separate driver module etc?)

That specification would be expected to provide a similar level of detail
to CXL spec defined commands (ideally the less vague ones, but meh, up to
you as long as any side effects are clearly documented!)

Speaking for myself, I'd consider this approach.
Particularly true if I see clear effort in the standards org to push
these into future specifications as that shows broadcom are trying to
enhance the ecosystems.


> 4. We built the CCI mailboxes into our PCIe switches per CXL spec and
> developed our management scheme around them.
>
> If the Linux community does not allow a CXL spec-compliant switch to be
> managed via the CXL spec-defined CCI mailbox, then please guide us on
> the right approach. Please tell us how you propose we manage our switches
> in-band.

The Linux community is fine supporting this in the kernel (the BMC or
Fabric Management only host case - option 2 below, so the code will be there)
the question here is what advice we offer to the general purpose
distributions and what protections we need to put in place to mitigate the
'blast radius' concerns.

Jonathan
>
> Thank you
> Sreeni
>
> On Thu, Mar 21, 2024 at 10:44 AM Jonathan Cameron <
> [email protected]> wrote:
>
> > Hi All,
> >
> > This is has come up in a number of discussions both on list and in private,
> > so I wanted to lay out a potential set of rules when deciding whether or
> > not
> > to provide a user space interface for a particular feature of CXL Fabric
> > Management. The intent is to drive discussion, not to simply tell people
> > a set of rules. I've brought this to the public lists as it's a Linux
> > kernel
> > policy discussion, not a standards one.
> >
> > Whilst I'm writing the RFC this my attempt to summarize a possible
> > position rather than necessarily being my personal view.
> >
> > It's a straw man - shoot at it!
> >
> > Not everyone in this discussion is familiar with relevant kernel or CXL
> > concepts
> > so I've provided more info than I normally would.
> >
> > First some background:
> > ======================
> >
> > CXL has two different types of Fabric. The comments here refer to both, but
> > for now the kernel stack is focused on the simpler VCS fabric, not the more
> > recent Port Based Routing (PBR) Fabrics. A typical example for 2 hosts
> > connected to a common switch looks something like:
> >
> > ________________ _______________
> > | | | | Hosts - each sees
> > | HOST A | | HOST B | a PCIe style tree
> > | | | | but from a fabric
> > config
> > | |Root Port| | | |Root Port| | point of view it's more
> > -------|-------- -------|------- complex.
> > | |
> > | |
> > _______|______________________________|________
> > | USP (SW-CCI) USP | Switch can have lots of
> > | | | | Upstream Ports. Each one
> > | ____|________ _______|______ | has a virtual hierarchy.
> > | | | | | |
> > | vPPB vPPB vPPB vPPB| There are virtual
> > | x | | | | "downstream
> > ports."(vPPBs)
> > | \ / / | That can be bound to real
> > | \ / / | downstream ports.
> > | \ / / |
> > | \ / / | Multi Logical Devices are
> > | DSP0 DSP1 DSP 2 | support more than one
> > vPPB
> > ------------------------------------------------ bound to a single
> > physical
> > | | | DSP (transactions are
> > tagged
> > | | | with an LD-ID)
> > SLD0 MLD0 SLD1
> >
> > Some typical fabric management activities:
> > 1) Bind/Unbind vPPB to physical DSP (Results in hotplug / unplug events)
> > 2) Access config space or BAR space of End Points below the switch.
> > 3) Tunneling messages through to devices downstream (e.g Dynamic Capacity
> > Forced Remove that will blow away some memory even if a host is using
> > it).
> > 4) Non destructive stuff like status read back.
> >
> > Given the hosts may be using the Type 3 hosted memory (either Single
> > Logical
> > Device - SLD, or an LD on a Multi logical Device - MLD) as normal memory,
> > unbinding a device in use can result in the memory access from a
> > different host being removed. The 'blast radius' is perhaps a rack of
> > servers. This discussion applies equally to FM-API commands sent to Multi
> > Head Devices (see CXL r3.1).
> >
> > The Fabric Management actions are done using the CXL spec defined Fabric
> > Management API, (FM-API) which is transported over various means including
> > OoB MCTP over your favourite transport (I2C, PCIe-VDM...) or via normal
> > PCIe read/write to a Switch-CCI. A Switch-CCI is mailbox in PCI BAR
> > space on a function found alongside one of the switch upstream ports;
> > this mailbox is very similar to the MMPT definition found in PCIe r6.2.
> >
> > In many cases this switch CCI / MCTP connection is used by a BMC rather
> > than a normal host, but there have been some questions raised about whether
> > a general purpose server OS would have a valid reason to use this interface
> > (beyond debug and testing) to configure the switch or an MHD.
> >
> > If people have a use case for this, please reply to this thread to give
> > more details.
> >
> > The most recently posted CXL Switch-CCI support only provided the RAW CXL
> > command IOCTL interface that is already available for Type 3 memory
> > devices.
> > That allows for unfettered control of the switch but, because it is
> > extremely easy to shoot yourself in the foot and cause unsolvable bug
> > reports,
> > it taints the kernel. There have been several requests to provide this
> > interface
> > without the taint for these switch configuration mailboxes.
> >
> > Last posted series:
> >
> > https://lore.kernel.org/all/[email protected]/
> > Note there are unrelated reasons why that code hasn't been updated since
> > v6.6 time,
> > but I am planning to get back to it shortly.
> >
> > Similar issues will occur for other uses of PCIe MMPT (new mailbox in PCI
> > that
> > sometimes is used for similarly destructive activity such as PLDM based
> > firmware update).
> >
> >
> > On to the proposed rules:
> >
> > 1) Kernel space use of the various mailboxes, or filtered controls from
> > user space.
> >
> > ==================================================================================
> >
> > Absolutely fine - no one worries about this, but the mediated traffic will
> > be filtered for potentially destructive side effects. E.g. it will reject
> > attempts to change anything routing related if the kernel either knows a
> > host is
> > using memory that will be blown away, or has no way to know (so affecting
> > routing to another host). This includes blocking 'all' vendor defined
> > messages as we have no idea what the do. Note this means the kernel has
> > an allow list and new commands are not initially allowed.
> >
> > This isn't currently enabled for Switch CCIs because they are only really
> > interesting if the potentially destructive stuff is available (an earlier
> > version did enable query commands, but it wasn't particularly useful to
> > know what your switch could do but not be allowed to do any of it).
> > If you take a MMPT usecase of PLDM firmware update, the filtering would
> > check that the device was in a state where a firmware update won't rip
> > memory out from under a host, which would be messy if that host is
> > doing the update.
> >
> > 2) Unfiltered userspace use of mailbox for Fabric Management - BMC kernels
> > ==========================================================================
> >
> > (This would just be a kernel option that we'd advise normal server
> > distributions not to turn on. Would be enabled by openBMC etc)
> >
> > This is fine - there is some work to do, but the switch-cci PCI driver
> > will hopefully be ready for upstream merge soon. There is no filtering of
> > accesses. Think of this as similar to all the damage you can do via
> > MCTP from a BMC. Similarly it is likely that much of the complexity
> > of the actual commands will be left to user space tooling:
> > https://gitlab.com/jic23/cxl-fmapi-tests has some test examples.
> >
> > Whether Kconfig help text is strong enough to ensure this only gets
> > enabled for BMC targeted distros is an open question we can address
> > alongside an updated patch set.
> >
> > (On to the one that the "debate" is about)
> >
> > 3) Unfiltered user space use of mailbox for Fabric Management - Distro
> > kernels
> >
> > =============================================================================
> > (General purpose Linux Server Distro (Redhat, Suse etc))
> >
> > This is equivalent of RAW command support on CXL Type 3 memory devices.
> > You can enable those in a distro kernel build despite the scary config
> > help text, but if you use it the kernel is tainted. The result
> > of the taint is to add a flag to bug reports and print a big message to say
> > that you've used a feature that might result in you shooting yourself
> > in the foot.
> >
> > The taint is there because software is not at first written to deal with
> > everything that can happen smoothly (e.g. surprise removal) It's hard
> > to survive some of these events, so is never on the initial feature list
> > for any bus, so this flag is just to indicate we have entered a world
> > where almost all bets are off wrt to stability. We might not know what
> > a command does so we can't assess the impact (and no one trusts vendor
> > commands to report affects right in the Command Effects Log - which
> > in theory tells you if a command can result problems).
> >
> > A concern was raised about GAE/FAST/LDST tables for CXL Fabrics
> > (a r3.1 feature) but, as I understand it, these are intended for a
> > host to configure and should not have side effects on other hosts?
> > My working assumption is that the kernel driver stack will handle
> > these (once we catch up with the current feature backlog!) Currently
> > we have no visibility of what the OS driver stack for a fabrics will
> > actually look like - the spec is just the starting point for that.
> > (patches welcome ;)
> >
> > The various CXL upstream developers and maintainers may have
> > differing views of course, but my current understanding is we want
> > to support 1 and 2, but are very resistant to 3!
> >
> > General Notes
> > =============
> >
> > One side aspect of why we really don't like unfiltered userspace access to
> > any
> > of these devices is that people start building non standard hacks in and we
> > lose the ecosystem advantages. Forcing a considered discussion + patches
> > to let a particular command be supported, drives standardization.
> >
> >
> > https://lore.kernel.org/linux-cxl/CAPcyv4gDShAYih5iWabKg_eTHhuHm54vEAei8ZkcmHnPp3B0cw@mail.gmail.com/
> > provides some history on vendor specific extensions and why in general we
> > won't support them upstream.
> >
> > To address another question raised in an earlier discussion:
> > Putting these Fabric Management interfaces behind guard rails of some type
> > (e.g. CONFIG_IM_A_BMC_AND_CAN_MAKE_A_MESS) does not encourage the risk
> > of non standard interfaces, because we will be even less likely to accept
> > those upstream!
> >
> > If anyone needs more details on any aspect of this please ask.
> > There are a lot of things involved and I've only tried to give a fairly
> > minimal illustration to drive the discussion. I may well have missed
> > something crucial.
> >
> > Jonathan
> >
> >
>


2024-03-22 13:25:21

by Sreenivas Bagalkote

[permalink] [raw]
Subject: Re: RFC: Restricting userspace interfaces for CXL fabric management

Jonathan,

>
> What is the use case? My understanding so far is that clouds and
> similar sometimes use an in band path but it would be from a management
> only host, not a general purpose host running other software
>

The overwhelming majority of the PCIe switches get deployed in a single
server. Typically four to eight switches are connected to two or more root
complexes in one or two CPUs. The deployment scenario you have in mind -
multiple physical hosts running general workloads and a management-only
host - exists. But it is insignificant.

>
> For telemetry(subject to any odd corners like commands that might lock
> the interface up for a long time, which we've seen with commands in the
> Spec!) I don't see any problem supporting those on all host software.
> They should be non destructive to other hosts etc.
>

Thank you. As you do this, please keep in mind that your concern about not
affecting "other" hosts is theoretically valid but doesn't exist in the
real world beyond science experiments. If there are real-world deployments,
they are insignificant. I urge you all to make your stuff work with 99.99%
of the deployments.

>
> 'Maybe' if you were to publish a specification for those particular
> vendor defined commands, it might be fine to add them to the allow list
> for the switch-cci.
>

Your proposal sounds reasonable. I will let you all experts figure out how
to support the vendor-defined commands. CXL spec has them for a reason and
they need to be supported.

Sreeni

On Fri, Mar 22, 2024 at 2:32 AM Jonathan Cameron <
[email protected]> wrote:

> On Thu, 21 Mar 2024 14:41:00 -0700
> Sreenivas Bagalkote <[email protected]> wrote:
>
> > Thank you for kicking off this discussion, Jonathan.
>
> Hi Sreenivas,
>
> >
> > We need guidance from the community.
> >
> > 1. Datacenter customers must be able to manage PCIe switches in-band.
>
> What is the use case? My understanding so far is that clouds and
> similar sometimes use an in band path but it would be from a management
> only host, not a general purpose host running other software. Sure
> that control host just connects to a different upstream port so, from
> a switch point of view, it's the same as any other host. From a host
> software point of view it's not running general cloud workloads or
> (at least in most cases) a general purpose OS distribution.
>
> This is the key question behind this discussion.
>
> > 2. Management of switches includes getting health, performance, and error
> > telemetry.
>
> For telemetry(subject to any odd corners like commands that might lock
> the interface up for a long time, which we've seen with commands in the
> Spec!) I don't see any problem supporting those on all host software.
> They should be non destructive to other hosts etc.
>
> > 3. These telemetry functions are not yet part of the CXL standard
>
> Ok, so this we should try to pin down the boundaries around this.
> The thread linked below lays out the reasoning behind a general rule
> of not accepting vendor defined commands, but perhaps there are routes
> to answer some of those concerns.
>
> 'Maybe' if you were to publish a specification for those particular
> vendor defined commands, it might be fine to add them to the allow list
> for the switch-cci. Key here is that Broadcom would be committing to not
> using those particular opcodes from the vendor space for anything else
> in the future (so we could match on VID + opcode). This is similar to
> some DVSEC usage in PCIe (and why DVSEC is different from VSEC).
>
> Effectively you'd be publishing an additional specification building on
> CXL.
> Those are expected to surface anyway from various standards orgs - should
> we treat a company published one differently? I don't see why.
> Exactly how this would work might take some figuring out (in main code,
> separate driver module etc?)
>
> That specification would be expected to provide a similar level of detail
> to CXL spec defined commands (ideally the less vague ones, but meh, up to
> you as long as any side effects are clearly documented!)
>
> Speaking for myself, I'd consider this approach.
> Particularly true if I see clear effort in the standards org to push
> these into future specifications as that shows broadcom are trying to
> enhance the ecosystems.
>
>
> > 4. We built the CCI mailboxes into our PCIe switches per CXL spec and
> > developed our management scheme around them.
> >
> > If the Linux community does not allow a CXL spec-compliant switch to be
> > managed via the CXL spec-defined CCI mailbox, then please guide us on
> > the right approach. Please tell us how you propose we manage our switches
> > in-band.
>
> The Linux community is fine supporting this in the kernel (the BMC or
> Fabric Management only host case - option 2 below, so the code will be
> there)
> the question here is what advice we offer to the general purpose
> distributions and what protections we need to put in place to mitigate the
> 'blast radius' concerns.
>
> Jonathan
> >
> > Thank you
> > Sreeni
> >
> > On Thu, Mar 21, 2024 at 10:44 AM Jonathan Cameron <
> > [email protected]> wrote:
> >
> > > Hi All,
> > >
> > > This is has come up in a number of discussions both on list and in
> private,
> > > so I wanted to lay out a potential set of rules when deciding whether
> or
> > > not
> > > to provide a user space interface for a particular feature of CXL
> Fabric
> > > Management. The intent is to drive discussion, not to simply tell
> people
> > > a set of rules. I've brought this to the public lists as it's a Linux
> > > kernel
> > > policy discussion, not a standards one.
> > >
> > > Whilst I'm writing the RFC this my attempt to summarize a possible
> > > position rather than necessarily being my personal view.
> > >
> > > It's a straw man - shoot at it!
> > >
> > > Not everyone in this discussion is familiar with relevant kernel or CXL
> > > concepts
> > > so I've provided more info than I normally would.
> > >
> > > First some background:
> > > ======================
> > >
> > > CXL has two different types of Fabric. The comments here refer to
> both, but
> > > for now the kernel stack is focused on the simpler VCS fabric, not the
> more
> > > recent Port Based Routing (PBR) Fabrics. A typical example for 2 hosts
> > > connected to a common switch looks something like:
> > >
> > > ________________ _______________
> > > | | | | Hosts - each sees
> > > | HOST A | | HOST B | a PCIe style tree
> > > | | | | but from a fabric
> > > config
> > > | |Root Port| | | |Root Port| | point of view it's
> more
> > > -------|-------- -------|------- complex.
> > > | |
> > > | |
> > > _______|______________________________|________
> > > | USP (SW-CCI) USP | Switch can have lots
> of
> > > | | | | Upstream Ports. Each
> one
> > > | ____|________ _______|______ | has a virtual
> hierarchy.
> > > | | | | | |
> > > | vPPB vPPB vPPB vPPB| There are virtual
> > > | x | | | | "downstream
> > > ports."(vPPBs)
> > > | \ / / | That can be bound to
> real
> > > | \ / / | downstream ports.
> > > | \ / / |
> > > | \ / / | Multi Logical
> Devices are
> > > | DSP0 DSP1 DSP 2 | support more than one
> > > vPPB
> > > ------------------------------------------------ bound to a single
> > > physical
> > > | | | DSP (transactions are
> > > tagged
> > > | | | with an LD-ID)
> > > SLD0 MLD0 SLD1
> > >
> > > Some typical fabric management activities:
> > > 1) Bind/Unbind vPPB to physical DSP (Results in hotplug / unplug
> events)
> > > 2) Access config space or BAR space of End Points below the switch.
> > > 3) Tunneling messages through to devices downstream (e.g Dynamic
> Capacity
> > > Forced Remove that will blow away some memory even if a host is
> using
> > > it).
> > > 4) Non destructive stuff like status read back.
> > >
> > > Given the hosts may be using the Type 3 hosted memory (either Single
> > > Logical
> > > Device - SLD, or an LD on a Multi logical Device - MLD) as normal
> memory,
> > > unbinding a device in use can result in the memory access from a
> > > different host being removed. The 'blast radius' is perhaps a rack of
> > > servers. This discussion applies equally to FM-API commands sent to
> Multi
> > > Head Devices (see CXL r3.1).
> > >
> > > The Fabric Management actions are done using the CXL spec defined
> Fabric
> > > Management API, (FM-API) which is transported over various means
> including
> > > OoB MCTP over your favourite transport (I2C, PCIe-VDM...) or via normal
> > > PCIe read/write to a Switch-CCI. A Switch-CCI is mailbox in PCI BAR
> > > space on a function found alongside one of the switch upstream ports;
> > > this mailbox is very similar to the MMPT definition found in PCIe r6.2.
> > >
> > > In many cases this switch CCI / MCTP connection is used by a BMC rather
> > > than a normal host, but there have been some questions raised about
> whether
> > > a general purpose server OS would have a valid reason to use this
> interface
> > > (beyond debug and testing) to configure the switch or an MHD.
> > >
> > > If people have a use case for this, please reply to this thread to give
> > > more details.
> > >
> > > The most recently posted CXL Switch-CCI support only provided the RAW
> CXL
> > > command IOCTL interface that is already available for Type 3 memory
> > > devices.
> > > That allows for unfettered control of the switch but, because it is
> > > extremely easy to shoot yourself in the foot and cause unsolvable bug
> > > reports,
> > > it taints the kernel. There have been several requests to provide this
> > > interface
> > > without the taint for these switch configuration mailboxes.
> > >
> > > Last posted series:
> > >
> > >
> https://lore.kernel.org/all/[email protected]/
> > > Note there are unrelated reasons why that code hasn't been updated
> since
> > > v6.6 time,
> > > but I am planning to get back to it shortly.
> > >
> > > Similar issues will occur for other uses of PCIe MMPT (new mailbox in
> PCI
> > > that
> > > sometimes is used for similarly destructive activity such as PLDM based
> > > firmware update).
> > >
> > >
> > > On to the proposed rules:
> > >
> > > 1) Kernel space use of the various mailboxes, or filtered controls from
> > > user space.
> > >
> > >
> ==================================================================================
> > >
> > > Absolutely fine - no one worries about this, but the mediated traffic
> will
> > > be filtered for potentially destructive side effects. E.g. it will
> reject
> > > attempts to change anything routing related if the kernel either knows
> a
> > > host is
> > > using memory that will be blown away, or has no way to know (so
> affecting
> > > routing to another host). This includes blocking 'all' vendor defined
> > > messages as we have no idea what the do. Note this means the kernel
> has
> > > an allow list and new commands are not initially allowed.
> > >
> > > This isn't currently enabled for Switch CCIs because they are only
> really
> > > interesting if the potentially destructive stuff is available (an
> earlier
> > > version did enable query commands, but it wasn't particularly useful to
> > > know what your switch could do but not be allowed to do any of it).
> > > If you take a MMPT usecase of PLDM firmware update, the filtering would
> > > check that the device was in a state where a firmware update won't rip
> > > memory out from under a host, which would be messy if that host is
> > > doing the update.
> > >
> > > 2) Unfiltered userspace use of mailbox for Fabric Management - BMC
> kernels
> > >
> ==========================================================================
> > >
> > > (This would just be a kernel option that we'd advise normal server
> > > distributions not to turn on. Would be enabled by openBMC etc)
> > >
> > > This is fine - there is some work to do, but the switch-cci PCI driver
> > > will hopefully be ready for upstream merge soon. There is no filtering
> of
> > > accesses. Think of this as similar to all the damage you can do via
> > > MCTP from a BMC. Similarly it is likely that much of the complexity
> > > of the actual commands will be left to user space tooling:
> > > https://gitlab.com/jic23/cxl-fmapi-tests has some test examples.
> > >
> > > Whether Kconfig help text is strong enough to ensure this only gets
> > > enabled for BMC targeted distros is an open question we can address
> > > alongside an updated patch set.
> > >
> > > (On to the one that the "debate" is about)
> > >
> > > 3) Unfiltered user space use of mailbox for Fabric Management - Distro
> > > kernels
> > >
> > >
> =============================================================================
> > > (General purpose Linux Server Distro (Redhat, Suse etc))
> > >
> > > This is equivalent of RAW command support on CXL Type 3 memory devices.
> > > You can enable those in a distro kernel build despite the scary config
> > > help text, but if you use it the kernel is tainted. The result
> > > of the taint is to add a flag to bug reports and print a big message
> to say
> > > that you've used a feature that might result in you shooting yourself
> > > in the foot.
> > >
> > > The taint is there because software is not at first written to deal
> with
> > > everything that can happen smoothly (e.g. surprise removal) It's hard
> > > to survive some of these events, so is never on the initial feature
> list
> > > for any bus, so this flag is just to indicate we have entered a world
> > > where almost all bets are off wrt to stability. We might not know what
> > > a command does so we can't assess the impact (and no one trusts vendor
> > > commands to report affects right in the Command Effects Log - which
> > > in theory tells you if a command can result problems).
> > >
> > > A concern was raised about GAE/FAST/LDST tables for CXL Fabrics
> > > (a r3.1 feature) but, as I understand it, these are intended for a
> > > host to configure and should not have side effects on other hosts?
> > > My working assumption is that the kernel driver stack will handle
> > > these (once we catch up with the current feature backlog!) Currently
> > > we have no visibility of what the OS driver stack for a fabrics will
> > > actually look like - the spec is just the starting point for that.
> > > (patches welcome ;)
> > >
> > > The various CXL upstream developers and maintainers may have
> > > differing views of course, but my current understanding is we want
> > > to support 1 and 2, but are very resistant to 3!
> > >
> > > General Notes
> > > =============
> > >
> > > One side aspect of why we really don't like unfiltered userspace
> access to
> > > any
> > > of these devices is that people start building non standard hacks in
> and we
> > > lose the ecosystem advantages. Forcing a considered discussion +
> patches
> > > to let a particular command be supported, drives standardization.
> > >
> > >
> > >
> https://lore.kernel.org/linux-cxl/CAPcyv4gDShAYih5iWabKg_eTHhuHm54vEAei8ZkcmHnPp3B0cw@mail.gmail.com/
> > > provides some history on vendor specific extensions and why in general
> we
> > > won't support them upstream.
> > >
> > > To address another question raised in an earlier discussion:
> > > Putting these Fabric Management interfaces behind guard rails of some
> type
> > > (e.g. CONFIG_IM_A_BMC_AND_CAN_MAKE_A_MESS) does not encourage the risk
> > > of non standard interfaces, because we will be even less likely to
> accept
> > > those upstream!
> > >
> > > If anyone needs more details on any aspect of this please ask.
> > > There are a lot of things involved and I've only tried to give a fairly
> > > minimal illustration to drive the discussion. I may well have missed
> > > something crucial.
> > >
> > > Jonathan
> > >
> > >
> >
>
>

--
This electronic communication and the information and any files transmitted
with it, or attached to it, are confidential and are intended solely for
the use of the individual or entity to whom it is addressed and may contain
information that is confidential, legally privileged, protected by privacy
laws, or otherwise restricted from disclosure to anyone else. If you are
not the intended recipient or the person responsible for delivering the
e-mail to the intended recipient, you are hereby notified that any use,
copying, distributing, dissemination, forwarding, printing, or copying of
this e-mail is strictly prohibited. If you received this e-mail in error,
please return the e-mail to the sender, delete it from your computer, and
destroy any printed copy of it.


Attachments:
smime.p7s (4.13 kB)
S/MIME Cryptographic Signature

2024-04-01 16:53:08

by Sreenivas Bagalkote

[permalink] [raw]
Subject: Re: RFC: Restricting userspace interfaces for CXL fabric management

Hello Dan, Jonathan -

> >
> > 'Maybe' if you were to publish a specification for those particular
> > vendor defined commands, it might be fine to add them to the allow list
> > for the switch-cci.
> >
>
> Your proposal sounds reasonable. I will let you all experts figure out
how to support the vendor-defined commands. CXL spec has them for a reason
and they need to be supported.

Please confirm that you will support an "allow list" of the vendor-defined
commands. We will publish it once you confirm.

Thank you
Sreeni
Sreenivas Bagalkote <[email protected]>
Product Planning & Management
Broadcom Datacenter Solutions Group

On Fri, Mar 22, 2024 at 6:54 PM Sreenivas Bagalkote <
[email protected]> wrote:

> Jonathan,
>
> >
> > What is the use case? My understanding so far is that clouds and
> > similar sometimes use an in band path but it would be from a management
> > only host, not a general purpose host running other software
> >
>
> The overwhelming majority of the PCIe switches get deployed in a single
> server. Typically four to eight switches are connected to two or more root
> complexes in one or two CPUs. The deployment scenario you have in mind -
> multiple physical hosts running general workloads and a management-only
> host - exists. But it is insignificant.
>
> >
> > For telemetry(subject to any odd corners like commands that might lock
> > the interface up for a long time, which we've seen with commands in the
> > Spec!) I don't see any problem supporting those on all host software.
> > They should be non destructive to other hosts etc.
> >
>
> Thank you. As you do this, please keep in mind that your concern about not
> affecting "other" hosts is theoretically valid but doesn't exist in the
> real world beyond science experiments. If there are real-world deployments,
> they are insignificant. I urge you all to make your stuff work with 99.99%
> of the deployments.
>
> >
> > 'Maybe' if you were to publish a specification for those particular
> > vendor defined commands, it might be fine to add them to the allow list
> > for the switch-cci.
> >
>
> Your proposal sounds reasonable. I will let you all experts figure out how
> to support the vendor-defined commands. CXL spec has them for a reason and
> they need to be supported.
>
> Sreeni
>
> On Fri, Mar 22, 2024 at 2:32 AM Jonathan Cameron <
> [email protected]> wrote:
>
>> On Thu, 21 Mar 2024 14:41:00 -0700
>> Sreenivas Bagalkote <[email protected]> wrote:
>>
>> > Thank you for kicking off this discussion, Jonathan.
>>
>> Hi Sreenivas,
>>
>> >
>> > We need guidance from the community.
>> >
>> > 1. Datacenter customers must be able to manage PCIe switches in-band.
>>
>> What is the use case? My understanding so far is that clouds and
>> similar sometimes use an in band path but it would be from a management
>> only host, not a general purpose host running other software. Sure
>> that control host just connects to a different upstream port so, from
>> a switch point of view, it's the same as any other host. From a host
>> software point of view it's not running general cloud workloads or
>> (at least in most cases) a general purpose OS distribution.
>>
>> This is the key question behind this discussion.
>>
>> > 2. Management of switches includes getting health, performance, and
>> error
>> > telemetry.
>>
>> For telemetry(subject to any odd corners like commands that might lock
>> the interface up for a long time, which we've seen with commands in the
>> Spec!) I don't see any problem supporting those on all host software.
>> They should be non destructive to other hosts etc.
>>
>> > 3. These telemetry functions are not yet part of the CXL standard
>>
>> Ok, so this we should try to pin down the boundaries around this.
>> The thread linked below lays out the reasoning behind a general rule
>> of not accepting vendor defined commands, but perhaps there are routes
>> to answer some of those concerns.
>>
>> 'Maybe' if you were to publish a specification for those particular
>> vendor defined commands, it might be fine to add them to the allow list
>> for the switch-cci. Key here is that Broadcom would be committing to not
>> using those particular opcodes from the vendor space for anything else
>> in the future (so we could match on VID + opcode). This is similar to
>> some DVSEC usage in PCIe (and why DVSEC is different from VSEC).
>>
>> Effectively you'd be publishing an additional specification building on
>> CXL.
>> Those are expected to surface anyway from various standards orgs - should
>> we treat a company published one differently? I don't see why.
>> Exactly how this would work might take some figuring out (in main code,
>> separate driver module etc?)
>>
>> That specification would be expected to provide a similar level of detail
>> to CXL spec defined commands (ideally the less vague ones, but meh, up to
>> you as long as any side effects are clearly documented!)
>>
>> Speaking for myself, I'd consider this approach.
>> Particularly true if I see clear effort in the standards org to push
>> these into future specifications as that shows broadcom are trying to
>> enhance the ecosystems.
>>
>>
>> > 4. We built the CCI mailboxes into our PCIe switches per CXL spec and
>> > developed our management scheme around them.
>> >
>> > If the Linux community does not allow a CXL spec-compliant switch to be
>> > managed via the CXL spec-defined CCI mailbox, then please guide us on
>> > the right approach. Please tell us how you propose we manage our
>> switches
>> > in-band.
>>
>> The Linux community is fine supporting this in the kernel (the BMC or
>> Fabric Management only host case - option 2 below, so the code will be
>> there)
>> the question here is what advice we offer to the general purpose
>> distributions and what protections we need to put in place to mitigate the
>> 'blast radius' concerns.
>>
>> Jonathan
>> >
>> > Thank you
>> > Sreeni
>> >
>> > On Thu, Mar 21, 2024 at 10:44 AM Jonathan Cameron <
>> > [email protected]> wrote:
>> >
>> > > Hi All,
>> > >
>> > > This is has come up in a number of discussions both on list and in
>> private,
>> > > so I wanted to lay out a potential set of rules when deciding whether
>> or
>> > > not
>> > > to provide a user space interface for a particular feature of CXL
>> Fabric
>> > > Management. The intent is to drive discussion, not to simply tell
>> people
>> > > a set of rules. I've brought this to the public lists as it's a Linux
>> > > kernel
>> > > policy discussion, not a standards one.
>> > >
>> > > Whilst I'm writing the RFC this my attempt to summarize a possible
>> > > position rather than necessarily being my personal view.
>> > >
>> > > It's a straw man - shoot at it!
>> > >
>> > > Not everyone in this discussion is familiar with relevant kernel or
>> CXL
>> > > concepts
>> > > so I've provided more info than I normally would.
>> > >
>> > > First some background:
>> > > ======================
>> > >
>> > > CXL has two different types of Fabric. The comments here refer to
>> both, but
>> > > for now the kernel stack is focused on the simpler VCS fabric, not
>> the more
>> > > recent Port Based Routing (PBR) Fabrics. A typical example for 2 hosts
>> > > connected to a common switch looks something like:
>> > >
>> > > ________________ _______________
>> > > | | | | Hosts - each sees
>> > > | HOST A | | HOST B | a PCIe style tree
>> > > | | | | but from a fabric
>> > > config
>> > > | |Root Port| | | |Root Port| | point of view
>> it's more
>> > > -------|-------- -------|------- complex.
>> > > | |
>> > > | |
>> > > _______|______________________________|________
>> > > | USP (SW-CCI) USP | Switch can have
>> lots of
>> > > | | | | Upstream Ports.
>> Each one
>> > > | ____|________ _______|______ | has a virtual
>> hierarchy.
>> > > | | | | | |
>> > > | vPPB vPPB vPPB vPPB| There are virtual
>> > > | x | | | | "downstream
>> > > ports."(vPPBs)
>> > > | \ / / | That can be bound
>> to real
>> > > | \ / / | downstream ports.
>> > > | \ / / |
>> > > | \ / / | Multi Logical
>> Devices are
>> > > | DSP0 DSP1 DSP 2 | support more than
>> one
>> > > vPPB
>> > > ------------------------------------------------ bound to a single
>> > > physical
>> > > | | | DSP (transactions
>> are
>> > > tagged
>> > > | | | with an LD-ID)
>> > > SLD0 MLD0 SLD1
>> > >
>> > > Some typical fabric management activities:
>> > > 1) Bind/Unbind vPPB to physical DSP (Results in hotplug / unplug
>> events)
>> > > 2) Access config space or BAR space of End Points below the switch.
>> > > 3) Tunneling messages through to devices downstream (e.g Dynamic
>> Capacity
>> > > Forced Remove that will blow away some memory even if a host is
>> using
>> > > it).
>> > > 4) Non destructive stuff like status read back.
>> > >
>> > > Given the hosts may be using the Type 3 hosted memory (either Single
>> > > Logical
>> > > Device - SLD, or an LD on a Multi logical Device - MLD) as normal
>> memory,
>> > > unbinding a device in use can result in the memory access from a
>> > > different host being removed. The 'blast radius' is perhaps a rack of
>> > > servers. This discussion applies equally to FM-API commands sent to
>> Multi
>> > > Head Devices (see CXL r3.1).
>> > >
>> > > The Fabric Management actions are done using the CXL spec defined
>> Fabric
>> > > Management API, (FM-API) which is transported over various means
>> including
>> > > OoB MCTP over your favourite transport (I2C, PCIe-VDM...) or via
>> normal
>> > > PCIe read/write to a Switch-CCI. A Switch-CCI is mailbox in PCI BAR
>> > > space on a function found alongside one of the switch upstream ports;
>> > > this mailbox is very similar to the MMPT definition found in PCIe
>> r6.2.
>> > >
>> > > In many cases this switch CCI / MCTP connection is used by a BMC
>> rather
>> > > than a normal host, but there have been some questions raised about
>> whether
>> > > a general purpose server OS would have a valid reason to use this
>> interface
>> > > (beyond debug and testing) to configure the switch or an MHD.
>> > >
>> > > If people have a use case for this, please reply to this thread to
>> give
>> > > more details.
>> > >
>> > > The most recently posted CXL Switch-CCI support only provided the RAW
>> CXL
>> > > command IOCTL interface that is already available for Type 3 memory
>> > > devices.
>> > > That allows for unfettered control of the switch but, because it is
>> > > extremely easy to shoot yourself in the foot and cause unsolvable bug
>> > > reports,
>> > > it taints the kernel. There have been several requests to provide this
>> > > interface
>> > > without the taint for these switch configuration mailboxes.
>> > >
>> > > Last posted series:
>> > >
>> > >
>> https://lore.kernel.org/all/[email protected]/
>> > > Note there are unrelated reasons why that code hasn't been updated
>> since
>> > > v6.6 time,
>> > > but I am planning to get back to it shortly.
>> > >
>> > > Similar issues will occur for other uses of PCIe MMPT (new mailbox in
>> PCI
>> > > that
>> > > sometimes is used for similarly destructive activity such as PLDM
>> based
>> > > firmware update).
>> > >
>> > >
>> > > On to the proposed rules:
>> > >
>> > > 1) Kernel space use of the various mailboxes, or filtered controls
>> from
>> > > user space.
>> > >
>> > >
>> ==================================================================================
>> > >
>> > > Absolutely fine - no one worries about this, but the mediated traffic
>> will
>> > > be filtered for potentially destructive side effects. E.g. it will
>> reject
>> > > attempts to change anything routing related if the kernel either
>> knows a
>> > > host is
>> > > using memory that will be blown away, or has no way to know (so
>> affecting
>> > > routing to another host). This includes blocking 'all' vendor defined
>> > > messages as we have no idea what the do. Note this means the kernel
>> has
>> > > an allow list and new commands are not initially allowed.
>> > >
>> > > This isn't currently enabled for Switch CCIs because they are only
>> really
>> > > interesting if the potentially destructive stuff is available (an
>> earlier
>> > > version did enable query commands, but it wasn't particularly useful
>> to
>> > > know what your switch could do but not be allowed to do any of it).
>> > > If you take a MMPT usecase of PLDM firmware update, the filtering
>> would
>> > > check that the device was in a state where a firmware update won't rip
>> > > memory out from under a host, which would be messy if that host is
>> > > doing the update.
>> > >
>> > > 2) Unfiltered userspace use of mailbox for Fabric Management - BMC
>> kernels
>> > >
>> ==========================================================================
>> > >
>> > > (This would just be a kernel option that we'd advise normal server
>> > > distributions not to turn on. Would be enabled by openBMC etc)
>> > >
>> > > This is fine - there is some work to do, but the switch-cci PCI driver
>> > > will hopefully be ready for upstream merge soon. There is no
>> filtering of
>> > > accesses. Think of this as similar to all the damage you can do via
>> > > MCTP from a BMC. Similarly it is likely that much of the complexity
>> > > of the actual commands will be left to user space tooling:
>> > > https://gitlab.com/jic23/cxl-fmapi-tests has some test examples.
>> > >
>> > > Whether Kconfig help text is strong enough to ensure this only gets
>> > > enabled for BMC targeted distros is an open question we can address
>> > > alongside an updated patch set.
>> > >
>> > > (On to the one that the "debate" is about)
>> > >
>> > > 3) Unfiltered user space use of mailbox for Fabric Management - Distro
>> > > kernels
>> > >
>> > >
>> =============================================================================
>> > > (General purpose Linux Server Distro (Redhat, Suse etc))
>> > >
>> > > This is equivalent of RAW command support on CXL Type 3 memory
>> devices.
>> > > You can enable those in a distro kernel build despite the scary config
>> > > help text, but if you use it the kernel is tainted. The result
>> > > of the taint is to add a flag to bug reports and print a big message
>> to say
>> > > that you've used a feature that might result in you shooting yourself
>> > > in the foot.
>> > >
>> > > The taint is there because software is not at first written to deal
>> with
>> > > everything that can happen smoothly (e.g. surprise removal) It's hard
>> > > to survive some of these events, so is never on the initial feature
>> list
>> > > for any bus, so this flag is just to indicate we have entered a world
>> > > where almost all bets are off wrt to stability. We might not know
>> what
>> > > a command does so we can't assess the impact (and no one trusts vendor
>> > > commands to report affects right in the Command Effects Log - which
>> > > in theory tells you if a command can result problems).
>> > >
>> > > A concern was raised about GAE/FAST/LDST tables for CXL Fabrics
>> > > (a r3.1 feature) but, as I understand it, these are intended for a
>> > > host to configure and should not have side effects on other hosts?
>> > > My working assumption is that the kernel driver stack will handle
>> > > these (once we catch up with the current feature backlog!) Currently
>> > > we have no visibility of what the OS driver stack for a fabrics will
>> > > actually look like - the spec is just the starting point for that.
>> > > (patches welcome ;)
>> > >
>> > > The various CXL upstream developers and maintainers may have
>> > > differing views of course, but my current understanding is we want
>> > > to support 1 and 2, but are very resistant to 3!
>> > >
>> > > General Notes
>> > > =============
>> > >
>> > > One side aspect of why we really don't like unfiltered userspace
>> access to
>> > > any
>> > > of these devices is that people start building non standard hacks in
>> and we
>> > > lose the ecosystem advantages. Forcing a considered discussion +
>> patches
>> > > to let a particular command be supported, drives standardization.
>> > >
>> > >
>> > >
>> https://lore.kernel.org/linux-cxl/CAPcyv4gDShAYih5iWabKg_eTHhuHm54vEAei8ZkcmHnPp3B0cw@mail.gmail.com/
>> > > provides some history on vendor specific extensions and why in
>> general we
>> > > won't support them upstream.
>> > >
>> > > To address another question raised in an earlier discussion:
>> > > Putting these Fabric Management interfaces behind guard rails of some
>> type
>> > > (e.g. CONFIG_IM_A_BMC_AND_CAN_MAKE_A_MESS) does not encourage the risk
>> > > of non standard interfaces, because we will be even less likely to
>> accept
>> > > those upstream!
>> > >
>> > > If anyone needs more details on any aspect of this please ask.
>> > > There are a lot of things involved and I've only tried to give a
>> fairly
>> > > minimal illustration to drive the discussion. I may well have missed
>> > > something crucial.
>> > >
>> > > Jonathan
>> > >
>> > >
>> >
>>
>>

--
This electronic communication and the information and any files transmitted
with it, or attached to it, are confidential and are intended solely for
the use of the individual or entity to whom it is addressed and may contain
information that is confidential, legally privileged, protected by privacy
laws, or otherwise restricted from disclosure to anyone else. If you are
not the intended recipient or the person responsible for delivering the
e-mail to the intended recipient, you are hereby notified that any use,
copying, distributing, dissemination, forwarding, printing, or copying of
this e-mail is strictly prohibited. If you received this e-mail in error,
please return the e-mail to the sender, delete it from your computer, and
destroy any printed copy of it.


Attachments:
smime.p7s (4.13 kB)
S/MIME Cryptographic Signature

2024-04-06 00:04:48

by Dan Williams

[permalink] [raw]
Subject: RE: RFC: Restricting userspace interfaces for CXL fabric management

Jonathan Cameron wrote:
> Hi All,
>
> This is has come up in a number of discussions both on list and in private,
> so I wanted to lay out a potential set of rules when deciding whether or not
> to provide a user space interface for a particular feature of CXL Fabric
> Management. The intent is to drive discussion, not to simply tell people
> a set of rules. I've brought this to the public lists as it's a Linux kernel
> policy discussion, not a standards one.
>
> Whilst I'm writing the RFC this my attempt to summarize a possible
> position rather than necessarily being my personal view.
>
> It's a straw man - shoot at it!
>
> Not everyone in this discussion is familiar with relevant kernel or CXL concepts
> so I've provided more info than I normally would.

Thanks for writing this up Jonathan!

[..]
> 2) Unfiltered userspace use of mailbox for Fabric Management - BMC kernels
> ==========================================================================
>
> (This would just be a kernel option that we'd advise normal server
> distributions not to turn on. Would be enabled by openBMC etc)
>
> This is fine - there is some work to do, but the switch-cci PCI driver
> will hopefully be ready for upstream merge soon. There is no filtering of
> accesses. Think of this as similar to all the damage you can do via
> MCTP from a BMC. Similarly it is likely that much of the complexity
> of the actual commands will be left to user space tooling:
> https://gitlab.com/jic23/cxl-fmapi-tests has some test examples.
>
> Whether Kconfig help text is strong enough to ensure this only gets
> enabled for BMC targeted distros is an open question we can address
> alongside an updated patch set.

It is not clear to me that this material makes sense to house in
drivers/ vs tools/ or even out-of-tree just for maintenance burden
relief of keeping the universes separated. What does the Linux kernel
project get out of carrying this in mainline alongside the inband code?

I do think the mailbox refactoring to support non-CXL use cases is
interesting, but only so far as refactoring is consumed for inband use
cases like RAS API.

> (On to the one that the "debate" is about)
>
> 3) Unfiltered user space use of mailbox for Fabric Management - Distro kernels
> =============================================================================
> (General purpose Linux Server Distro (Redhat, Suse etc))
>
> This is equivalent of RAW command support on CXL Type 3 memory devices.
> You can enable those in a distro kernel build despite the scary config
> help text, but if you use it the kernel is tainted. The result
> of the taint is to add a flag to bug reports and print a big message to say
> that you've used a feature that might result in you shooting yourself
> in the foot.
>
> The taint is there because software is not at first written to deal with
> everything that can happen smoothly (e.g. surprise removal) It's hard
> to survive some of these events, so is never on the initial feature list
> for any bus, so this flag is just to indicate we have entered a world
> where almost all bets are off wrt to stability. We might not know what
> a command does so we can't assess the impact (and no one trusts vendor
> commands to report affects right in the Command Effects Log - which
> in theory tells you if a command can result problems).

That is a secondary reason that the taint is there. Yes, it helps
upstream not waste their time on bug reports from proprietary use cases,
but the effect of that is to make "raw" command mode unattractive for
deploying solutions at scale. It clarifies that this interface is a
debug-tool that enterprise environment need not worry about.

The more salient reason for the taint, speaking only for myself as a
Linux kernel community member not for $employer, is to encourage open
collaboration. Take firmware-update for example that is a standard
command with known side effects that is inaccessible via the ioctl()
path. It is placed behind an ABI that is easier to maintain and reason
about. Everyone has the firmware update tool if they have the 'cat'
command. Distros appreciate the fact that they do not need ship yet
another vendor device-update tool, vendors get free tooling and end
users also appreciate one flow for all devices.

As I alluded here [1], I am not against innovation outside of the
specification, but it needs to be open, and it needs to plausibly become
if not a de jure standard at least a de facto standard.

[1]: https://lore.kernel.org/all/CAPcyv4gDShAYih5iWabKg_eTHhuHm54vEAei8ZkcmHnPp3B0cw@mail.gmail.com/

> A concern was raised about GAE/FAST/LDST tables for CXL Fabrics
> (a r3.1 feature) but, as I understand it, these are intended for a
> host to configure and should not have side effects on other hosts?
> My working assumption is that the kernel driver stack will handle
> these (once we catch up with the current feature backlog!) Currently
> we have no visibility of what the OS driver stack for a fabrics will
> actually look like - the spec is just the starting point for that.
> (patches welcome ;)
>
> The various CXL upstream developers and maintainers may have
> differing views of course, but my current understanding is we want
> to support 1 and 2, but are very resistant to 3!

1, yes, 2, need to see the patches, and agree on 3.

> General Notes
> =============
>
> One side aspect of why we really don't like unfiltered userspace access to any
> of these devices is that people start building non standard hacks in and we
> lose the ecosystem advantages. Forcing a considered discussion + patches
> to let a particular command be supported, drives standardization.

Like I said above, I think this is not a side aspect. It is fundamental
to the viability Linux as a project. This project only works because
organizations with competing goals realize they need some common
infrastructure and that there is little to be gained by competing on the
commons.

> https://lore.kernel.org/linux-cxl/CAPcyv4gDShAYih5iWabKg_eTHhuHm54vEAei8ZkcmHnPp3B0cw@mail.gmail.com/
> provides some history on vendor specific extensions and why in general we
> won't support them upstream.

Oh, you linked my writeup... I will leave the commentary I added here in case
restating it helps.

> To address another question raised in an earlier discussion:
> Putting these Fabric Management interfaces behind guard rails of some type
> (e.g. CONFIG_IM_A_BMC_AND_CAN_MAKE_A_MESS) does not encourage the risk
> of non standard interfaces, because we will be even less likely to accept
> those upstream!
>
> If anyone needs more details on any aspect of this please ask.
> There are a lot of things involved and I've only tried to give a fairly
> minimal illustration to drive the discussion. I may well have missed
> something crucial.

You captured it well, and this is open source so I may have missed
something crucial as well.

2024-04-10 11:46:58

by Jonathan Cameron

[permalink] [raw]
Subject: Re: RFC: Restricting userspace interfaces for CXL fabric management

On Fri, 5 Apr 2024 17:04:34 -0700
Dan Williams <[email protected]> wrote:

Hi Dan,

> Jonathan Cameron wrote:
> > Hi All,
> >
> > This is has come up in a number of discussions both on list and in private,
> > so I wanted to lay out a potential set of rules when deciding whether or not
> > to provide a user space interface for a particular feature of CXL Fabric
> > Management. The intent is to drive discussion, not to simply tell people
> > a set of rules. I've brought this to the public lists as it's a Linux kernel
> > policy discussion, not a standards one.
> >
> > Whilst I'm writing the RFC this my attempt to summarize a possible
> > position rather than necessarily being my personal view.
> >
> > It's a straw man - shoot at it!
> >
> > Not everyone in this discussion is familiar with relevant kernel or CXL concepts
> > so I've provided more info than I normally would.
>
> Thanks for writing this up Jonathan!
>
> [..]
> > 2) Unfiltered userspace use of mailbox for Fabric Management - BMC kernels
> > ==========================================================================
> >
> > (This would just be a kernel option that we'd advise normal server
> > distributions not to turn on. Would be enabled by openBMC etc)
> >
> > This is fine - there is some work to do, but the switch-cci PCI driver
> > will hopefully be ready for upstream merge soon. There is no filtering of
> > accesses. Think of this as similar to all the damage you can do via
> > MCTP from a BMC. Similarly it is likely that much of the complexity
> > of the actual commands will be left to user space tooling:
> > https://gitlab.com/jic23/cxl-fmapi-tests has some test examples.
> >
> > Whether Kconfig help text is strong enough to ensure this only gets
> > enabled for BMC targeted distros is an open question we can address
> > alongside an updated patch set.
>
> It is not clear to me that this material makes sense to house in
> drivers/ vs tools/ or even out-of-tree just for maintenance burden
> relief of keeping the universes separated. What does the Linux kernel
> project get out of carrying this in mainline alongside the inband code?

I'm not sure what you mean by in band. Aim here was to discuss
in-band drivers for switch CCI etc. Same reason from a kernel point of
view for why we include embedded drivers. I'll interpret in band
as host driven and not inband as FM-API stuff.

> I do think the mailbox refactoring to support non-CXL use cases is
> interesting, but only so far as refactoring is consumed for inband use
> cases like RAS API.

If I read this right, I disagree with the 'only so far' bit.

In all substantial ways we should support BMC use case of the Linux Kernel
at a similar level to how we support forms of Linux Distros. It may
not be our target market as developers for particular parts of our companies,
but we should not block those who want to support it.

We should support them in drivers/ - maybe with example userspace code
in tools. Linux distros on BMCs is a big market, there are a number
of different distros using (and in some cases contributing to) the
upstream kernel. Not everyone is using openBMC so there is not one
common place where downstream patches could be carried.
From a personal point of view, I like that for the same reasons that
I like there being multiple Linux sever focused distros. It's a sign
of a healthy ecosystem to have diverse options taking the mainline
kernel as their starting point.

BMCs are just another embedded market, and like other embedded markets
we want to encourage upstream first etc.
openBMC has a policy on this:
https://github.com/openbmc/docs/blob/master/kernel-development.md
"The OpenBMC project maintains a kernel tree for use by the project.
The tree's general development policy is that code must be upstream
first." There are paths to bypass that for openBMC so it's a little
more relaxed than some enterprise distros (today, their policies used
to look very similar to this) but we should not be telling
them they need to carry support downstream. If we are
going to tell them that, we need to be able to point at a major
sticking point for maintenance burden. So far I don't see the
additional complexity as remotely close reaching that bar.

So I think we do want switch-cci support and for that matter the equivalent
for MHDs in the upstream kernel.

One place I think there is some wiggle room is the taint on use of raw
commands. Leaving removal of that for BMC kernels as a patch they need
to carry downstream doesn't seem too burdensome. I'm sure they'll push
back if it is a problem for them! So I think we can kick that question
into the future.

Addressing maintenance burden, there is a question of where we split
the stack. Ignore MHDs for now (I won't go into why in this forum...)

The current proposal is (simplified to ignore some sharing in lookup code etc
that I can rip out if we think it might be a long term problem)

_____________ _____________________
| | | |
| Switch CCI | | Type 3 Driver stack|
|_____________| |_____________________|
|___________________________| Whatever GPU etc
_______|_______ _______|______
| | | |
| CXL MBOX | | RAS API etc |
|_______________| |______________|
|_____________________________|
|
_________|______
| |
| MMPT mbox |
|________________|

Switch CCI Driver: PCI driver doing everything beyond the CXL mbox specific bit.
Type 3 Stack: All the normal stack just with the CXL Mailbox specific stuff factored
out. Note we can move different amounts of shared logic in here, but
in essence it deals with the extra layer on top of the raw MMPT mbox.
MMPT Mbox: Mailbox as per the PCI spec.
RAS API: Shared RAS API specific infrastructure used by other drivers.

If we see a significant maintenance burden, maybe we duplicate the CXL specific
MBOX layer - I can see advantages in that as there is some stuff not relevant
to the Switch CCI. There will be some duplication of logic however such
as background command support (which is CXL only IIUC) We can even use
a difference IOCTL number so the two can diverge if needed in the long run.

e.g. If it makes it easier to get upstream, we can merrily duplicated code
so that only the bit common with RAS API etc is shared (assuming the
actually end up with MMPT, not the CXL mailbox which is what their current
publicly available spec talks about and I assume is a pref MMPT left over?)

_____________ _____________________
| | | |
| Switch CCI | | Type 3 Driver stack|
|_____________| |_____________________|
| | Whatever GPU etc
_______|_______ _______|_______ ______|_______
| | | | | |
| CXL MBOX | | CXL MBOX | | RAS API etc |
|_______________| |_______________| |______________|
|_____________________________|____________________|
|
________|______
| |
| MMPT mbox |
|_______________|


> > (On to the one that the "debate" is about)
> >
> > 3) Unfiltered user space use of mailbox for Fabric Management - Distro kernels
> > =============================================================================
> > (General purpose Linux Server Distro (Redhat, Suse etc))
> >
> > This is equivalent of RAW command support on CXL Type 3 memory devices.
> > You can enable those in a distro kernel build despite the scary config
> > help text, but if you use it the kernel is tainted. The result
> > of the taint is to add a flag to bug reports and print a big message to say
> > that you've used a feature that might result in you shooting yourself
> > in the foot.
> >
> > The taint is there because software is not at first written to deal with
> > everything that can happen smoothly (e.g. surprise removal) It's hard
> > to survive some of these events, so is never on the initial feature list
> > for any bus, so this flag is just to indicate we have entered a world
> > where almost all bets are off wrt to stability. We might not know what
> > a command does so we can't assess the impact (and no one trusts vendor
> > commands to report affects right in the Command Effects Log - which
> > in theory tells you if a command can result problems).
>
> That is a secondary reason that the taint is there. Yes, it helps
> upstream not waste their time on bug reports from proprietary use cases,
> but the effect of that is to make "raw" command mode unattractive for
> deploying solutions at scale. It clarifies that this interface is a
> debug-tool that enterprise environment need not worry about.
>
> The more salient reason for the taint, speaking only for myself as a
> Linux kernel community member not for $employer, is to encourage open
> collaboration. Take firmware-update for example that is a standard
> command with known side effects that is inaccessible via the ioctl()
> path. It is placed behind an ABI that is easier to maintain and reason
> about. Everyone has the firmware update tool if they have the 'cat'
> command. Distros appreciate the fact that they do not need ship yet
> another vendor device-update tool, vendors get free tooling and end
> users also appreciate one flow for all devices.
>
> As I alluded here [1], I am not against innovation outside of the
> specification, but it needs to be open, and it needs to plausibly become
> if not a de jure standard at least a de facto standard.
>
> [1]: https://lore.kernel.org/all/CAPcyv4gDShAYih5iWabKg_eTHhuHm54vEAei8ZkcmHnPp3B0cw@mail.gmail.com/

Agree with all this.

>
> > A concern was raised about GAE/FAST/LDST tables for CXL Fabrics
> > (a r3.1 feature) but, as I understand it, these are intended for a
> > host to configure and should not have side effects on other hosts?
> > My working assumption is that the kernel driver stack will handle
> > these (once we catch up with the current feature backlog!) Currently
> > we have no visibility of what the OS driver stack for a fabrics will
> > actually look like - the spec is just the starting point for that.
> > (patches welcome ;)
> >
> > The various CXL upstream developers and maintainers may have
> > differing views of course, but my current understanding is we want
> > to support 1 and 2, but are very resistant to 3!
>
> 1, yes, 2, need to see the patches, and agree on 3.

If we end up with top architecture of the diagrams above, 2 will look pretty
similar to last version of the switch-cci patches. So raw commands only + taint.
Factoring out MMPT is another layer that doesn't make that much difference in
practice to this discussion. Good to have, but the reuse here would be one layer
above that.

Or we just say go for second proposed architecture and 0 impact on the
CXL specific code, just reuse of the MMPT layer. I'd imagine people will get
grumpy on code duplication (and we'll spend years rejecting patch sets that
try to share the cdoe) but there should be no maintenance burden as
a result.

>
> > General Notes
> > =============
> >
> > One side aspect of why we really don't like unfiltered userspace access to any
> > of these devices is that people start building non standard hacks in and we
> > lose the ecosystem advantages. Forcing a considered discussion + patches
> > to let a particular command be supported, drives standardization.
>
> Like I said above, I think this is not a side aspect. It is fundamental
> to the viability Linux as a project. This project only works because
> organizations with competing goals realize they need some common
> infrastructure and that there is little to be gained by competing on the
> commons.
>
> > https://lore.kernel.org/linux-cxl/CAPcyv4gDShAYih5iWabKg_eTHhuHm54vEAei8ZkcmHnPp3B0cw@mail.gmail.com/
> > provides some history on vendor specific extensions and why in general we
> > won't support them upstream.
>
> Oh, you linked my writeup... I will leave the commentary I added here in case
> restating it helps.
>
> > To address another question raised in an earlier discussion:
> > Putting these Fabric Management interfaces behind guard rails of some type
> > (e.g. CONFIG_IM_A_BMC_AND_CAN_MAKE_A_MESS) does not encourage the risk
> > of non standard interfaces, because we will be even less likely to accept
> > those upstream!
> >
> > If anyone needs more details on any aspect of this please ask.
> > There are a lot of things involved and I've only tried to give a fairly
> > minimal illustration to drive the discussion. I may well have missed
> > something crucial.
>
> You captured it well, and this is open source so I may have missed
> something crucial as well.
>

Thanks for detailed reply!

Jonathan



2024-04-15 20:10:21

by Sreenivas Bagalkote

[permalink] [raw]
Subject: Re: RFC: Restricting userspace interfaces for CXL fabric management

Hello,

>> We need guidance from the community.

>> 1. Datacenter customers must be able to manage PCIe switches in-band.>> 2. Management of switches includes getting health, performance, and error telemetry.>> 3. These telemetry functions are not yet part of the CXL standard>> 4. We built the CCI mailboxes into our PCIe switches per CXL spec and developed our management scheme around them.>> >> If the Linux community does not allow a CXL spec-compliant switch to be>> managed via the CXL spec-defined CCI mailbox, then please guide us on>> the right approach. Please tell us how you propose we manage our switches>> in-band.

I am still looking for your guidance. We need to be able to manage our
switch via the CCI mailbox. We need to use vendor-defined commands per CXL
spec.

You talked about whitelisting commands (allow-list) which we agreed to.
Would you please confirm that you will allow the vendor-defined allow-list
of commands?

Thank you
Sreeni

On Wed, Apr 10, 2024 at 5:45 AM Jonathan Cameron <
[email protected]> wrote:

> On Fri, 5 Apr 2024 17:04:34 -0700
> Dan Williams <[email protected]> wrote:
>
> Hi Dan,
>
> > Jonathan Cameron wrote:
> > > Hi All,
> > >
> > > This is has come up in a number of discussions both on list and in
> private,
> > > so I wanted to lay out a potential set of rules when deciding whether
> or not
> > > to provide a user space interface for a particular feature of CXL
> Fabric
> > > Management. The intent is to drive discussion, not to simply tell
> people
> > > a set of rules. I've brought this to the public lists as it's a Linux
> kernel
> > > policy discussion, not a standards one.
> > >
> > > Whilst I'm writing the RFC this my attempt to summarize a possible
> > > position rather than necessarily being my personal view.
> > >
> > > It's a straw man - shoot at it!
> > >
> > > Not everyone in this discussion is familiar with relevant kernel or
> CXL concepts
> > > so I've provided more info than I normally would.
> >
> > Thanks for writing this up Jonathan!
> >
> > [..]
> > > 2) Unfiltered userspace use of mailbox for Fabric Management - BMC
> kernels
> > >
> ==========================================================================
> > >
> > > (This would just be a kernel option that we'd advise normal server
> > > distributions not to turn on. Would be enabled by openBMC etc)
> > >
> > > This is fine - there is some work to do, but the switch-cci PCI driver
> > > will hopefully be ready for upstream merge soon. There is no filtering
> of
> > > accesses. Think of this as similar to all the damage you can do via
> > > MCTP from a BMC. Similarly it is likely that much of the complexity
> > > of the actual commands will be left to user space tooling:
> > > https://gitlab.com/jic23/cxl-fmapi-tests has some test examples.
> > >
> > > Whether Kconfig help text is strong enough to ensure this only gets
> > > enabled for BMC targeted distros is an open question we can address
> > > alongside an updated patch set.
> >
> > It is not clear to me that this material makes sense to house in
> > drivers/ vs tools/ or even out-of-tree just for maintenance burden
> > relief of keeping the universes separated. What does the Linux kernel
> > project get out of carrying this in mainline alongside the inband code?
>
> I'm not sure what you mean by in band. Aim here was to discuss
> in-band drivers for switch CCI etc. Same reason from a kernel point of
> view for why we include embedded drivers. I'll interpret in band
> as host driven and not inband as FM-API stuff.
>
> > I do think the mailbox refactoring to support non-CXL use cases is
> > interesting, but only so far as refactoring is consumed for inband use
> > cases like RAS API.
>
> If I read this right, I disagree with the 'only so far' bit.
>
> In all substantial ways we should support BMC use case of the Linux Kernel
> at a similar level to how we support forms of Linux Distros. It may
> not be our target market as developers for particular parts of our
> companies,
> but we should not block those who want to support it.
>
> We should support them in drivers/ - maybe with example userspace code
> in tools. Linux distros on BMCs is a big market, there are a number
> of different distros using (and in some cases contributing to) the
> upstream kernel. Not everyone is using openBMC so there is not one
> common place where downstream patches could be carried.
> From a personal point of view, I like that for the same reasons that
> I like there being multiple Linux sever focused distros. It's a sign
> of a healthy ecosystem to have diverse options taking the mainline
> kernel as their starting point.
>
> BMCs are just another embedded market, and like other embedded markets
> we want to encourage upstream first etc.
> openBMC has a policy on this:
> https://github.com/openbmc/docs/blob/master/kernel-development.md
> "The OpenBMC project maintains a kernel tree for use by the project.
> The tree's general development policy is that code must be upstream
> first." There are paths to bypass that for openBMC so it's a little
> more relaxed than some enterprise distros (today, their policies used
> to look very similar to this) but we should not be telling
> them they need to carry support downstream. If we are
> going to tell them that, we need to be able to point at a major
> sticking point for maintenance burden. So far I don't see the
> additional complexity as remotely close reaching that bar.
>
> So I think we do want switch-cci support and for that matter the equivalent
> for MHDs in the upstream kernel.
>
> One place I think there is some wiggle room is the taint on use of raw
> commands. Leaving removal of that for BMC kernels as a patch they need
> to carry downstream doesn't seem too burdensome. I'm sure they'll push
> back if it is a problem for them! So I think we can kick that question
> into the future.
>
> Addressing maintenance burden, there is a question of where we split
> the stack. Ignore MHDs for now (I won't go into why in this forum...)
>
> The current proposal is (simplified to ignore some sharing in lookup code
> etc
> that I can rip out if we think it might be a long term problem)
>
> _____________ _____________________
> | | | |
> | Switch CCI | | Type 3 Driver stack|
> |_____________| |_____________________|
> |___________________________| Whatever GPU etc
> _______|_______ _______|______
> | | | |
> | CXL MBOX | | RAS API etc |
> |_______________| |______________|
> |_____________________________|
> |
> _________|______
> | |
> | MMPT mbox |
> |________________|
>
> Switch CCI Driver: PCI driver doing everything beyond the CXL mbox
> specific bit.
> Type 3 Stack: All the normal stack just with the CXL Mailbox specific
> stuff factored
> out. Note we can move different amounts of shared logic in
> here, but
> in essence it deals with the extra layer on top of the raw
> MMPT mbox.
> MMPT Mbox: Mailbox as per the PCI spec.
> RAS API: Shared RAS API specific infrastructure used by other drivers.
>
> If we see a significant maintenance burden, maybe we duplicate the CXL
> specific
> MBOX layer - I can see advantages in that as there is some stuff not
> relevant
> to the Switch CCI. There will be some duplication of logic however such
> as background command support (which is CXL only IIUC) We can even use
> a difference IOCTL number so the two can diverge if needed in the long run.
>
> e.g. If it makes it easier to get upstream, we can merrily duplicated code
> so that only the bit common with RAS API etc is shared (assuming the
> actually end up with MMPT, not the CXL mailbox which is what their current
> publicly available spec talks about and I assume is a pref MMPT left over?)
>
> _____________ _____________________
> | | | |
> | Switch CCI | | Type 3 Driver stack|
> |_____________| |_____________________|
> | | Whatever GPU etc
> _______|_______ _______|_______ ______|_______
> | | | | | |
> | CXL MBOX | | CXL MBOX | | RAS API etc |
> |_______________| |_______________| |______________|
> |_____________________________|____________________|
> |
> ________|______
> | |
> | MMPT mbox |
> |_______________|
>
>
> > > (On to the one that the "debate" is about)
> > >
> > > 3) Unfiltered user space use of mailbox for Fabric Management - Distro
> kernels
> > >
> =============================================================================
> > > (General purpose Linux Server Distro (Redhat, Suse etc))
> > >
> > > This is equivalent of RAW command support on CXL Type 3 memory devices.
> > > You can enable those in a distro kernel build despite the scary config
> > > help text, but if you use it the kernel is tainted. The result
> > > of the taint is to add a flag to bug reports and print a big message
> to say
> > > that you've used a feature that might result in you shooting yourself
> > > in the foot.
> > >
> > > The taint is there because software is not at first written to deal
> with
> > > everything that can happen smoothly (e.g. surprise removal) It's hard
> > > to survive some of these events, so is never on the initial feature
> list
> > > for any bus, so this flag is just to indicate we have entered a world
> > > where almost all bets are off wrt to stability. We might not know what
> > > a command does so we can't assess the impact (and no one trusts vendor
> > > commands to report affects right in the Command Effects Log - which
> > > in theory tells you if a command can result problems).
> >
> > That is a secondary reason that the taint is there. Yes, it helps
> > upstream not waste their time on bug reports from proprietary use cases,
> > but the effect of that is to make "raw" command mode unattractive for
> > deploying solutions at scale. It clarifies that this interface is a
> > debug-tool that enterprise environment need not worry about.
> >
> > The more salient reason for the taint, speaking only for myself as a
> > Linux kernel community member not for $employer, is to encourage open
> > collaboration. Take firmware-update for example that is a standard
> > command with known side effects that is inaccessible via the ioctl()
> > path. It is placed behind an ABI that is easier to maintain and reason
> > about. Everyone has the firmware update tool if they have the 'cat'
> > command. Distros appreciate the fact that they do not need ship yet
> > another vendor device-update tool, vendors get free tooling and end
> > users also appreciate one flow for all devices.
> >
> > As I alluded here [1], I am not against innovation outside of the
> > specification, but it needs to be open, and it needs to plausibly become
> > if not a de jure standard at least a de facto standard.
> >
> > [1]:
> https://lore.kernel.org/all/CAPcyv4gDShAYih5iWabKg_eTHhuHm54vEAei8ZkcmHnPp3B0cw@mail.gmail.com/
>
> Agree with all this.
>
> >
> > > A concern was raised about GAE/FAST/LDST tables for CXL Fabrics
> > > (a r3.1 feature) but, as I understand it, these are intended for a
> > > host to configure and should not have side effects on other hosts?
> > > My working assumption is that the kernel driver stack will handle
> > > these (once we catch up with the current feature backlog!) Currently
> > > we have no visibility of what the OS driver stack for a fabrics will
> > > actually look like - the spec is just the starting point for that.
> > > (patches welcome ;)
> > >
> > > The various CXL upstream developers and maintainers may have
> > > differing views of course, but my current understanding is we want
> > > to support 1 and 2, but are very resistant to 3!
> >
> > 1, yes, 2, need to see the patches, and agree on 3.
>
> If we end up with top architecture of the diagrams above, 2 will look
> pretty
> similar to last version of the switch-cci patches. So raw commands only +
> taint.
> Factoring out MMPT is another layer that doesn't make that much difference
> in
> practice to this discussion. Good to have, but the reuse here would be one
> layer
> above that.
>
> Or we just say go for second proposed architecture and 0 impact on the
> CXL specific code, just reuse of the MMPT layer. I'd imagine people will
> get
> grumpy on code duplication (and we'll spend years rejecting patch sets that
> try to share the cdoe) but there should be no maintenance burden as
> a result.
>
> >
> > > General Notes
> > > =============
> > >
> > > One side aspect of why we really don't like unfiltered userspace
> access to any
> > > of these devices is that people start building non standard hacks in
> and we
> > > lose the ecosystem advantages. Forcing a considered discussion +
> patches
> > > to let a particular command be supported, drives standardization.
> >
> > Like I said above, I think this is not a side aspect. It is fundamental
> > to the viability Linux as a project. This project only works because
> > organizations with competing goals realize they need some common
> > infrastructure and that there is little to be gained by competing on the
> > commons.
> >
> > >
> https://lore.kernel.org/linux-cxl/CAPcyv4gDShAYih5iWabKg_eTHhuHm54vEAei8ZkcmHnPp3B0cw@mail.gmail.com/
> > > provides some history on vendor specific extensions and why in general
> we
> > > won't support them upstream.
> >
> > Oh, you linked my writeup... I will leave the commentary I added here in
> case
> > restating it helps.
> >
> > > To address another question raised in an earlier discussion:
> > > Putting these Fabric Management interfaces behind guard rails of some
> type
> > > (e.g. CONFIG_IM_A_BMC_AND_CAN_MAKE_A_MESS) does not encourage the risk
> > > of non standard interfaces, because we will be even less likely to
> accept
> > > those upstream!
> > >
> > > If anyone needs more details on any aspect of this please ask.
> > > There are a lot of things involved and I've only tried to give a fairly
> > > minimal illustration to drive the discussion. I may well have missed
> > > something crucial.
> >
> > You captured it well, and this is open source so I may have missed
> > something crucial as well.
> >
>
> Thanks for detailed reply!
>
> Jonathan
>
>
>

--
This electronic communication and the information and any files transmitted
with it, or attached to it, are confidential and are intended solely for
the use of the individual or entity to whom it is addressed and may contain
information that is confidential, legally privileged, protected by privacy
laws, or otherwise restricted from disclosure to anyone else. If you are
not the intended recipient or the person responsible for delivering the
e-mail to the intended recipient, you are hereby notified that any use,
copying, distributing, dissemination, forwarding, printing, or copying of
this e-mail is strictly prohibited. If you received this e-mail in error,
please return the e-mail to the sender, delete it from your computer, and
destroy any printed copy of it.


Attachments:
smime.p7s (4.13 kB)
S/MIME Cryptographic Signature

2024-04-23 22:45:19

by Sreenivas Bagalkote

[permalink] [raw]
Subject: Re: RFC: Restricting userspace interfaces for CXL fabric management

Can somebody please at least acknowledge that you are getting my emails?

Thank you
Sreeni
Sreenivas Bagalkote <[email protected]>
Product Planning & Management
Broadcom Datacenter Solutions Group

On Mon, Apr 15, 2024 at 2:09 PM Sreenivas Bagalkote <
[email protected]> wrote:

> Hello,
>
> >> We need guidance from the community.
>
> >> 1. Datacenter customers must be able to manage PCIe switches in-band.>> 2. Management of switches includes getting health, performance, and error telemetry.>> 3. These telemetry functions are not yet part of the CXL standard>> 4. We built the CCI mailboxes into our PCIe switches per CXL spec and developed our management scheme around them.>> >> If the Linux community does not allow a CXL spec-compliant switch to be>> managed via the CXL spec-defined CCI mailbox, then please guide us on>> the right approach. Please tell us how you propose we manage our switches>> in-band.
>
> I am still looking for your guidance. We need to be able to manage our
> switch via the CCI mailbox. We need to use vendor-defined commands per CXL
> spec.
>
> You talked about whitelisting commands (allow-list) which we agreed to.
> Would you please confirm that you will allow the vendor-defined allow-list
> of commands?
>
> Thank you
> Sreeni
>
> On Wed, Apr 10, 2024 at 5:45 AM Jonathan Cameron <
> [email protected]> wrote:
>
>> On Fri, 5 Apr 2024 17:04:34 -0700
>> Dan Williams <[email protected]> wrote:
>>
>> Hi Dan,
>>
>> > Jonathan Cameron wrote:
>> > > Hi All,
>> > >
>> > > This is has come up in a number of discussions both on list and in
>> private,
>> > > so I wanted to lay out a potential set of rules when deciding whether
>> or not
>> > > to provide a user space interface for a particular feature of CXL
>> Fabric
>> > > Management. The intent is to drive discussion, not to simply tell
>> people
>> > > a set of rules. I've brought this to the public lists as it's a
>> Linux kernel
>> > > policy discussion, not a standards one.
>> > >
>> > > Whilst I'm writing the RFC this my attempt to summarize a possible
>> > > position rather than necessarily being my personal view.
>> > >
>> > > It's a straw man - shoot at it!
>> > >
>> > > Not everyone in this discussion is familiar with relevant kernel or
>> CXL concepts
>> > > so I've provided more info than I normally would.
>> >
>> > Thanks for writing this up Jonathan!
>> >
>> > [..]
>> > > 2) Unfiltered userspace use of mailbox for Fabric Management - BMC
>> kernels
>> > >
>> ==========================================================================
>> > >
>> > > (This would just be a kernel option that we'd advise normal server
>> > > distributions not to turn on. Would be enabled by openBMC etc)
>> > >
>> > > This is fine - there is some work to do, but the switch-cci PCI driver
>> > > will hopefully be ready for upstream merge soon. There is no
>> filtering of
>> > > accesses. Think of this as similar to all the damage you can do via
>> > > MCTP from a BMC. Similarly it is likely that much of the complexity
>> > > of the actual commands will be left to user space tooling:
>> > > https://gitlab.com/jic23/cxl-fmapi-tests has some test examples.
>> > >
>> > > Whether Kconfig help text is strong enough to ensure this only gets
>> > > enabled for BMC targeted distros is an open question we can address
>> > > alongside an updated patch set.
>> >
>> > It is not clear to me that this material makes sense to house in
>> > drivers/ vs tools/ or even out-of-tree just for maintenance burden
>> > relief of keeping the universes separated. What does the Linux kernel
>> > project get out of carrying this in mainline alongside the inband code?
>>
>> I'm not sure what you mean by in band. Aim here was to discuss
>> in-band drivers for switch CCI etc. Same reason from a kernel point of
>> view for why we include embedded drivers. I'll interpret in band
>> as host driven and not inband as FM-API stuff.
>>
>> > I do think the mailbox refactoring to support non-CXL use cases is
>> > interesting, but only so far as refactoring is consumed for inband use
>> > cases like RAS API.
>>
>> If I read this right, I disagree with the 'only so far' bit.
>>
>> In all substantial ways we should support BMC use case of the Linux Kernel
>> at a similar level to how we support forms of Linux Distros. It may
>> not be our target market as developers for particular parts of our
>> companies,
>> but we should not block those who want to support it.
>>
>> We should support them in drivers/ - maybe with example userspace code
>> in tools. Linux distros on BMCs is a big market, there are a number
>> of different distros using (and in some cases contributing to) the
>> upstream kernel. Not everyone is using openBMC so there is not one
>> common place where downstream patches could be carried.
>> From a personal point of view, I like that for the same reasons that
>> I like there being multiple Linux sever focused distros. It's a sign
>> of a healthy ecosystem to have diverse options taking the mainline
>> kernel as their starting point.
>>
>> BMCs are just another embedded market, and like other embedded markets
>> we want to encourage upstream first etc.
>> openBMC has a policy on this:
>> https://github.com/openbmc/docs/blob/master/kernel-development.md
>> "The OpenBMC project maintains a kernel tree for use by the project.
>> The tree's general development policy is that code must be upstream
>> first." There are paths to bypass that for openBMC so it's a little
>> more relaxed than some enterprise distros (today, their policies used
>> to look very similar to this) but we should not be telling
>> them they need to carry support downstream. If we are
>> going to tell them that, we need to be able to point at a major
>> sticking point for maintenance burden. So far I don't see the
>> additional complexity as remotely close reaching that bar.
>>
>> So I think we do want switch-cci support and for that matter the
>> equivalent
>> for MHDs in the upstream kernel.
>>
>> One place I think there is some wiggle room is the taint on use of raw
>> commands. Leaving removal of that for BMC kernels as a patch they need
>> to carry downstream doesn't seem too burdensome. I'm sure they'll push
>> back if it is a problem for them! So I think we can kick that question
>> into the future.
>>
>> Addressing maintenance burden, there is a question of where we split
>> the stack. Ignore MHDs for now (I won't go into why in this forum...)
>>
>> The current proposal is (simplified to ignore some sharing in lookup code
>> etc
>> that I can rip out if we think it might be a long term problem)
>>
>> _____________ _____________________
>> | | | |
>> | Switch CCI | | Type 3 Driver stack|
>> |_____________| |_____________________|
>> |___________________________| Whatever GPU etc
>> _______|_______ _______|______
>> | | | |
>> | CXL MBOX | | RAS API etc |
>> |_______________| |______________|
>> |_____________________________|
>> |
>> _________|______
>> | |
>> | MMPT mbox |
>> |________________|
>>
>> Switch CCI Driver: PCI driver doing everything beyond the CXL mbox
>> specific bit.
>> Type 3 Stack: All the normal stack just with the CXL Mailbox specific
>> stuff factored
>> out. Note we can move different amounts of shared logic in
>> here, but
>> in essence it deals with the extra layer on top of the raw
>> MMPT mbox.
>> MMPT Mbox: Mailbox as per the PCI spec.
>> RAS API: Shared RAS API specific infrastructure used by other drivers.
>>
>> If we see a significant maintenance burden, maybe we duplicate the CXL
>> specific
>> MBOX layer - I can see advantages in that as there is some stuff not
>> relevant
>> to the Switch CCI. There will be some duplication of logic however such
>> as background command support (which is CXL only IIUC) We can even use
>> a difference IOCTL number so the two can diverge if needed in the long
>> run.
>>
>> e.g. If it makes it easier to get upstream, we can merrily duplicated code
>> so that only the bit common with RAS API etc is shared (assuming the
>> actually end up with MMPT, not the CXL mailbox which is what their current
>> publicly available spec talks about and I assume is a pref MMPT left
>> over?)
>>
>> _____________ _____________________
>> | | | |
>> | Switch CCI | | Type 3 Driver stack|
>> |_____________| |_____________________|
>> | | Whatever GPU etc
>> _______|_______ _______|_______ ______|_______
>> | | | | | |
>> | CXL MBOX | | CXL MBOX | | RAS API etc |
>> |_______________| |_______________| |______________|
>> |_____________________________|____________________|
>> |
>> ________|______
>> | |
>> | MMPT mbox |
>> |_______________|
>>
>>
>> > > (On to the one that the "debate" is about)
>> > >
>> > > 3) Unfiltered user space use of mailbox for Fabric Management -
>> Distro kernels
>> > >
>> =============================================================================
>> > > (General purpose Linux Server Distro (Redhat, Suse etc))
>> > >
>> > > This is equivalent of RAW command support on CXL Type 3 memory
>> devices.
>> > > You can enable those in a distro kernel build despite the scary config
>> > > help text, but if you use it the kernel is tainted. The result
>> > > of the taint is to add a flag to bug reports and print a big message
>> to say
>> > > that you've used a feature that might result in you shooting yourself
>> > > in the foot.
>> > >
>> > > The taint is there because software is not at first written to deal
>> with
>> > > everything that can happen smoothly (e.g. surprise removal) It's hard
>> > > to survive some of these events, so is never on the initial feature
>> list
>> > > for any bus, so this flag is just to indicate we have entered a world
>> > > where almost all bets are off wrt to stability. We might not know
>> what
>> > > a command does so we can't assess the impact (and no one trusts vendor
>> > > commands to report affects right in the Command Effects Log - which
>> > > in theory tells you if a command can result problems).
>> >
>> > That is a secondary reason that the taint is there. Yes, it helps
>> > upstream not waste their time on bug reports from proprietary use cases,
>> > but the effect of that is to make "raw" command mode unattractive for
>> > deploying solutions at scale. It clarifies that this interface is a
>> > debug-tool that enterprise environment need not worry about.
>> >
>> > The more salient reason for the taint, speaking only for myself as a
>> > Linux kernel community member not for $employer, is to encourage open
>> > collaboration. Take firmware-update for example that is a standard
>> > command with known side effects that is inaccessible via the ioctl()
>> > path. It is placed behind an ABI that is easier to maintain and reason
>> > about. Everyone has the firmware update tool if they have the 'cat'
>> > command. Distros appreciate the fact that they do not need ship yet
>> > another vendor device-update tool, vendors get free tooling and end
>> > users also appreciate one flow for all devices.
>> >
>> > As I alluded here [1], I am not against innovation outside of the
>> > specification, but it needs to be open, and it needs to plausibly become
>> > if not a de jure standard at least a de facto standard.
>> >
>> > [1]:
>> https://lore.kernel.org/all/CAPcyv4gDShAYih5iWabKg_eTHhuHm54vEAei8ZkcmHnPp3B0cw@mail.gmail.com/
>>
>> Agree with all this.
>>
>> >
>> > > A concern was raised about GAE/FAST/LDST tables for CXL Fabrics
>> > > (a r3.1 feature) but, as I understand it, these are intended for a
>> > > host to configure and should not have side effects on other hosts?
>> > > My working assumption is that the kernel driver stack will handle
>> > > these (once we catch up with the current feature backlog!) Currently
>> > > we have no visibility of what the OS driver stack for a fabrics will
>> > > actually look like - the spec is just the starting point for that.
>> > > (patches welcome ;)
>> > >
>> > > The various CXL upstream developers and maintainers may have
>> > > differing views of course, but my current understanding is we want
>> > > to support 1 and 2, but are very resistant to 3!
>> >
>> > 1, yes, 2, need to see the patches, and agree on 3.
>>
>> If we end up with top architecture of the diagrams above, 2 will look
>> pretty
>> similar to last version of the switch-cci patches. So raw commands only
>> + taint.
>> Factoring out MMPT is another layer that doesn't make that much
>> difference in
>> practice to this discussion. Good to have, but the reuse here would be
>> one layer
>> above that.
>>
>> Or we just say go for second proposed architecture and 0 impact on the
>> CXL specific code, just reuse of the MMPT layer. I'd imagine people will
>> get
>> grumpy on code duplication (and we'll spend years rejecting patch sets
>> that
>> try to share the cdoe) but there should be no maintenance burden as
>> a result.
>>
>> >
>> > > General Notes
>> > > =============
>> > >
>> > > One side aspect of why we really don't like unfiltered userspace
>> access to any
>> > > of these devices is that people start building non standard hacks in
>> and we
>> > > lose the ecosystem advantages. Forcing a considered discussion +
>> patches
>> > > to let a particular command be supported, drives standardization.
>> >
>> > Like I said above, I think this is not a side aspect. It is fundamental
>> > to the viability Linux as a project. This project only works because
>> > organizations with competing goals realize they need some common
>> > infrastructure and that there is little to be gained by competing on the
>> > commons.
>> >
>> > >
>> https://lore.kernel.org/linux-cxl/CAPcyv4gDShAYih5iWabKg_eTHhuHm54vEAei8ZkcmHnPp3B0cw@mail.gmail.com/
>> > > provides some history on vendor specific extensions and why in
>> general we
>> > > won't support them upstream.
>> >
>> > Oh, you linked my writeup... I will leave the commentary I added here
>> in case
>> > restating it helps.
>> >
>> > > To address another question raised in an earlier discussion:
>> > > Putting these Fabric Management interfaces behind guard rails of some
>> type
>> > > (e.g. CONFIG_IM_A_BMC_AND_CAN_MAKE_A_MESS) does not encourage the risk
>> > > of non standard interfaces, because we will be even less likely to
>> accept
>> > > those upstream!
>> > >
>> > > If anyone needs more details on any aspect of this please ask.
>> > > There are a lot of things involved and I've only tried to give a
>> fairly
>> > > minimal illustration to drive the discussion. I may well have missed
>> > > something crucial.
>> >
>> > You captured it well, and this is open source so I may have missed
>> > something crucial as well.
>> >
>>
>> Thanks for detailed reply!
>>
>> Jonathan
>>
>>
>>

--
This electronic communication and the information and any files transmitted
with it, or attached to it, are confidential and are intended solely for
the use of the individual or entity to whom it is addressed and may contain
information that is confidential, legally privileged, protected by privacy
laws, or otherwise restricted from disclosure to anyone else. If you are
not the intended recipient or the person responsible for delivering the
e-mail to the intended recipient, you are hereby notified that any use,
copying, distributing, dissemination, forwarding, printing, or copying of
this e-mail is strictly prohibited. If you received this e-mail in error,
please return the e-mail to the sender, delete it from your computer, and
destroy any printed copy of it.


Attachments:
smime.p7s (4.13 kB)
S/MIME Cryptographic Signature

2024-04-23 23:24:59

by Greg KH

[permalink] [raw]
Subject: Re: RFC: Restricting userspace interfaces for CXL fabric management

On Tue, Apr 23, 2024 at 04:44:37PM -0600, Sreenivas Bagalkote wrote:
> This electronic communication and the information and any files transmitted
> with it, or attached to it, are confidential and are intended solely for
> the use of the individual or entity to whom it is addressed and may contain
> information that is confidential, legally privileged, protected by privacy
> laws, or otherwise restricted from disclosure to anyone else. If you are
> not the intended recipient or the person responsible for delivering the
> e-mail to the intended recipient, you are hereby notified that any use,
> copying, distributing, dissemination, forwarding, printing, or copying of
> this e-mail is strictly prohibited. If you received this e-mail in error,
> please return the e-mail to the sender, delete it from your computer, and
> destroy any printed copy of it.

Now deleted.


2024-04-24 00:08:27

by Dan Williams

[permalink] [raw]
Subject: Re: RFC: Restricting userspace interfaces for CXL fabric management

Jonathan Cameron wrote:
[..]
> > It is not clear to me that this material makes sense to house in
> > drivers/ vs tools/ or even out-of-tree just for maintenance burden
> > relief of keeping the universes separated. What does the Linux kernel
> > project get out of carrying this in mainline alongside the inband code?
>
> I'm not sure what you mean by in band. Aim here was to discuss
> in-band drivers for switch CCI etc. Same reason from a kernel point of
> view for why we include embedded drivers. I'll interpret in band
> as host driven and not inband as FM-API stuff.
>
> > I do think the mailbox refactoring to support non-CXL use cases is
> > interesting, but only so far as refactoring is consumed for inband use
> > cases like RAS API.
>
> If I read this right, I disagree with the 'only so far' bit.
>
> In all substantial ways we should support BMC use case of the Linux Kernel
> at a similar level to how we support forms of Linux Distros.

I think we need to talk in terms of specifics, because in the general
case I do not see the blockage. OpenBMC currently is based on v6.6.28
and carries 136 patches. An additional patch to turn off raw commands
restrictions over there would not even be noticed.

> It may not be our target market as developers for particular parts of
> our companies, but we should not block those who want to support it.

It is also the case that there is a responsibility to build maintainable
kernel interfaces that can be reasoned about, especially with devices as
powerful as CXL that are trusted to host system memory and be caching
agents. For example, I do not want to be in the position of auditing
whether proposed tunnels and passthroughs violate lockdown expectations.

Also, the assertion that these kernels will be built with
CONFIG_SECURITY_LOCKDOWN_LSM=n and likely CONFIG_STRICT_DEVMEM=n, then
the entire user-mode driver ABI is available for use. CXL commands are
simple polled mmio, does Linux really benefit from carrying drivers in
the kernel that the kernel itself does not care about?

[..]
> Switch CCI Driver: PCI driver doing everything beyond the CXL mbox specific bit.
> Type 3 Stack: All the normal stack just with the CXL Mailbox specific stuff factored
> out. Note we can move different amounts of shared logic in here, but
> in essence it deals with the extra layer on top of the raw MMPT mbox.
> MMPT Mbox: Mailbox as per the PCI spec.
> RAS API: Shared RAS API specific infrastructure used by other drivers.

Once the CXL mailbox core is turned into a library for kernel internal
consumers, like RAS API, or CXL accelerators, then it becomes easier to
add a Switch CCI consumer (perhaps as an out-of-tree module in tools/),
but it is still not clear why the kernel benefits from that arrangement.

This is less about blocking developers that have different goals it is
about finding the right projects / places to solve the problem
especially when disjoint design goals are in play and user space drivers
might be in reach.

[..]
> > > The various CXL upstream developers and maintainers may have
> > > differing views of course, but my current understanding is we want
> > > to support 1 and 2, but are very resistant to 3!
> >
> > 1, yes, 2, need to see the patches, and agree on 3.
>
> If we end up with top architecture of the diagrams above, 2 will look pretty
> similar to last version of the switch-cci patches. So raw commands only + taint.
> Factoring out MMPT is another layer that doesn't make that much difference in
> practice to this discussion. Good to have, but the reuse here would be one layer
> above that.
>
> Or we just say go for second proposed architecture and 0 impact on the
> CXL specific code, just reuse of the MMPT layer. I'd imagine people will get
> grumpy on code duplication (and we'll spend years rejecting patch sets that
> try to share the cdoe) but there should be no maintenance burden as
> a result.

I am assuming that the shared code between MMPT and CXL will happen and
that all of the command infrastructure is where centralized policy can
not keep up. If OpenBMC wants to land a driver that consumes the MMPT
core in tools/ that would seem to satisfy both the concerns of mainline
not shipping ABI that host kernels need to strictly reason about while
letting OpenBMC not need to carry out-of-tree patches indefinitely.

2024-04-25 11:34:04

by Jonathan Cameron

[permalink] [raw]
Subject: Re: RFC: Restricting userspace interfaces for CXL fabric management

On Tue, 23 Apr 2024 17:07:58 -0700
Dan Williams <[email protected]> wrote:

> Jonathan Cameron wrote:
> [..]
> > > It is not clear to me that this material makes sense to house in
> > > drivers/ vs tools/ or even out-of-tree just for maintenance burden
> > > relief of keeping the universes separated. What does the Linux kernel
> > > project get out of carrying this in mainline alongside the inband code?
> >
> > I'm not sure what you mean by in band. Aim here was to discuss
> > in-band drivers for switch CCI etc. Same reason from a kernel point of
> > view for why we include embedded drivers. I'll interpret in band
> > as host driven and not inband as FM-API stuff.
> >
> > > I do think the mailbox refactoring to support non-CXL use cases is
> > > interesting, but only so far as refactoring is consumed for inband use
> > > cases like RAS API.
> >
> > If I read this right, I disagree with the 'only so far' bit.
> >
> > In all substantial ways we should support BMC use case of the Linux Kernel
> > at a similar level to how we support forms of Linux Distros.
>
> I think we need to talk in terms of specifics, because in the general
> case I do not see the blockage. OpenBMC currently is based on v6.6.28
> and carries 136 patches. An additional patch to turn off raw commands
> restrictions over there would not even be noticed.

Hi Dan,

That I'm fine with - it's a reasonable middle ground where we ensure
they have a sensible upstream solution, but just patch around the
taint etc in the downstream projects.

Note 136 patches is tiny for a distro and reflects their hard work
upstreaming stuff.

>
> > It may not be our target market as developers for particular parts of
> > our companies, but we should not block those who want to support it.
>
> It is also the case that there is a responsibility to build maintainable
> kernel interfaces that can be reasoned about, especially with devices as
> powerful as CXL that are trusted to host system memory and be caching
> agents. For example, I do not want to be in the position of auditing
> whether proposed tunnels and passthroughs violate lockdown expectations.

I agree with that - this can be made dependent on not locking down
in the same way lots of other somewhat dangerous interfaces are.
We can relax that restriction as things do get audited - sure
tunnels aren't going to be on that allowed list in the short term.

>
> Also, the assertion that these kernels will be built with
> CONFIG_SECURITY_LOCKDOWN_LSM=n and likely CONFIG_STRICT_DEVMEM=n, then
> the entire user-mode driver ABI is available for use. CXL commands are
> simple polled mmio, does Linux really benefit from carrying drivers in
> the kernel that the kernel itself does not care about?

Sure we could it in userspace... It's bad engineering, limits the design
to polling only and uses a bunch of interfaces we put a lot of effort into
telling people not to use except for debug.

I really don't see the advantage in pushing a project/group of projects
all of which are picking the upstream kernel up directly, to do a dirty
hack. We loose all the advantages of a proper well maintained kernel
driver purely on the argument that one use model is not the same as
this one. Sensible security lockdown requirements is fine (along
with all the other kernel features that must be disable for that
to work), making open kernel development on for a large Linux
market harder is not.

>
> [..]
> > Switch CCI Driver: PCI driver doing everything beyond the CXL mbox specific bit.
> > Type 3 Stack: All the normal stack just with the CXL Mailbox specific stuff factored
> > out. Note we can move different amounts of shared logic in here, but
> > in essence it deals with the extra layer on top of the raw MMPT mbox.
> > MMPT Mbox: Mailbox as per the PCI spec.
> > RAS API: Shared RAS API specific infrastructure used by other drivers.
>
> Once the CXL mailbox core is turned into a library for kernel internal
> consumers, like RAS API, or CXL accelerators, then it becomes easier to
> add a Switch CCI consumer (perhaps as an out-of-tree module in tools/),
> but it is still not clear why the kernel benefits from that arrangement.

We can argue later on this. But from my point of view, in tree module not
in tools is a must. Doesn't have to be in drivers/cxl if its simply the
association aspect that is a blocker.

>
> This is less about blocking developers that have different goals it is
> about finding the right projects / places to solve the problem
> especially when disjoint design goals are in play and user space drivers
> might be in reach.

Key here is this is not a case of openBMC is the one true distro on
which all Linux BMCs and fabric management platforms are based.
So we are really talking a random out of tree driver with all the maintenance
overhead that brings. Yuck.

So I don't see there being any good solution out side of upstream support
other than pushing this to be a userspace hack.

>
> [..]
> > > > The various CXL upstream developers and maintainers may have
> > > > differing views of course, but my current understanding is we want
> > > > to support 1 and 2, but are very resistant to 3!
> > >
> > > 1, yes, 2, need to see the patches, and agree on 3.
> >
> > If we end up with top architecture of the diagrams above, 2 will look pretty
> > similar to last version of the switch-cci patches. So raw commands only + taint.
> > Factoring out MMPT is another layer that doesn't make that much difference in
> > practice to this discussion. Good to have, but the reuse here would be one layer
> > above that.
> >
> > Or we just say go for second proposed architecture and 0 impact on the
> > CXL specific code, just reuse of the MMPT layer. I'd imagine people will get
> > grumpy on code duplication (and we'll spend years rejecting patch sets that
> > try to share the cdoe) but there should be no maintenance burden as
> > a result.
>
> I am assuming that the shared code between MMPT and CXL will happen and
> that all of the command infrastructure is where centralized policy can
> not keep up.

There is actually very little to MMPT, but sure there will be some sharing
of code and the policy won't sit in that shared part as it is protocol
specific.

> If OpenBMC wants to land a driver that consumes the MMPT
> core in tools/ that would seem to satisfy both the concerns of mainline
> not shipping ABI that host kernels need to strictly reason about while
> letting OpenBMC not need to carry out-of-tree patches indefinitely.

We can argue that detail later, but tools is not in my opinion
a valid solution to supporting properly maintained upstream drivers.
Its a hack for test and example modules only. The path of just
blocking this driver in any locked down situation seems much more inline
with kernel norms. It is also extremely bad precedence.

Jonathan

p.s. I don't care in the slightest about openBMC (other than general
warm fuzzy feelings about a good open source project), I do care
rather more about BMCs and other fabric managers.





2024-04-25 16:19:12

by Dan Williams

[permalink] [raw]
Subject: Re: RFC: Restricting userspace interfaces for CXL fabric management

Jonathan Cameron wrote:
[..]
> > Also, the assertion that these kernels will be built with
> > CONFIG_SECURITY_LOCKDOWN_LSM=n and likely CONFIG_STRICT_DEVMEM=n, then
> > the entire user-mode driver ABI is available for use. CXL commands are
> > simple polled mmio, does Linux really benefit from carrying drivers in
> > the kernel that the kernel itself does not care about?
>
> Sure we could it in userspace... It's bad engineering, limits the design
> to polling only and uses a bunch of interfaces we put a lot of effort into
> telling people not to use except for debug.
>
> I really don't see the advantage in pushing a project/group of projects
> all of which are picking the upstream kernel up directly, to do a dirty
> hack. We loose all the advantages of a proper well maintained kernel
> driver purely on the argument that one use model is not the same as
> this one. Sensible security lockdown requirements is fine (along
> with all the other kernel features that must be disable for that
> to work), making open kernel development on for a large Linux
> market harder is not.

The minimum requirement for justifying an in kernel driver is that
something else in the kernel consumes that facility. So, again, I want
to get back to specifics what else in the kernel is going to leverage
the Switch CCI mailbox?

The generic-Type-3-device mailbox has an in kernel driver because the
kernel has need to send mailbox commands internally and it is
fundamental to RAS and provisioning flows that the kernel have this
coordination. What are the motivations for an in-band Switch CCI command
submission path?

It could be the case that you have a self-evident example in mind that I
have thus far failed to realize.

2024-04-25 17:31:58

by Jonathan Cameron

[permalink] [raw]
Subject: Re: RFC: Restricting userspace interfaces for CXL fabric management

On Thu, 25 Apr 2024 09:18:53 -0700
Dan Williams <[email protected]> wrote:

> Jonathan Cameron wrote:
> [..]
> > > Also, the assertion that these kernels will be built with
> > > CONFIG_SECURITY_LOCKDOWN_LSM=n and likely CONFIG_STRICT_DEVMEM=n, then
> > > the entire user-mode driver ABI is available for use. CXL commands are
> > > simple polled mmio, does Linux really benefit from carrying drivers in
> > > the kernel that the kernel itself does not care about?
> >
> > Sure we could it in userspace... It's bad engineering, limits the design
> > to polling only and uses a bunch of interfaces we put a lot of effort into
> > telling people not to use except for debug.
> >
> > I really don't see the advantage in pushing a project/group of projects
> > all of which are picking the upstream kernel up directly, to do a dirty
> > hack. We loose all the advantages of a proper well maintained kernel
> > driver purely on the argument that one use model is not the same as
> > this one. Sensible security lockdown requirements is fine (along
> > with all the other kernel features that must be disable for that
> > to work), making open kernel development on for a large Linux
> > market harder is not.
>
> The minimum requirement for justifying an in kernel driver is that
> something else in the kernel consumes that facility. So, again, I want
> to get back to specifics what else in the kernel is going to leverage
> the Switch CCI mailbox?

Why? I've never heard of such as a requirement and numerous drivers
provide fairly direct access to hardware. Sometimes there is a subsystem
aiding the data formatting etc, but fundamentally that's a convenience.

Taking this to a silly level, on this basis all networking drivers would
not be in the kernel. They are there mainly to provide userspace access to
a network. Any of the hardware access subsystems such hwmon, input, IIO
etc are primarily about providing a convenient way to get data to/from
a device. They are kernel drivers because that is the cleaner path
for data marshaling, interrupt handling etc.

In kernel users are a perfectly valid reason to have a kernel driver,
but it's far from the only one. None of the AI accelerators have in kernel
users today (maybe they will in future). Sure there are other arguments
that mean only a few such devices have been upstreamed, but it's not
that they need in kernel users. If it's really an issue I'll just submit
it to driver/misc and Greg can take a view on whether it's an acceptable
device to have driver for... (after he's asked the obvious question of
why aren't the CXL folk taking it!) +cc Greg to save providing info later.

For background this is a PCI function with a mailbox used for switch
configuration. The mailbox is identical to the one found on CXL type3
devices. Whole thing defined in the CXL spec. It gets a little complex
because you can tunnel commands to devices connected to the switch,
potentially affecting other hosts. Typical Linux device doing this
would be a BMC, but there have been repeated questions about providing
a subset of access to any Linux system (avoiding the foot guns)
Whole thing fully discoverable - proposal is a standard PCI driver.

>
> The generic-Type-3-device mailbox has an in kernel driver because the
> kernel has need to send mailbox commands internally and it is
> fundamental to RAS and provisioning flows that the kernel have this
> coordination. What are the motivations for an in-band Switch CCI command
> submission path?
>
> It could be the case that you have a self-evident example in mind that I
> have thus far failed to realize.
>

There are possibilities, but for now it's a transport driver just like
MCTP etc with a well defined chardev interface, with documented ioctl
interface etc (which I'd keep inline with one the CXL mailbox uses
just to avoid reinventing the wheel - I'd prefer to use that directly
to avoid divergence but I don't care that much).

As far as I can see, with the security / blast radius concern alleviated
by disabling this if lockdown is in use + taint for unaudited commands
(and a nasty sounding config similar to the cxl mailbox one)
there is little reason not to take such a driver into the kernel.
It has next to no maintenance impact outside of itself and a bit of
library code which I've proposed pushing down to the level of MMPT
(so PCI not CXL) if you think that is necessary.

We want interrupt handling and basic access controls / command
interface to userspace.

Apologies if I'm grumpy - several long days of battling cpu hotplug code.

Jonathan



2024-04-25 19:25:55

by Dan Williams

[permalink] [raw]
Subject: Re: RFC: Restricting userspace interfaces for CXL fabric management

Jonathan Cameron wrote:
> Dan Williams <[email protected]> wrote:
> > The minimum requirement for justifying an in kernel driver is that
> > something else in the kernel consumes that facility. So, again, I want
> > to get back to specifics what else in the kernel is going to leverage
> > the Switch CCI mailbox?
>
> Why? I've never heard of such as a requirement and numerous drivers
> provide fairly direct access to hardware. Sometimes there is a subsystem
> aiding the data formatting etc, but fundamentally that's a convenience.
>
> Taking this to a silly level, on this basis all networking drivers would
> not be in the kernel. They are there mainly to provide userspace access to
> a network.

Networking is an odd choice to bring into this discussion because that
subsystem has a long history of wrestling with the "kernel bypass"
concern. It has largely been able to weather the storm of calls to get
out of the way and let vendor drivers have free reign.

The AF_XDP socket family was the result of finding a path to let
userspace networking stacks build functionality without forfeiting the
relevance and ongoing collaboration on the in-kernel stack.

> Any of the hardware access subsystems such hwmon, input, IIO
> etc are primarily about providing a convenient way to get data to/from
> a device. They are kernel drivers because that is the cleaner path
> for data marshaling, interrupt handling etc.

Those are drivers supporting a subsystem to bring a sane kernel
interface to front potenitally multiple vendor implementations of
similar functionality.

They are not asking for kernel bypass facilities that defeat the purpose
of ever talking to the kernel community again for potentially
system-integrity violating functionality behind disparate vendor
interfaces.

> In kernel users are a perfectly valid reason to have a kernel driver,
> but it's far from the only one. None of the AI accelerators have in kernel
> users today (maybe they will in future). Sure there are other arguments
> that mean only a few such devices have been upstreamed, but it's not
> that they need in kernel users. If it's really an issue I'll just submit
> it to driver/misc and Greg can take a view on whether it's an acceptable
> device to have driver for... (after he's asked the obvious question of
> why aren't the CXL folk taking it!) +cc Greg to save providing info later.

AI accelerators are heavy consumers of the core-mm you can not
reasonably coordinate with the core-mm from userspace.

If the proposal is to build a new CXL Fabric Management subsystem with
proper ABIs and openly defined command sets that will sit behind thought
out kernel interfaces then I can get on board with that.

Where I am stuck currently is the assertion that step 1 is "build ioctl
passthrough tunnels with 'do anything you want and get away with it'
semantics".

Recall that the current restriction for raw commands was to encourage
vendor collaboration and building sane kernel interfaces, and that
distros would enable it in their "debug" kernels to enable hardware
validation test benches. If the assertion is "that's too restrictive,
enable a vendor ecosystem based on kernel bypass" that goes too far.

> For background this is a PCI function with a mailbox used for switch
> configuration. The mailbox is identical to the one found on CXL type3
> devices. Whole thing defined in the CXL spec. It gets a little complex
> because you can tunnel commands to devices connected to the switch,
> potentially affecting other hosts. Typical Linux device doing this
> would be a BMC, but there have been repeated questions about providing
> a subset of access to any Linux system (avoiding the foot guns)
> Whole thing fully discoverable - proposal is a standard PCI driver.
>
> > The generic-Type-3-device mailbox has an in kernel driver because the
> > kernel has need to send mailbox commands internally and it is
> > fundamental to RAS and provisioning flows that the kernel have this
> > coordination. What are the motivations for an in-band Switch CCI command
> > submission path?
> >
> > It could be the case that you have a self-evident example in mind that I
> > have thus far failed to realize.
> >
>
> There are possibilities, but for now it's a transport driver just like
> MCTP etc with a well defined chardev interface, with documented ioctl
> interface etc (which I'd keep inline with one the CXL mailbox uses
> just to avoid reinventing the wheel - I'd prefer to use that directly
> to avoid divergence but I don't care that much).
>
> As far as I can see, with the security / blast radius concern alleviated
> by disabling this if lockdown is in use + taint for unaudited commands
> (and a nasty sounding config similar to the cxl mailbox one)
> there is little reason not to take such a driver into the kernel.
> It has next to no maintenance impact outside of itself and a bit of
> library code which I've proposed pushing down to the level of MMPT
> (so PCI not CXL) if you think that is necessary.
>
> We want interrupt handling and basic access controls / command
> interface to userspace.
>
> Apologies if I'm grumpy - several long days of battling cpu hotplug code.

Again, can we please get back to the specifics of the commands to be
enabled here? I am open to CXL Fabric Management as a first class
citizen, I am not currently open to CXL Fabric Management gets to live
in the corner of the kernel that is unreviewable because all it does is
take opaque ioctl blobs and marshal them to hardware.

2024-04-26 08:46:08

by Jonathan Cameron

[permalink] [raw]
Subject: Re: RFC: Restricting userspace interfaces for CXL fabric management

On Thu, 25 Apr 2024 12:25:35 -0700
Dan Williams <[email protected]> wrote:

> Jonathan Cameron wrote:
> > Dan Williams <[email protected]> wrote:
> > > The minimum requirement for justifying an in kernel driver is that
> > > something else in the kernel consumes that facility. So, again, I want
> > > to get back to specifics what else in the kernel is going to leverage
> > > the Switch CCI mailbox?
> >
> > Why? I've never heard of such as a requirement and numerous drivers
> > provide fairly direct access to hardware. Sometimes there is a subsystem
> > aiding the data formatting etc, but fundamentally that's a convenience.
> >
> > Taking this to a silly level, on this basis all networking drivers would
> > not be in the kernel. They are there mainly to provide userspace access to
> > a network.
>
> Networking is an odd choice to bring into this discussion because that
> subsystem has a long history of wrestling with the "kernel bypass"
> concern. It has largely been able to weather the storm of calls to get
> out of the way and let vendor drivers have free reign.
>
> The AF_XDP socket family was the result of finding a path to let
> userspace networking stacks build functionality without forfeiting the
> relevance and ongoing collaboration on the in-kernel stack.

This first chunk of my reply was all about poking holes in the 'must
have an in kernel user' not trying to make any broader points. That argument
is one bullet of perhaps 20 good reasons for in-kernel support, you've listed
quite a few more so point hopefully made.

I fully agree with the other reasons you've given for why things
are in the kernel, but key here is we don't need to meet the
in-kernel user requirement. Here the focus is on encouraging standards
adoption.

>
> > Any of the hardware access subsystems such hwmon, input, IIO
> > etc are primarily about providing a convenient way to get data to/from
> > a device. They are kernel drivers because that is the cleaner path
> > for data marshaling, interrupt handling etc.
>
> Those are drivers supporting a subsystem to bring a sane kernel
> interface to front potenitally multiple vendor implementations of
> similar functionality.

I have painful memories of early feedback on IIO that was very similar
to arguments here - it wasn't far off killing of the whole effort.

>
> They are not asking for kernel bypass facilities that defeat the purpose
> of ever talking to the kernel community again for potentially
> system-integrity violating functionality behind disparate vendor
> interfaces.

I'm not asking for that at all. This discussion got a bit derailed
though so I can see why you might have taken that impression away.

>
> > In kernel users are a perfectly valid reason to have a kernel driver,
> > but it's far from the only one. None of the AI accelerators have in kernel
> > users today (maybe they will in future). Sure there are other arguments
> > that mean only a few such devices have been upstreamed, but it's not
> > that they need in kernel users. If it's really an issue I'll just submit
> > it to driver/misc and Greg can take a view on whether it's an acceptable
> > device to have driver for... (after he's asked the obvious question of
> > why aren't the CXL folk taking it!) +cc Greg to save providing info later.
>
> AI accelerators are heavy consumers of the core-mm you can not
> reasonably coordinate with the core-mm from userspace.
>
> If the proposal is to build a new CXL Fabric Management subsystem with
> proper ABIs and openly defined command sets that will sit behind thought
> out kernel interfaces then I can get on board with that.
>
> Where I am stuck currently is the assertion that step 1 is "build ioctl
> passthrough tunnels with 'do anything you want and get away with it'
> semantics".

If you think step 1 is the end goal, then indeed I see your problem.

Reality is that this stuff is needed today and similar to type 3 there
will be people who use the raw interface to enabled their custom stuff
- we can enable that but put similar measures in place to to encourage
upstreaming / standarization. I should perhaps not have diverted
into the fact openBMC 'might' patch out the taint. I suspect some
server distros will do that for type 3 devices to, but the advantages
to vendors of playing nicely with upstream still apply. I do however
want to encourage openBMC folk to collaborate on support and if that
means letting them play fast and loose with what they do downstream for
now that's fine - they will not want to carry such patches long term anyway.

The end goal is to enable multiple things:

1) Standardization: That's exactly why I proposed that if broadcomm want
to have upstream support for their interfaces they should do two things:
a) Take a proposal to the consortium to extend the standard definition
to incorporate those features (we want to see them playing the game
so everyone benefits!)
b) Provide clear description of the vendor defined commands so that
the safe ones can be audited and enabled as 'defacto' standards.
We do this with lots of other things - if you are playing the game
nicely we relax the rules a little to encourage you to keep doing so.
This is the path being proposed for the upstream driver.

2) A path sharing that same standard infrastructure for the more destructive
commands. That can have all the protections we want to add.
For the BMC stacks some of that stuff will be vital and it's complex
so there is an argument that not all that belongs in the kernel or at
least not today. One of the other comments I recall for the raw command
support in the type 3 driver was to allow people to test stuff that was
safe, was in the spec, but was not yet audited by the driver. I'd put
a lot of this in that category. We can absolutely do tunneled commands
nicely and filter them for destructive activity etc - it just doesn't
belong in the first patch set.
Ultimately I would expect approaches to enable the destructive commands
as well via standard interfaces, but those will of course need guard
rails / opt in etc.

I never intended to propose that the raw command path is the main one
used - that's there for the same reasons we have one in CXL type3, though
here I suspect the path to supporting everything needed via non
raw interfaces will be longer.

>
> Recall that the current restriction for raw commands was to encourage
> vendor collaboration and building sane kernel interfaces, and that
> distros would enable it in their "debug" kernels to enable hardware
> validation test benches. If the assertion is "that's too restrictive,
> enable a vendor ecosystem based on kernel bypass" that goes too far.

This is a clear description of the concern and it's one I fully share.
My interpretation of your earlier responses was clearly wrong - this
is about avoiding kernel bypass and encouraging standardization.
Your comments on doing user space drivers read to me like you were
telling the switch vendors to go do exactly that and that is why
I was pushing back strongly! All the other responses were based on
me trying to answer other concerns (maintainability for example)
rather than this one.

WRT to relationship to the raw commands: this is the same proposal.
Much like the initial CXL mailbox support provide a raw bypass for
the cases not supported and put the same guards around it (or tighter
ones). That is better in the upstream kernel than pushing the problem
downstream and basically removing any reason at all for switch vendors
to do things right. Give them a path and they will come. So far
what I suspect they are seeing is no path at all - driving fragmentation.
(handy switch vendors in the CC list feel free to say how you are interpretting
this discussion!)

If we'd had upstream support already in place with those restrictions
the rules for switch vendors would have been clear and (assuming they
have some active ecosystem folk) their design would have taken this
into account.

The early switch-cci code proposal was exactly the same as the CXL mailbox.
It had a bunch of commands (IIRC) included get temperature, get timestamp etc
that were safe and handled in exactly the same fashion as the in
kernel interfaces. The reason that I ripped that out was to reduce
the amount and complexity of the first proposal so the focus would lie
on the necessary refactoring of the CXL core to allow for code reuse.
That stuff will comeback in patch set 2 - we can easily put it back in
the first patch set if that helps make the strategy clear - it's just
a few lines of code.

All the infrastructure is still there to support safe commands vs unsafe ones.
Same logic, same approach and actually the same code as for the CXL mailbox.
I fully agree with all that approach.

>
> > For background this is a PCI function with a mailbox used for switch
> > configuration. The mailbox is identical to the one found on CXL type3
> > devices. Whole thing defined in the CXL spec. It gets a little complex
> > because you can tunnel commands to devices connected to the switch,
> > potentially affecting other hosts. Typical Linux device doing this
> > would be a BMC, but there have been repeated questions about providing
> > a subset of access to any Linux system (avoiding the foot guns)
> > Whole thing fully discoverable - proposal is a standard PCI driver.
> >
> > > The generic-Type-3-device mailbox has an in kernel driver because the
> > > kernel has need to send mailbox commands internally and it is
> > > fundamental to RAS and provisioning flows that the kernel have this
> > > coordination. What are the motivations for an in-band Switch CCI command
> > > submission path?
> > >
> > > It could be the case that you have a self-evident example in mind that I
> > > have thus far failed to realize.
> > >
> >
> > There are possibilities, but for now it's a transport driver just like
> > MCTP etc with a well defined chardev interface, with documented ioctl
> > interface etc (which I'd keep inline with one the CXL mailbox uses
> > just to avoid reinventing the wheel - I'd prefer to use that directly
> > to avoid divergence but I don't care that much).
> >
> > As far as I can see, with the security / blast radius concern alleviated
> > by disabling this if lockdown is in use + taint for unaudited commands
> > (and a nasty sounding config similar to the cxl mailbox one)
> > there is little reason not to take such a driver into the kernel.
> > It has next to no maintenance impact outside of itself and a bit of
> > library code which I've proposed pushing down to the level of MMPT
> > (so PCI not CXL) if you think that is necessary.
> >
> > We want interrupt handling and basic access controls / command
> > interface to userspace.
> >
> > Apologies if I'm grumpy - several long days of battling cpu hotplug code.
>
> Again, can we please get back to the specifics of the commands to be
> enabled here? I am open to CXL Fabric Management as a first class
> citizen, I am not currently open to CXL Fabric Management gets to live
> in the corner of the kernel that is unreviewable because all it does is
> take opaque ioctl blobs and marshal them to hardware.
>

That I'm fully on board with. However there are plenty of spec defined
commands that in this safe category and to show how this works we can
implement those in parallel to a discussion about those defined outside
the CXL specification.

Encouraging people to do things via a hacky userspace driver as I read
you as doing earlier in this thread will have given exactly the opposite
impression.

To give people an incentive to play the standards game we have to
provide an alternative. Userspace libraries will provide some incentive
to standardize if we have enough vendors (we don't today - so they will
do their own libraries), but it is a lot easier to encourage if we
exercise control over the interface.

Jonathan


2024-04-26 16:37:57

by Dan Williams

[permalink] [raw]
Subject: Re: RFC: Restricting userspace interfaces for CXL fabric management

Jonathan Cameron wrote:
[..]
> To give people an incentive to play the standards game we have to
> provide an alternative. Userspace libraries will provide some incentive
> to standardize if we have enough vendors (we don't today - so they will
> do their own libraries), but it is a lot easier to encourage if we
> exercise control over the interface.

Yes, and I expect you and I are not far off on what can be done
here.

However, lets cut to a sentiment hanging over this discussion. Referring
to vendor specific commands:

"CXL spec has them for a reason and they need to be supported."

..that is an aggressive "vendor specific first" sentiment that
generates an aggressive "userspace drivers" reaction, because the best
way to get around community discussions about what ABI makes sense is
userspace drivers.

Now, if we can step back to where this discussion started, where typical
Linux collaboration shines, and where I think you and I are more aligned
than this thread would indicate, is "vendor specific last". Lets
carefully consider the vendor specific commands that are candidates to
be de facto cross vendor semantics if not de jure standards.

2024-04-26 16:53:58

by Jonathan Cameron

[permalink] [raw]
Subject: Re: RFC: Restricting userspace interfaces for CXL fabric management

On Fri, 26 Apr 2024 09:16:44 -0700
Dan Williams <[email protected]> wrote:

> Jonathan Cameron wrote:
> [..]
> > To give people an incentive to play the standards game we have to
> > provide an alternative. Userspace libraries will provide some incentive
> > to standardize if we have enough vendors (we don't today - so they will
> > do their own libraries), but it is a lot easier to encourage if we
> > exercise control over the interface.
>
> Yes, and I expect you and I are not far off on what can be done
> here.
>
> However, lets cut to a sentiment hanging over this discussion. Referring
> to vendor specific commands:
>
> "CXL spec has them for a reason and they need to be supported."
>
> ...that is an aggressive "vendor specific first" sentiment that
> generates an aggressive "userspace drivers" reaction, because the best
> way to get around community discussions about what ABI makes sense is
> userspace drivers.
>
> Now, if we can step back to where this discussion started, where typical
> Linux collaboration shines, and where I think you and I are more aligned
> than this thread would indicate, is "vendor specific last". Lets
> carefully consider the vendor specific commands that are candidates to
> be de facto cross vendor semantics if not de jure standards.
>

Agreed. I'd go a little further and say I generally have much more warm and
fuzzy feelings when what is a vendor defined command (today) maps to more
or less the same bit of code for a proposed standards ECN.

IP rules prevent us commenting on specific proposals, but there will be
things we review quicker and with a lighter touch vs others where we
ask lots of annoying questions about generality of the feature etc.
Given the effort we are putting in on the kernel side we all want CXL
to succeed and will do our best to encourage activities that make that
more likely. There are other standards bodies available... which may
make more sense for some features.

Command interfaces are not a good place to compete and maintain secrecy.
If vendors want to do that, then they don't get the pony of upstream
support. They get to convince distros to do a custom kernel build for them:
Good luck with that, some of those folk are 'blunt' in their responses to
such requests.

My proposal is we go forward with a bunch of the CXL spec defined commands
to show the 'how' and consider specific proposals for upstream support
of vendor defined commands on a case by case basis (so pretty much
what you say above). Maybe after a few are done we can formalize some
rules of thumb help vendors makes such proposals, though maybe some
will figure out it is a better and longer term solution to do 'standards
first development'.

I think we do need to look at the safety filtering of tunneled
commands but don't see that as a particularly tricky addition -
for the simple non destructive commands at least.

Jonathan


2024-04-26 20:41:19

by Harold Johnson

[permalink] [raw]
Subject: RE: RFC: Restricting userspace interfaces for CXL fabric management

Perhaps a bit more color on a few specifics might be helpful.

I think that there will always be a class of vendor specific APIs/Opcodes
that are related to an implementation of a standard instead of the
standard itself. I've been party to discussion on not creating CXL
defined API/Opcodes that get into the realm of specifying an
implementation. There are also a class of data that can be collected from
a specific implementation that is helpful for debug, for health
monitoring, and perhaps performance monitoring where the implementation
matters and therefore are not easily abstracted to a standard.

A few examples:
a) Temperature monitoring of a component or internal chip die
temperatures. Could CXL define a standard OpCode to gather temperatures,
yes it could; but is this really part of CXL? Then how many temperature
elements and what does each element mean? This enters into the
implementation and therefore is vendor specific. Unless the CXL spec
starts to define the implementation, something along the lines of "thou
shall have an average die temperature, rather than specific temperatures
across a die", etc.

b) Error counters, metrics, internal counters, etc. Could CXL define a
set of common error counters, absolutely. PCIe has done some of this.
However, a specific implementation may have counters and error reporting
that are meaningful only to a specific design and a specific
implementation rather than a "least common denominator" approach of a
standard body.

c) Performance counters, metric, indicators, etc. Performance can be very
implementation specific and tweaking performance is likely to be
implementation specific. Yes, generic and a least common denominator
elements could be created, but are likely to limiting in realizing the
maximum performance of an implementation.

d) Logs, errors and debug information. In addition to spec defined
logging of CXL topology errors, specific designs will have logs, crash
dumps, debug data that is very specific to a implementation. There are
likely to be cases where a product that conforms to a specification like
CXL, may have features that don't directly have anything to do with CXL,
but where a standards based management interface can be used to configure,
manage, and collect data for a non-CXL feature.

e) Innovation. I believe that innovation should be encouraged. There may
be designs that support CXL, but that also incorporate unique and
innovative features or functions that might service a niche market. The
AI space is ripe for innovation and perhaps specialized features that may
not make sense for the overall CXL specification.

I think that in most cases Vendor specific opcodes are not used to
circumvent the standards, but are used when the standards group has no
interested in driving into the standard certain features that are clearly
either implementation specific or are vendor specific additions that have
a specific appeal to a select class of customer, but yet are not relevant
to a specific standard.

At the end of the day, customer want products that solve a specific
problem. Sometimes vendor can address market segments or niches that a
standard group has no interest in supporting. It can also take months,
and in some cases years to reach an agreement on what standardized feature
should look like. I also believe that there can be competitive reasons
why there might be a group that wants to slow down a vendor's
implementation for fear of losing market share.

Thanks
Harold Johnson


-----Original Message-----
From: Jonathan Cameron [mailto:[email protected]]
Sent: Friday, April 26, 2024 11:54 AM
To: Dan Williams
Cc: [email protected]; Sreenivas Bagalkote; Brett Henning; Harold
Johnson; Sumanesh Samanta; [email protected]; Davidlohr Bueso;
Dave Jiang; Alison Schofield; Vishal Verma; Ira Weiny;
[email protected]; [email protected]; Lorenzo Pieralisi; Natu,
Mahesh; [email protected]
Subject: Re: RFC: Restricting userspace interfaces for CXL fabric
management

On Fri, 26 Apr 2024 09:16:44 -0700
Dan Williams <[email protected]> wrote:

> Jonathan Cameron wrote:
> [..]
> > To give people an incentive to play the standards game we have to
> > provide an alternative. Userspace libraries will provide some
incentive
> > to standardize if we have enough vendors (we don't today - so they
will
> > do their own libraries), but it is a lot easier to encourage if we
> > exercise control over the interface.
>
> Yes, and I expect you and I are not far off on what can be done
> here.
>
> However, lets cut to a sentiment hanging over this discussion. Referring
> to vendor specific commands:
>
> "CXL spec has them for a reason and they need to be supported."
>
> ...that is an aggressive "vendor specific first" sentiment that
> generates an aggressive "userspace drivers" reaction, because the best
> way to get around community discussions about what ABI makes sense is
> userspace drivers.
>
> Now, if we can step back to where this discussion started, where typical
> Linux collaboration shines, and where I think you and I are more aligned
> than this thread would indicate, is "vendor specific last". Lets
> carefully consider the vendor specific commands that are candidates to
> be de facto cross vendor semantics if not de jure standards.
>

Agreed. I'd go a little further and say I generally have much more warm
and
fuzzy feelings when what is a vendor defined command (today) maps to more
or less the same bit of code for a proposed standards ECN.

IP rules prevent us commenting on specific proposals, but there will be
things we review quicker and with a lighter touch vs others where we
ask lots of annoying questions about generality of the feature etc.
Given the effort we are putting in on the kernel side we all want CXL
to succeed and will do our best to encourage activities that make that
more likely. There are other standards bodies available... which may
make more sense for some features.

Command interfaces are not a good place to compete and maintain secrecy.
If vendors want to do that, then they don't get the pony of upstream
support. They get to convince distros to do a custom kernel build for
them:
Good luck with that, some of those folk are 'blunt' in their responses to
such requests.

My proposal is we go forward with a bunch of the CXL spec defined commands
to show the 'how' and consider specific proposals for upstream support
of vendor defined commands on a case by case basis (so pretty much
what you say above). Maybe after a few are done we can formalize some
rules of thumb help vendors makes such proposals, though maybe some
will figure out it is a better and longer term solution to do 'standards
first development'.

I think we do need to look at the safety filtering of tunneled
commands but don't see that as a particularly tricky addition -
for the simple non destructive commands at least.

Jonathan

--
This electronic communication and the information and any files transmitted
with it, or attached to it, are confidential and are intended solely for
the use of the individual or entity to whom it is addressed and may contain
information that is confidential, legally privileged, protected by privacy
laws, or otherwise restricted from disclosure to anyone else. If you are
not the intended recipient or the person responsible for delivering the
e-mail to the intended recipient, you are hereby notified that any use,
copying, distributing, dissemination, forwarding, printing, or copying of
this e-mail is strictly prohibited. If you received this e-mail in error,
please return the e-mail to the sender, delete it from your computer, and
destroy any printed copy of it.


Attachments:
smime.p7s (4.12 kB)
S/MIME Cryptographic Signature

2024-04-27 11:12:47

by Greg KH

[permalink] [raw]
Subject: Re: RFC: Restricting userspace interfaces for CXL fabric management

On Fri, Apr 26, 2024 at 02:25:29PM -0500, Harold Johnson wrote:
> A few examples:
> a) Temperature monitoring of a component or internal chip die
> temperatures. Could CXL define a standard OpCode to gather temperatures,
> yes it could; but is this really part of CXL? Then how many temperature
> elements and what does each element mean? This enters into the
> implementation and therefore is vendor specific. Unless the CXL spec
> starts to define the implementation, something along the lines of "thou
> shall have an average die temperature, rather than specific temperatures
> across a die", etc.
>
> b) Error counters, metrics, internal counters, etc. Could CXL define a
> set of common error counters, absolutely. PCIe has done some of this.
> However, a specific implementation may have counters and error reporting
> that are meaningful only to a specific design and a specific
> implementation rather than a "least common denominator" approach of a
> standard body.
>
> c) Performance counters, metric, indicators, etc. Performance can be very
> implementation specific and tweaking performance is likely to be
> implementation specific. Yes, generic and a least common denominator
> elements could be created, but are likely to limiting in realizing the
> maximum performance of an implementation.
>
> d) Logs, errors and debug information. In addition to spec defined
> logging of CXL topology errors, specific designs will have logs, crash
> dumps, debug data that is very specific to a implementation. There are
> likely to be cases where a product that conforms to a specification like
> CXL, may have features that don't directly have anything to do with CXL,
> but where a standards based management interface can be used to configure,
> manage, and collect data for a non-CXL feature.

All of the above should be able to be handled by vendor-specific KERNEL
drivers that feed the needed information to the proper user/kernel apis
that the kernel already provides.

So while innovating at the hardware level is fine, follow the ways that
everyone has done this for other specification types (USB, PCI, etc.)
and just allow vendor drivers to provide the information. Don't do this
in crazy userspace drivers which will circumvent the whole reason we
have standard kernel/user apis in the first place for these types of
things.

> e) Innovation. I believe that innovation should be encouraged. There may
> be designs that support CXL, but that also incorporate unique and
> innovative features or functions that might service a niche market. The
> AI space is ripe for innovation and perhaps specialized features that may
> not make sense for the overall CXL specification.
>
> I think that in most cases Vendor specific opcodes are not used to
> circumvent the standards, but are used when the standards group has no
> interested in driving into the standard certain features that are clearly
> either implementation specific or are vendor specific additions that have
> a specific appeal to a select class of customer, but yet are not relevant
> to a specific standard.

Then fight this out in the specification groups, which are highly
political, and do not push that into the kernel space please. Again,
this is nothing new, we have all done this for specs for decades now,
allow vendor additions to the spec and handle that in the kernel and all
should be ok, right?

Or am I missing something obvious here where we would NOT want to do
what all other specs have done?

thanks,

greg k-h

2024-04-27 16:22:47

by Dan Williams

[permalink] [raw]
Subject: Re: RFC: Restricting userspace interfaces for CXL fabric management

Greg KH wrote:
[..]
> So while innovating at the hardware level is fine, follow the ways that
> everyone has done this for other specification types (USB, PCI, etc.)
> and just allow vendor drivers to provide the information. Don't do this
> in crazy userspace drivers which will circumvent the whole reason we
> have standard kernel/user apis in the first place for these types of
> things.

Right, standard kernel/user apis is the requirement.

The suggestion of opaque vendor passthrough tunnels, and every vendor
ships their custom tool to do what should be common flows, is where this
discussion went off the rails.

2024-04-28 04:33:10

by Sirius

[permalink] [raw]
Subject: Re: RFC: Restricting userspace interfaces for CXL fabric management

In days of yore (Sat, 27 Apr 2024), Dan Williams thus quoth:
> Greg KH wrote:
> [..]
> > So while innovating at the hardware level is fine, follow the ways that
> > everyone has done this for other specification types (USB, PCI, etc.)
> > and just allow vendor drivers to provide the information. Don't do this
> > in crazy userspace drivers which will circumvent the whole reason we
> > have standard kernel/user apis in the first place for these types of
> > things.
>
> Right, standard kernel/user apis is the requirement.
>
> The suggestion of opaque vendor passthrough tunnels, and every vendor
> ships their custom tool to do what should be common flows, is where this
> discussion went off the rails.

One aspect of this is Fabric Management (thinking CXL3 here). It is not
desirable that every vendor of CXL hardware require their own
(proprietary) fabric management software. From a user perspective, that is
absolutely horrible. Users can, and will, mix and match CXL hardware
according to their needs (or wallet), and them having to run multiple
fabric management solutions (which in worst case are conflicting with each
other) to manage things is .. suboptimal.

By all means - innovate - but do it in such a way that interoperability
and manageability is the priority. Special sauce vendor lock-in is a
surefire way to kill CXL where it stands - don't do it.

--
Kind regards,

/S

2024-04-29 12:24:32

by Jonathan Cameron

[permalink] [raw]
Subject: Re: RFC: Restricting userspace interfaces for CXL fabric management

On Fri, 26 Apr 2024 14:25:29 -0500
Harold Johnson <[email protected]> wrote:

> Perhaps a bit more color on a few specifics might be helpful.
>
> I think that there will always be a class of vendor specific APIs/Opcodes
> that are related to an implementation of a standard instead of the
> standard itself. I've been party to discussion on not creating CXL
> defined API/Opcodes that get into the realm of specifying an
> implementation. There are also a class of data that can be collected from
> a specific implementation that is helpful for debug, for health
> monitoring, and perhaps performance monitoring where the implementation
> matters and therefore are not easily abstracted to a standard.

Hi Harold,

Let's divert into a few specifics to give some routes to implementing these.
Some of them are extensions of things already well handled. Tweaks
and extensions in the 'spirit' of the existing spec are both places where
adding some richness to the spec is probably not too difficult and where
there may be some flexibility.

In some cases the definitions in the specification almost certainly
came after your design.

>
> A few examples:
> a) Temperature monitoring of a component or internal chip die
> temperatures. Could CXL define a standard OpCode to gather temperatures,
> yes it could; but is this really part of CXL? Then how many temperature
> elements and what does each element mean? This enters into the
> implementation and therefore is vendor specific. Unless the CXL spec
> starts to define the implementation, something along the lines of "thou
> shall have an average die temperature, rather than specific temperatures
> across a die", etc.

There is a general temperature monitoring and trip thresholds etc in the
spec via get Health Info and event logs. That covers a single value,
so not as extensive as what you refer to but it's a start. Whilst you
are right that the specification should not mandate N temperatures in
specific locations, it would be a reasonable request to allow for a
wider set of monitors with some level of description. The intent
being that generic software can discover what is there and present
that info for logging / monitoring.
Whilst the exact meaning will vary by device, some broad scoping is
easy enough (memory controller vs various FRUs perhaps) and that is useful
to providing generic users space software. hmwon has long handled this
sort of data from whole systems where very similar aspects of 'what does
this monitor actually mean' apply. Sometimes that does need device
specific mapping files - so there is precedence that may be helpful here.

>
> b) Error counters, metrics, internal counters, etc. Could CXL define a
> set of common error counters, absolutely. PCIe has done some of this.
> However, a specific implementation may have counters and error reporting
> that are meaningful only to a specific design and a specific
> implementation rather than a "least common denominator" approach of a
> standard body.
>
> c) Performance counters, metric, indicators, etc. Performance can be very
> implementation specific and tweaking performance is likely to be
> implementation specific. Yes, generic and a least common denominator
> elements could be created, but are likely to limiting in realizing the
> maximum performance of an implementation.
These two are somewhat related.

This selection falls into two broad categories.
- Opaque logging and reporting needed just for debug. There are patches
on list for Vendor Debug Log. They haven't merged yet though I think,
so good to get your input on that series. Intent of that feature is
opaque logs. There will never be a useful general tool for that stuff,
conversely no general userspace tools will use them as a result.
It's likely vendor engineer only territory.

- Counters that useful at runtime.
The CXL PMU spec is rich and flexible. In common with CPU PMUs the perf driver
allows for direct specification of events to count. The same issue with
implementation specific counters occurs in CPUs. Whilst you can keep
the meaning of events opaque, if you do you loose out on useable general
purpose software (e.g. perftool). Alternatively some of this can be
pushed into perf tool description files.
The advantage of pushing counters in to the main specification is that
they work out of the box as the CPMU driver will report those directly
(no need for perf tool to work with raw event codes).
At the moment that driver implements part of the CPMU spec. If there are
features you need (free running counters come to mind) then shout
/ patches welcome. We decided to play wait and see with the CPMU driver as
it wasn't clear if the full flexibility was needed for real devices.

>
> d) Logs, errors and debug information. In addition to spec defined
> logging of CXL topology errors, specific designs will have logs, crash
> dumps, debug data that is very specific to a implementation. There are
> likely to be cases where a product that conforms to a specification like
> CXL, may have features that don't directly have anything to do with CXL,
> but where a standards based management interface can be used to configure,
> manage, and collect data for a non-CXL feature.

There are standard interfaces defined for some of this stuff as well.
Last I checked the discussion revolved around component state dump
only being safe to use if the background command abort was supported,
as that was needed to ensure the kernel was not locked out for an indefinite
amount of time.

If you are looking at other standards being run to configure CXL devices
excellent, but those standards should be accessed using whatever the
appropriate kernel interfaces. Could you give some examples for this one?

>
> e) Innovation. I believe that innovation should be encouraged. There may
> be designs that support CXL, but that also incorporate unique and
> innovative features or functions that might service a niche market. The
> AI space is ripe for innovation and perhaps specialized features that may
> not make sense for the overall CXL specification.

Agreed - but those may need specific drivers.

>
> I think that in most cases Vendor specific opcodes are not used to
> circumvent the standards, but are used when the standards group has no
> interested in driving into the standard certain features that are clearly
> either implementation specific or are vendor specific additions that have
> a specific appeal to a select class of customer, but yet are not relevant
> to a specific standard.
>
> At the end of the day, customer want products that solve a specific
> problem. Sometimes vendor can address market segments or niches that a
> standard group has no interest in supporting. It can also take months,
> and in some cases years to reach an agreement on what standardized feature
> should look like. I also believe that there can be competitive reasons
> why there might be a group that wants to slow down a vendor's
> implementation for fear of losing market share.

Whilst I appreciate the way this can slow down adoption / kernel support
it's a path that is still worth it in the end.

As Dan and others have put it, there are other routes than the main
CXL standard. Those should allow tighter collaboration on smaller topics
to define common standards.

Thanks,

Jonathan


>
> Thanks
> Harold Johnson
>
>
> -----Original Message-----
> From: Jonathan Cameron [mailto:[email protected]]
> Sent: Friday, April 26, 2024 11:54 AM
> To: Dan Williams
> Cc: [email protected]; Sreenivas Bagalkote; Brett Henning; Harold
> Johnson; Sumanesh Samanta; [email protected]; Davidlohr Bueso;
> Dave Jiang; Alison Schofield; Vishal Verma; Ira Weiny;
> [email protected]; [email protected]; Lorenzo Pieralisi; Natu,
> Mahesh; [email protected]
> Subject: Re: RFC: Restricting userspace interfaces for CXL fabric
> management
>
> On Fri, 26 Apr 2024 09:16:44 -0700
> Dan Williams <[email protected]> wrote:
>
> > Jonathan Cameron wrote:
> > [..]
> > > To give people an incentive to play the standards game we have to
> > > provide an alternative. Userspace libraries will provide some
> incentive
> > > to standardize if we have enough vendors (we don't today - so they
> will
> > > do their own libraries), but it is a lot easier to encourage if we
> > > exercise control over the interface.
> >
> > Yes, and I expect you and I are not far off on what can be done
> > here.
> >
> > However, lets cut to a sentiment hanging over this discussion. Referring
> > to vendor specific commands:
> >
> > "CXL spec has them for a reason and they need to be supported."
> >
> > ...that is an aggressive "vendor specific first" sentiment that
> > generates an aggressive "userspace drivers" reaction, because the best
> > way to get around community discussions about what ABI makes sense is
> > userspace drivers.
> >
> > Now, if we can step back to where this discussion started, where typical
> > Linux collaboration shines, and where I think you and I are more aligned
> > than this thread would indicate, is "vendor specific last". Lets
> > carefully consider the vendor specific commands that are candidates to
> > be de facto cross vendor semantics if not de jure standards.
> >
>
> Agreed. I'd go a little further and say I generally have much more warm
> and
> fuzzy feelings when what is a vendor defined command (today) maps to more
> or less the same bit of code for a proposed standards ECN.
>
> IP rules prevent us commenting on specific proposals, but there will be
> things we review quicker and with a lighter touch vs others where we
> ask lots of annoying questions about generality of the feature etc.
> Given the effort we are putting in on the kernel side we all want CXL
> to succeed and will do our best to encourage activities that make that
> more likely. There are other standards bodies available... which may
> make more sense for some features.
>
> Command interfaces are not a good place to compete and maintain secrecy.
> If vendors want to do that, then they don't get the pony of upstream
> support. They get to convince distros to do a custom kernel build for
> them:
> Good luck with that, some of those folk are 'blunt' in their responses to
> such requests.
>
> My proposal is we go forward with a bunch of the CXL spec defined commands
> to show the 'how' and consider specific proposals for upstream support
> of vendor defined commands on a case by case basis (so pretty much
> what you say above). Maybe after a few are done we can formalize some
> rules of thumb help vendors makes such proposals, though maybe some
> will figure out it is a better and longer term solution to do 'standards
> first development'.
>
> I think we do need to look at the safety filtering of tunneled
> commands but don't see that as a particularly tricky addition -
> for the simple non destructive commands at least.
>
> Jonathan
>