LinuxLists.cc - RFC: Restricting userspace interfaces for CXL fabric management

2024-03-21 17:55:19

Subject: RFC: Restricting userspace interfaces for CXL fabric management

Hi All,

This is has come up in a number of discussions both on list and in private,
so I wanted to lay out a potential set of rules when deciding whether or not
to provide a user space interface for a particular feature of CXL Fabric
Management. The intent is to drive discussion, not to simply tell people
a set of rules. I've brought this to the public lists as it's a Linux kernel
policy discussion, not a standards one.

Whilst I'm writing the RFC this my attempt to summarize a possible
position rather than necessarily being my personal view.

It's a straw man - shoot at it!

Not everyone in this discussion is familiar with relevant kernel or CXL concepts
so I've provided more info than I normally would.

First some background:
======================

CXL has two different types of Fabric. The comments here refer to both, but
for now the kernel stack is focused on the simpler VCS fabric, not the more
recent Port Based Routing (PBR) Fabrics. A typical example for 2 hosts
connected to a common switch looks something like:

________________ _______________
| | | | Hosts - each sees
| HOST A | | HOST B | a PCIe style tree
| | | | but from a fabric config
| |Root Port| | | |Root Port| | point of view it's more
-------|-------- -------|------- complex.
| |
| |
_______|______________________________|________
| USP (SW-CCI) USP | Switch can have lots of
| | | | Upstream Ports. Each one
| ____|________ _______|______ | has a virtual hierarchy.
| | | | | |
| vPPB vPPB vPPB vPPB| There are virtual
| x | | | | "downstream ports."(vPPBs)
| \ / / | That can be bound to real
| \ / / | downstream ports.
| \ / / |
| \ / / | Multi Logical Devices are
| DSP0 DSP1 DSP 2 | support more than one vPPB
------------------------------------------------ bound to a single physical
| | | DSP (transactions are tagged
| | | with an LD-ID)
SLD0 MLD0 SLD1

Some typical fabric management activities:
1) Bind/Unbind vPPB to physical DSP (Results in hotplug / unplug events)
2) Access config space or BAR space of End Points below the switch.
3) Tunneling messages through to devices downstream (e.g Dynamic Capacity
Forced Remove that will blow away some memory even if a host is using it).
4) Non destructive stuff like status read back.

Given the hosts may be using the Type 3 hosted memory (either Single Logical
Device - SLD, or an LD on a Multi logical Device - MLD) as normal memory,
unbinding a device in use can result in the memory access from a
different host being removed. The 'blast radius' is perhaps a rack of
servers. This discussion applies equally to FM-API commands sent to Multi
Head Devices (see CXL r3.1).

The Fabric Management actions are done using the CXL spec defined Fabric
Management API, (FM-API) which is transported over various means including
OoB MCTP over your favourite transport (I2C, PCIe-VDM...) or via normal
PCIe read/write to a Switch-CCI. A Switch-CCI is mailbox in PCI BAR
space on a function found alongside one of the switch upstream ports;
this mailbox is very similar to the MMPT definition found in PCIe r6.2.

In many cases this switch CCI / MCTP connection is used by a BMC rather
than a normal host, but there have been some questions raised about whether
a general purpose server OS would have a valid reason to use this interface
(beyond debug and testing) to configure the switch or an MHD.

If people have a use case for this, please reply to this thread to give
more details.

The most recently posted CXL Switch-CCI support only provided the RAW CXL
command IOCTL interface that is already available for Type 3 memory devices.
That allows for unfettered control of the switch but, because it is
extremely easy to shoot yourself in the foot and cause unsolvable bug reports,
it taints the kernel. There have been several requests to provide this interface
without the taint for these switch configuration mailboxes.

Last posted series:
https://lore.kernel.org/all/[email protected]/
Note there are unrelated reasons why that code hasn't been updated since v6.6 time,
but I am planning to get back to it shortly.

Similar issues will occur for other uses of PCIe MMPT (new mailbox in PCI that
sometimes is used for similarly destructive activity such as PLDM based
firmware update).

On to the proposed rules:

1) Kernel space use of the various mailboxes, or filtered controls from user space.
==================================================================================

Absolutely fine - no one worries about this, but the mediated traffic will
be filtered for potentially destructive side effects. E.g. it will reject
attempts to change anything routing related if the kernel either knows a host is
using memory that will be blown away, or has no way to know (so affecting
routing to another host). This includes blocking 'all' vendor defined
messages as we have no idea what the do. Note this means the kernel has
an allow list and new commands are not initially allowed.

This isn't currently enabled for Switch CCIs because they are only really
interesting if the potentially destructive stuff is available (an earlier
version did enable query commands, but it wasn't particularly useful to
know what your switch could do but not be allowed to do any of it).
If you take a MMPT usecase of PLDM firmware update, the filtering would
check that the device was in a state where a firmware update won't rip
memory out from under a host, which would be messy if that host is
doing the update.

2) Unfiltered userspace use of mailbox for Fabric Management - BMC kernels
==========================================================================

(This would just be a kernel option that we'd advise normal server
distributions not to turn on. Would be enabled by openBMC etc)

This is fine - there is some work to do, but the switch-cci PCI driver
will hopefully be ready for upstream merge soon. There is no filtering of
accesses. Think of this as similar to all the damage you can do via
MCTP from a BMC. Similarly it is likely that much of the complexity
of the actual commands will be left to user space tooling:
https://gitlab.com/jic23/cxl-fmapi-tests has some test examples.

Whether Kconfig help text is strong enough to ensure this only gets
enabled for BMC targeted distros is an open question we can address
alongside an updated patch set.

(On to the one that the "debate" is about)

3) Unfiltered user space use of mailbox for Fabric Management - Distro kernels
=============================================================================
(General purpose Linux Server Distro (Redhat, Suse etc))

This is equivalent of RAW command support on CXL Type 3 memory devices.
You can enable those in a distro kernel build despite the scary config
help text, but if you use it the kernel is tainted. The result
of the taint is to add a flag to bug reports and print a big message to say
that you've used a feature that might result in you shooting yourself
in the foot.

The taint is there because software is not at first written to deal with
everything that can happen smoothly (e.g. surprise removal) It's hard
to survive some of these events, so is never on the initial feature list
for any bus, so this flag is just to indicate we have entered a world
where almost all bets are off wrt to stability. We might not know what
a command does so we can't assess the impact (and no one trusts vendor
commands to report affects right in the Command Effects Log - which
in theory tells you if a command can result problems).

A concern was raised about GAE/FAST/LDST tables for CXL Fabrics
(a r3.1 feature) but, as I understand it, these are intended for a
host to configure and should not have side effects on other hosts?
My working assumption is that the kernel driver stack will handle
these (once we catch up with the current feature backlog!) Currently
we have no visibility of what the OS driver stack for a fabrics will
actually look like - the spec is just the starting point for that.
(patches welcome ;)

The various CXL upstream developers and maintainers may have
differing views of course, but my current understanding is we want
to support 1 and 2, but are very resistant to 3!

General Notes
=============

One side aspect of why we really don't like unfiltered userspace access to any
of these devices is that people start building non standard hacks in and we
lose the ecosystem advantages. Forcing a considered discussion + patches
to let a particular command be supported, drives standardization.

https://lore.kernel.org/linux-cxl/CAPcyv4gDShAYih5iWabKg_eTHhuHm54vEAei8ZkcmHnPp3B0cw@mail.gmail.com/
provides some history on vendor specific extensions and why in general we
won't support them upstream.

To address another question raised in an earlier discussion:
Putting these Fabric Management interfaces behind guard rails of some type
(e.g. CONFIG_IM_A_BMC_AND_CAN_MAKE_A_MESS) does not encourage the risk
of non standard interfaces, because we will be even less likely to accept
those upstream!

If anyone needs more details on any aspect of this please ask.
There are a lot of things involved and I've only tried to give a fairly
minimal illustration to drive the discussion. I may well have missed
something crucial.

Jonathan

2024-03-21 21:41:45

On Fri, 26 Apr 2024 09:16:44 -0700
Dan Williams <[email protected]> wrote:

> Jonathan Cameron wrote:
> [..]
> > To give people an incentive to play the standards game we have to
> > provide an alternative. Userspace libraries will provide some incentive
> > to standardize if we have enough vendors (we don't today - so they will
> > do their own libraries), but it is a lot easier to encourage if we
> > exercise control over the interface.
>
> Yes, and I expect you and I are not far off on what can be done
> here.
>
> However, lets cut to a sentiment hanging over this discussion. Referring
> to vendor specific commands:
>
> "CXL spec has them for a reason and they need to be supported."
>
> ...that is an aggressive "vendor specific first" sentiment that
> generates an aggressive "userspace drivers" reaction, because the best
> way to get around community discussions about what ABI makes sense is
> userspace drivers.
>
> Now, if we can step back to where this discussion started, where typical
> Linux collaboration shines, and where I think you and I are more aligned
> than this thread would indicate, is "vendor specific last". Lets
> carefully consider the vendor specific commands that are candidates to
> be de facto cross vendor semantics if not de jure standards.
>

Agreed. I'd go a little further and say I generally have much more warm and
fuzzy feelings when what is a vendor defined command (today) maps to more
or less the same bit of code for a proposed standards ECN.

IP rules prevent us commenting on specific proposals, but there will be
things we review quicker and with a lighter touch vs others where we
ask lots of annoying questions about generality of the feature etc.
Given the effort we are putting in on the kernel side we all want CXL
to succeed and will do our best to encourage activities that make that
more likely. There are other standards bodies available... which may
make more sense for some features.

Command interfaces are not a good place to compete and maintain secrecy.
If vendors want to do that, then they don't get the pony of upstream
support. They get to convince distros to do a custom kernel build for them:
Good luck with that, some of those folk are 'blunt' in their responses to
such requests.

My proposal is we go forward with a bunch of the CXL spec defined commands
to show the 'how' and consider specific proposals for upstream support
of vendor defined commands on a case by case basis (so pretty much
what you say above). Maybe after a few are done we can formalize some
rules of thumb help vendors makes such proposals, though maybe some
will figure out it is a better and longer term solution to do 'standards
first development'.

I think we do need to look at the safety filtering of tunneled
commands but don't see that as a particularly tricky addition -
for the simple non destructive commands at least.

Jonathan

2024-04-26 20:41:19

On Fri, 26 Apr 2024 14:25:29 -0500
Harold Johnson <[email protected]> wrote:

> Perhaps a bit more color on a few specifics might be helpful.
>
> I think that there will always be a class of vendor specific APIs/Opcodes
> that are related to an implementation of a standard instead of the
> standard itself. I've been party to discussion on not creating CXL
> defined API/Opcodes that get into the realm of specifying an
> implementation. There are also a class of data that can be collected from
> a specific implementation that is helpful for debug, for health
> monitoring, and perhaps performance monitoring where the implementation
> matters and therefore are not easily abstracted to a standard.

Hi Harold,

Let's divert into a few specifics to give some routes to implementing these.
Some of them are extensions of things already well handled. Tweaks
and extensions in the 'spirit' of the existing spec are both places where
adding some richness to the spec is probably not too difficult and where
there may be some flexibility.

In some cases the definitions in the specification almost certainly
came after your design.

>
> A few examples:
> a) Temperature monitoring of a component or internal chip die
> temperatures. Could CXL define a standard OpCode to gather temperatures,
> yes it could; but is this really part of CXL? Then how many temperature
> elements and what does each element mean? This enters into the
> implementation and therefore is vendor specific. Unless the CXL spec
> starts to define the implementation, something along the lines of "thou
> shall have an average die temperature, rather than specific temperatures
> across a die", etc.

There is a general temperature monitoring and trip thresholds etc in the
spec via get Health Info and event logs. That covers a single value,
so not as extensive as what you refer to but it's a start. Whilst you
are right that the specification should not mandate N temperatures in
specific locations, it would be a reasonable request to allow for a
wider set of monitors with some level of description. The intent
being that generic software can discover what is there and present
that info for logging / monitoring.
Whilst the exact meaning will vary by device, some broad scoping is
easy enough (memory controller vs various FRUs perhaps) and that is useful
to providing generic users space software. hmwon has long handled this
sort of data from whole systems where very similar aspects of 'what does
this monitor actually mean' apply. Sometimes that does need device
specific mapping files - so there is precedence that may be helpful here.

>
> b) Error counters, metrics, internal counters, etc. Could CXL define a
> set of common error counters, absolutely. PCIe has done some of this.
> However, a specific implementation may have counters and error reporting
> that are meaningful only to a specific design and a specific
> implementation rather than a "least common denominator" approach of a
> standard body.
>
> c) Performance counters, metric, indicators, etc. Performance can be very
> implementation specific and tweaking performance is likely to be
> implementation specific. Yes, generic and a least common denominator
> elements could be created, but are likely to limiting in realizing the
> maximum performance of an implementation.
These two are somewhat related.

This selection falls into two broad categories.
- Opaque logging and reporting needed just for debug. There are patches
on list for Vendor Debug Log. They haven't merged yet though I think,
so good to get your input on that series. Intent of that feature is
opaque logs. There will never be a useful general tool for that stuff,
conversely no general userspace tools will use them as a result.
It's likely vendor engineer only territory.

- Counters that useful at runtime.
The CXL PMU spec is rich and flexible. In common with CPU PMUs the perf driver
allows for direct specification of events to count. The same issue with
implementation specific counters occurs in CPUs. Whilst you can keep
the meaning of events opaque, if you do you loose out on useable general
purpose software (e.g. perftool). Alternatively some of this can be
pushed into perf tool description files.
The advantage of pushing counters in to the main specification is that
they work out of the box as the CPMU driver will report those directly
(no need for perf tool to work with raw event codes).
At the moment that driver implements part of the CPMU spec. If there are
features you need (free running counters come to mind) then shout
/ patches welcome. We decided to play wait and see with the CPMU driver as
it wasn't clear if the full flexibility was needed for real devices.

>
> d) Logs, errors and debug information. In addition to spec defined
> logging of CXL topology errors, specific designs will have logs, crash
> dumps, debug data that is very specific to a implementation. There are
> likely to be cases where a product that conforms to a specification like
> CXL, may have features that don't directly have anything to do with CXL,
> but where a standards based management interface can be used to configure,
> manage, and collect data for a non-CXL feature.

There are standard interfaces defined for some of this stuff as well.
Last I checked the discussion revolved around component state dump
only being safe to use if the background command abort was supported,
as that was needed to ensure the kernel was not locked out for an indefinite
amount of time.

If you are looking at other standards being run to configure CXL devices
excellent, but those standards should be accessed using whatever the
appropriate kernel interfaces. Could you give some examples for this one?

>
> e) Innovation. I believe that innovation should be encouraged. There may
> be designs that support CXL, but that also incorporate unique and
> innovative features or functions that might service a niche market. The
> AI space is ripe for innovation and perhaps specialized features that may
> not make sense for the overall CXL specification.

Agreed - but those may need specific drivers.

>
> I think that in most cases Vendor specific opcodes are not used to
> circumvent the standards, but are used when the standards group has no
> interested in driving into the standard certain features that are clearly
> either implementation specific or are vendor specific additions that have
> a specific appeal to a select class of customer, but yet are not relevant
> to a specific standard.
>
> At the end of the day, customer want products that solve a specific
> problem. Sometimes vendor can address market segments or niches that a
> standard group has no interest in supporting. It can also take months,
> and in some cases years to reach an agreement on what standardized feature
> should look like. I also believe that there can be competitive reasons
> why there might be a group that wants to slow down a vendor's
> implementation for fear of losing market share.

Whilst I appreciate the way this can slow down adoption / kernel support
it's a path that is still worth it in the end.

As Dan and others have put it, there are other routes than the main
CXL standard. Those should allow tighter collaboration on smaller topics
to define common standards.

Thanks,

Jonathan

>
> Thanks
> Harold Johnson
>
>
> -----Original Message-----
> From: Jonathan Cameron [mailto:[email protected]]
> Sent: Friday, April 26, 2024 11:54 AM
> To: Dan Williams
> Cc: [email protected]; Sreenivas Bagalkote; Brett Henning; Harold
> Johnson; Sumanesh Samanta; [email protected]; Davidlohr Bueso;
> Dave Jiang; Alison Schofield; Vishal Verma; Ira Weiny;
> [email protected]; [email protected]; Lorenzo Pieralisi; Natu,
> Mahesh; [email protected]
> Subject: Re: RFC: Restricting userspace interfaces for CXL fabric
> management
>
> On Fri, 26 Apr 2024 09:16:44 -0700
> Dan Williams <[email protected]> wrote:
>
> > Jonathan Cameron wrote:
> > [..]
> > > To give people an incentive to play the standards game we have to
> > > provide an alternative. Userspace libraries will provide some
> incentive
> > > to standardize if we have enough vendors (we don't today - so they
> will
> > > do their own libraries), but it is a lot easier to encourage if we
> > > exercise control over the interface.
> >
> > Yes, and I expect you and I are not far off on what can be done
> > here.
> >
> > However, lets cut to a sentiment hanging over this discussion. Referring
> > to vendor specific commands:
> >
> > "CXL spec has them for a reason and they need to be supported."
> >
> > ...that is an aggressive "vendor specific first" sentiment that
> > generates an aggressive "userspace drivers" reaction, because the best
> > way to get around community discussions about what ABI makes sense is
> > userspace drivers.
> >
> > Now, if we can step back to where this discussion started, where typical
> > Linux collaboration shines, and where I think you and I are more aligned
> > than this thread would indicate, is "vendor specific last". Lets
> > carefully consider the vendor specific commands that are candidates to
> > be de facto cross vendor semantics if not de jure standards.
> >
>
> Agreed. I'd go a little further and say I generally have much more warm
> and
> fuzzy feelings when what is a vendor defined command (today) maps to more
> or less the same bit of code for a proposed standards ECN.
>
> IP rules prevent us commenting on specific proposals, but there will be
> things we review quicker and with a lighter touch vs others where we
> ask lots of annoying questions about generality of the feature etc.
> Given the effort we are putting in on the kernel side we all want CXL
> to succeed and will do our best to encourage activities that make that
> more likely. There are other standards bodies available... which may
> make more sense for some features.
>
> Command interfaces are not a good place to compete and maintain secrecy.
> If vendors want to do that, then they don't get the pony of upstream
> support. They get to convince distros to do a custom kernel build for
> them:
> Good luck with that, some of those folk are 'blunt' in their responses to
> such requests.
>
> My proposal is we go forward with a bunch of the CXL spec defined commands
> to show the 'how' and consider specific proposals for upstream support
> of vendor defined commands on a case by case basis (so pretty much
> what you say above). Maybe after a few are done we can formalize some
> rules of thumb help vendors makes such proposals, though maybe some
> will figure out it is a better and longer term solution to do 'standards
> first development'.
>
> I think we do need to look at the safety filtering of tunneled
> commands but don't see that as a particularly tricky addition -
> for the simple non destructive commands at least.
>
> Jonathan
>