Hi all,
This series adds the PCIe bandwidth controller (bwctrl) and an associated
PCIe cooling driver on the thermal core side for limiting PCIe Link Speed
due to thermal reasons. The PCIe bandwidth controller is a PCI Express bus
port service driver. A cooling device is created for each port the service
driver finds, provided the port supports changing Link Speeds.
This series only adds support for controlling PCIe Link Speed.
Controlling PCIe Link Width might also be useful but, AFAIK, there is no
mechanism for that until PCIe 6.0 (L0p). Based on feedback for v1, the
thermal/cooling device side prefers Link Speed and Link Width to be
separate cooling devices [1]. This is taken into account in the naming of
the Link Speed cooling device (PCIe_Port_Link_Speed_<device>), but the
Link Width cooling device is not added yet as it would not be able to
control anything at the moment.
bwctrl is built on top of a revert of the BW notifications removal. I'm
just not sure what the best practice is when re-adding old functionality
in a modified form, so please let me know if I need to somehow alter that
patch.
[1] https://lore.kernel.org/linux-pci/[email protected]/
v3:
- Correct hfi1 shortlog prefix
- Improve error prints in hfi1
- Add L: linux-pci to the MAINTAINERS entry
v2:
- Added LNKCTL2 to the RMW safe list in Documentation/PCI/pciebus-howto.rst
- Renamed cooling devices from PCIe_Port_* to PCIe_Port_Link_Speed_* in
  order to allow for the possibility of adding Link Width cooling devices
  later on
- Moved struct thermal_cooling_device declaration to the correct patch
- Small tweaks to Kconfig texts
- Series rebased to resolve conflict (in the selftest list)
Ilpo Järvinen (10):
PCI: Protect Link Control 2 Register with RMW locking
drm/radeon: Use RMW accessors for changing LNKCTL2
drm/amdgpu: Use RMW accessors for changing LNKCTL2
RDMA/hfi1: Use RMW accessors for changing LNKCTL2
PCI: Store all PCIe Supported Link Speeds
PCI: Cache PCIe device's Supported Speed Vector
PCI/LINK: Re-add BW notification portdrv as PCIe BW controller
PCI/bwctrl: Add "controller" part into PCIe bwctrl
thermal: Add PCIe cooling driver
selftests/pcie_bwctrl: Create selftests
Documentation/PCI/pciebus-howto.rst | 8 +-
MAINTAINERS | 9 +
drivers/gpu/drm/amd/amdgpu/cik.c | 41 +--
drivers/gpu/drm/amd/amdgpu/si.c | 41 +--
drivers/gpu/drm/radeon/cik.c | 40 +--
drivers/gpu/drm/radeon/si.c | 40 +--
drivers/infiniband/hw/hfi1/pcie.c | 30 +-
drivers/pci/pcie/Kconfig | 9 +
drivers/pci/pcie/Makefile | 1 +
drivers/pci/pcie/bwctrl.c | 309 ++++++++++++++++++
drivers/pci/pcie/portdrv.c | 9 +-
drivers/pci/pcie/portdrv.h | 10 +-
drivers/pci/probe.c | 38 ++-
drivers/pci/remove.c | 2 +
drivers/thermal/Kconfig | 10 +
drivers/thermal/Makefile | 2 +
drivers/thermal/pcie_cooling.c | 107 ++++++
include/linux/pci-bwctrl.h | 33 ++
include/linux/pci.h | 3 +
include/uapi/linux/pci_regs.h | 1 +
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/pcie_bwctrl/Makefile | 2 +
.../pcie_bwctrl/set_pcie_cooling_state.sh | 122 +++++++
.../selftests/pcie_bwctrl/set_pcie_speed.sh | 67 ++++
24 files changed, 800 insertions(+), 135 deletions(-)
create mode 100644 drivers/pci/pcie/bwctrl.c
create mode 100644 drivers/thermal/pcie_cooling.c
create mode 100644 include/linux/pci-bwctrl.h
create mode 100644 tools/testing/selftests/pcie_bwctrl/Makefile
create mode 100755 tools/testing/selftests/pcie_bwctrl/set_pcie_cooling_state.sh
create mode 100755 tools/testing/selftests/pcie_bwctrl/set_pcie_speed.sh
--
2.30.2
Add "controller" parts into PCIe bwctrl for limiting PCIe Link Speed
(due to thermal reasons).
PCIe bandwidth controller introduces an in-kernel API to set PCIe Link
Speed. This new API is intended to be used in an upcoming commit that
adds a thermal cooling device to throttle PCIe bandwidth when thermal
thresholds are reached. No users are introduced in this commit yet.
The PCIe bandwidth control procedure is as follows. The requested speed
is validated against the Link Speeds supported by the port and the
downstream device. The bandwidth controller then sets the Target Link
Speed in the Link Control 2 Register and retrains the PCIe Link.
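For illustration, a prospective in-kernel user might call the new API
roughly as follows (hypothetical sketch; throttle_port() and the chosen
speed are made up, the real user arrives with the cooling device patch):

	#include <linux/pci-bwctrl.h>

	/* Hypothetical helper: cap the Link Speed of the port serviced by
	 * @srv to at most 5 GT/s. bwctrl validates the request against the
	 * Supported Link Speeds of both link partners and may select a
	 * lower speed before retraining the link.
	 */
	static int throttle_port(struct pcie_device *srv)
	{
		int ret;

		ret = bwctrl_set_current_speed(srv, PCIE_SPEED_5_0GT);
		if (ret)
			pci_err(srv->port, "failed to limit Link Speed: %d\n",
				ret);
		return ret;
	}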
Bandwidth notifications enable cur_bus_speed in struct pci_bus to keep
track of PCIe Link Speed changes. This keeps the link speed seen through
sysfs correct (both for the PCI device and the thermal cooling device).
While bandwidth notifications should also be generated when the bandwidth
controller alters the PCIe Link Speed, a few platforms do not deliver the
LBMS interrupt after Link Training as expected. Thus, after changing the
Link Speed, the bandwidth controller makes an additional read of the Link
Status Register to ensure cur_bus_speed is consistent with the new PCIe
Link Speed.
Signed-off-by: Ilpo Järvinen <[email protected]>
---
MAINTAINERS | 7 ++
drivers/pci/pcie/Kconfig | 9 +-
drivers/pci/pcie/bwctrl.c | 177 +++++++++++++++++++++++++++++++++++--
include/linux/pci-bwctrl.h | 17 ++++
4 files changed, 201 insertions(+), 9 deletions(-)
create mode 100644 include/linux/pci-bwctrl.h
diff --git a/MAINTAINERS b/MAINTAINERS
index 90f13281d297..2e4ad074eaf3 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -16569,6 +16569,13 @@ F: include/linux/pci*
F: include/uapi/linux/pci*
F: lib/pci*
+PCIE BANDWIDTH CONTROLLER
+M: Ilpo Järvinen <[email protected]>
+L: [email protected]
+S: Supported
+F: drivers/pci/pcie/bwctrl.c
+F: include/linux/pci-bwctrl.h
+
PCIE DRIVER FOR AMAZON ANNAPURNA LABS
M: Jonathan Chocron <[email protected]>
L: [email protected]
diff --git a/drivers/pci/pcie/Kconfig b/drivers/pci/pcie/Kconfig
index 1ef8073fa89a..1c6509cf169a 100644
--- a/drivers/pci/pcie/Kconfig
+++ b/drivers/pci/pcie/Kconfig
@@ -138,12 +138,13 @@ config PCIE_PTM
is safe to enable even if you don't.
config PCIE_BW
- bool "PCI Express Bandwidth Change Notification"
+ bool "PCI Express Bandwidth Controller"
depends on PCIEPORTBUS
help
- This enables PCI Express Bandwidth Change Notification. If
- you know link width or rate changes occur to correct unreliable
- links, you may answer Y.
+ This enables PCI Express Bandwidth Controller. The Bandwidth
+ Controller allows controlling PCIe link speed and listens for Link
+ Speed Change Notifications. If you know link width or rate changes
+ occur to correct unreliable links, you may answer Y.
config PCIE_EDR
bool "PCI Express Error Disconnect Recover support"
diff --git a/drivers/pci/pcie/bwctrl.c b/drivers/pci/pcie/bwctrl.c
index 4fc6718fc0e5..e3172d69476f 100644
--- a/drivers/pci/pcie/bwctrl.c
+++ b/drivers/pci/pcie/bwctrl.c
@@ -1,14 +1,16 @@
// SPDX-License-Identifier: GPL-2.0+
/*
- * PCI Express Link Bandwidth Notification services driver
+ * PCIe bandwidth controller
+ *
* Author: Alexandru Gagniuc <[email protected]>
*
* Copyright (C) 2019, Dell Inc
+ * Copyright (C) 2023 Intel Corporation.
*
- * The PCIe Link Bandwidth Notification provides a way to notify the
- * operating system when the link width or data rate changes. This
- * capability is required for all root ports and downstream ports
- * supporting links wider than x1 and/or multiple link speeds.
+ * The PCIe Bandwidth Controller provides a way to alter PCIe link speeds
+ * and notify the operating system when the link width or data rate changes.
+ * The notification capability is required for all Root Ports and Downstream
+ * Ports supporting links wider than x1 and/or multiple link speeds.
*
* This service port driver hooks into the bandwidth notification interrupt
* watching for link speed changes or links becoming degraded in operation
@@ -17,9 +19,48 @@
#define dev_fmt(fmt) "bwctrl: " fmt
+#include <linux/bitops.h>
+#include <linux/errno.h>
+#include <linux/mutex.h>
+#include <linux/pci.h>
+#include <linux/pci-bwctrl.h>
+#include <linux/types.h>
+
+#include <asm/rwonce.h>
+
#include "../pci.h"
#include "portdrv.h"
+/**
+ * struct bwctrl_service_data - PCIe Port Bandwidth Controller
+ * @set_speed_mutex: serializes link speed changes
+ */
+struct bwctrl_service_data {
+ struct mutex set_speed_mutex;
+};
+
+static bool bwctrl_valid_pcie_speed(enum pci_bus_speed speed)
+{
+ return (speed >= PCIE_SPEED_2_5GT) && (speed <= PCIE_SPEED_64_0GT);
+}
+
+static u16 speed2lnkctl2(enum pci_bus_speed speed)
+{
+ static const u16 speed_conv[] = {
+ [PCIE_SPEED_2_5GT] = PCI_EXP_LNKCTL2_TLS_2_5GT,
+ [PCIE_SPEED_5_0GT] = PCI_EXP_LNKCTL2_TLS_5_0GT,
+ [PCIE_SPEED_8_0GT] = PCI_EXP_LNKCTL2_TLS_8_0GT,
+ [PCIE_SPEED_16_0GT] = PCI_EXP_LNKCTL2_TLS_16_0GT,
+ [PCIE_SPEED_32_0GT] = PCI_EXP_LNKCTL2_TLS_32_0GT,
+ [PCIE_SPEED_64_0GT] = PCI_EXP_LNKCTL2_TLS_64_0GT,
+ };
+
+ if (WARN_ON_ONCE(!bwctrl_valid_pcie_speed(speed)))
+ return 0;
+
+ return speed_conv[speed];
+}
+
static bool pcie_link_bandwidth_notification_supported(struct pci_dev *dev)
{
int ret;
@@ -77,8 +118,118 @@ static irqreturn_t pcie_bw_notification_irq(int irq, void *context)
return IRQ_HANDLED;
}
+/* Set the Target Link Speed in LNKCTL2 to the requested speed */
+static int bwctrl_set_speed(struct pci_dev *port, u16 lnkctl2_speed)
+{
+ int ret;
+
+ ret = pcie_capability_clear_and_set_word(port, PCI_EXP_LNKCTL2,
+ PCI_EXP_LNKCTL2_TLS, lnkctl2_speed);
+ if (ret != PCIBIOS_SUCCESSFUL)
+ return pcibios_err_to_errno(ret);
+
+ return 0;
+}
+
+static int bwctrl_select_speed(struct pcie_device *srv, enum pci_bus_speed *speed)
+{
+ struct pci_bus *bus = srv->port->subordinate;
+ u8 speeds, dev_speeds;
+ int i;
+
+ if (*speed > PCIE_LNKCAP2_SLS2SPEED(bus->pcie_bus_speeds))
+ return -EINVAL;
+
+ dev_speeds = READ_ONCE(bus->pcie_dev_speeds);
+ /* Only the lowest speed can be set when there are no devices */
+ if (!dev_speeds)
+ dev_speeds = PCI_EXP_LNKCAP2_SLS_2_5GB;
+
+ /*
+ * Implementation Note in PCIe r6.0.1 sec 7.5.3.18 recommends OS to
+ * utilize Supported Link Speeds vector for determining which link
+ * speeds are supported.
+ *
+ * Take into account Supported Link Speeds both from the Root Port
+ * and the device.
+ */
+ speeds = bus->pcie_bus_speeds & dev_speeds;
+ i = BIT(fls(speeds));
+ while (i >= PCI_EXP_LNKCAP2_SLS_2_5GB) {
+ enum pci_bus_speed candidate;
+
+ if (speeds & i) {
+ candidate = PCIE_LNKCAP2_SLS2SPEED(i);
+ if (candidate <= *speed) {
+ *speed = candidate;
+ return 0;
+ }
+ }
+ i >>= 1;
+ }
+
+ return -EINVAL;
+}
+
+/**
+ * bwctrl_set_current_speed - Set downstream link speed for PCIe port
+ * @srv: PCIe port
+ * @speed: PCIe bus speed to set
+ *
+ * Attempts to set PCIe port link speed to @speed. As long as @speed is less
+ * than the maximum of what is supported by @srv, the speed is adjusted
+ * downwards to the best speed supported by both the port and device
+ * underneath it.
+ *
+ * Return:
+ * * 0 - on success
+ * * -EINVAL - @speed is higher than the maximum @srv supports
+ * * -ETIMEDOUT - changing link speed took too long
+ * * -EAGAIN - link speed was changed but @speed was not achieved
+ */
+int bwctrl_set_current_speed(struct pcie_device *srv, enum pci_bus_speed speed)
+{
+ struct bwctrl_service_data *data = get_service_data(srv);
+ struct pci_dev *port = srv->port;
+ u16 link_status;
+ int ret;
+
+ if (WARN_ON_ONCE(!bwctrl_valid_pcie_speed(speed)))
+ return -EINVAL;
+
+ ret = bwctrl_select_speed(srv, &speed);
+ if (ret < 0)
+ return ret;
+
+ mutex_lock(&data->set_speed_mutex);
+ ret = bwctrl_set_speed(port, speed2lnkctl2(speed));
+ if (ret < 0)
+ goto unlock;
+
+ ret = pcie_retrain_link(port, true);
+ if (ret < 0)
+ goto unlock;
+
+ /*
+ * Ensure link speed updates also with platforms that have problems
+ * with notifications
+ */
+ ret = pcie_capability_read_word(port, PCI_EXP_LNKSTA, &link_status);
+ if (ret == PCIBIOS_SUCCESSFUL)
+ pcie_update_link_speed(port->subordinate, link_status);
+
+ if (port->subordinate->cur_bus_speed != speed)
+ ret = -EAGAIN;
+
+unlock:
+ mutex_unlock(&data->set_speed_mutex);
+
+ return ret;
+}
+
static int pcie_bandwidth_notification_probe(struct pcie_device *srv)
{
+ struct bwctrl_service_data *data;
struct pci_dev *port = srv->port;
int ret;
@@ -91,16 +242,32 @@ static int pcie_bandwidth_notification_probe(struct pcie_device *srv)
if (ret)
return ret;
+ data = kzalloc(sizeof(*data), GFP_KERNEL);
+ if (!data) {
+ ret = -ENOMEM;
+ goto free_irq;
+ }
+ mutex_init(&data->set_speed_mutex);
+ set_service_data(srv, data);
+
pcie_enable_link_bandwidth_notification(port);
pci_info(port, "enabled with IRQ %d\n", srv->irq);
return 0;
+
+free_irq:
+ free_irq(srv->irq, srv);
+ return ret;
}
static void pcie_bandwidth_notification_remove(struct pcie_device *srv)
{
+ struct bwctrl_service_data *data = get_service_data(srv);
+
pcie_disable_link_bandwidth_notification(srv->port);
free_irq(srv->irq, srv);
+ mutex_destroy(&data->set_speed_mutex);
+ kfree(data);
}
static int pcie_bandwidth_notification_suspend(struct pcie_device *srv)
diff --git a/include/linux/pci-bwctrl.h b/include/linux/pci-bwctrl.h
new file mode 100644
index 000000000000..8eae09bd03b5
--- /dev/null
+++ b/include/linux/pci-bwctrl.h
@@ -0,0 +1,17 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * PCIe bandwidth controller
+ *
+ * Copyright (C) 2023 Intel Corporation.
+ */
+
+#ifndef LINUX_PCI_BWCTRL_H
+#define LINUX_PCI_BWCTRL_H
+
+#include <linux/pci.h>
+
+struct pcie_device;
+
+int bwctrl_set_current_speed(struct pcie_device *srv, enum pci_bus_speed speed);
+
+#endif
--
2.30.2
Don't assume that only the driver itself would be accessing LNKCTL2. In
the case of the upstream bridge (parent), the driver does not even own the
device whose registers it is changing.
Use RMW capability accessors which do proper locking to avoid losing
concurrent updates to the register value. This change is also useful as
a cleanup.
Suggested-by: Lukas Wunner <[email protected]>
Signed-off-by: Ilpo Järvinen <[email protected]>
---
drivers/gpu/drm/amd/amdgpu/cik.c | 41 ++++++++++++--------------------
drivers/gpu/drm/amd/amdgpu/si.c | 41 ++++++++++++--------------------
2 files changed, 30 insertions(+), 52 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/cik.c b/drivers/gpu/drm/amd/amdgpu/cik.c
index e63abdf52b6c..7bcd41996927 100644
--- a/drivers/gpu/drm/amd/amdgpu/cik.c
+++ b/drivers/gpu/drm/amd/amdgpu/cik.c
@@ -1638,28 +1638,18 @@ static void cik_pcie_gen3_enable(struct amdgpu_device *adev)
PCI_EXP_LNKCTL_HAWD);
/* linkctl2 */
- pcie_capability_read_word(root, PCI_EXP_LNKCTL2,
- &tmp16);
- tmp16 &= ~(PCI_EXP_LNKCTL2_ENTER_COMP |
- PCI_EXP_LNKCTL2_TX_MARGIN);
- tmp16 |= (bridge_cfg2 &
- (PCI_EXP_LNKCTL2_ENTER_COMP |
- PCI_EXP_LNKCTL2_TX_MARGIN));
- pcie_capability_write_word(root,
- PCI_EXP_LNKCTL2,
- tmp16);
-
- pcie_capability_read_word(adev->pdev,
- PCI_EXP_LNKCTL2,
- &tmp16);
- tmp16 &= ~(PCI_EXP_LNKCTL2_ENTER_COMP |
- PCI_EXP_LNKCTL2_TX_MARGIN);
- tmp16 |= (gpu_cfg2 &
- (PCI_EXP_LNKCTL2_ENTER_COMP |
- PCI_EXP_LNKCTL2_TX_MARGIN));
- pcie_capability_write_word(adev->pdev,
- PCI_EXP_LNKCTL2,
- tmp16);
+ pcie_capability_clear_and_set_word(root, PCI_EXP_LNKCTL2,
+ PCI_EXP_LNKCTL2_ENTER_COMP |
+ PCI_EXP_LNKCTL2_TX_MARGIN,
+ bridge_cfg2 &
+ (PCI_EXP_LNKCTL2_ENTER_COMP |
+ PCI_EXP_LNKCTL2_TX_MARGIN));
+ pcie_capability_clear_and_set_word(adev->pdev, PCI_EXP_LNKCTL2,
+ PCI_EXP_LNKCTL2_ENTER_COMP |
+ PCI_EXP_LNKCTL2_TX_MARGIN,
+ gpu_cfg2 &
+ (PCI_EXP_LNKCTL2_ENTER_COMP |
+ PCI_EXP_LNKCTL2_TX_MARGIN));
tmp = RREG32_PCIE(ixPCIE_LC_CNTL4);
tmp &= ~PCIE_LC_CNTL4__LC_SET_QUIESCE_MASK;
@@ -1674,16 +1664,15 @@ static void cik_pcie_gen3_enable(struct amdgpu_device *adev)
speed_cntl &= ~PCIE_LC_SPEED_CNTL__LC_FORCE_DIS_SW_SPEED_CHANGE_MASK;
WREG32_PCIE(ixPCIE_LC_SPEED_CNTL, speed_cntl);
- pcie_capability_read_word(adev->pdev, PCI_EXP_LNKCTL2, &tmp16);
- tmp16 &= ~PCI_EXP_LNKCTL2_TLS;
-
+ tmp16 = 0;
if (adev->pm.pcie_gen_mask & CAIL_PCIE_LINK_SPEED_SUPPORT_GEN3)
tmp16 |= PCI_EXP_LNKCTL2_TLS_8_0GT; /* gen3 */
else if (adev->pm.pcie_gen_mask & CAIL_PCIE_LINK_SPEED_SUPPORT_GEN2)
tmp16 |= PCI_EXP_LNKCTL2_TLS_5_0GT; /* gen2 */
else
tmp16 |= PCI_EXP_LNKCTL2_TLS_2_5GT; /* gen1 */
- pcie_capability_write_word(adev->pdev, PCI_EXP_LNKCTL2, tmp16);
+ pcie_capability_clear_and_set_word(adev->pdev, PCI_EXP_LNKCTL2,
+ PCI_EXP_LNKCTL2_TLS, tmp16);
speed_cntl = RREG32_PCIE(ixPCIE_LC_SPEED_CNTL);
speed_cntl |= PCIE_LC_SPEED_CNTL__LC_INITIATE_LINK_SPEED_CHANGE_MASK;
diff --git a/drivers/gpu/drm/amd/amdgpu/si.c b/drivers/gpu/drm/amd/amdgpu/si.c
index 4b81f29e5fd5..8ea60fdd1b1d 100644
--- a/drivers/gpu/drm/amd/amdgpu/si.c
+++ b/drivers/gpu/drm/amd/amdgpu/si.c
@@ -2331,28 +2331,18 @@ static void si_pcie_gen3_enable(struct amdgpu_device *adev)
gpu_cfg &
PCI_EXP_LNKCTL_HAWD);
- pcie_capability_read_word(root, PCI_EXP_LNKCTL2,
- &tmp16);
- tmp16 &= ~(PCI_EXP_LNKCTL2_ENTER_COMP |
- PCI_EXP_LNKCTL2_TX_MARGIN);
- tmp16 |= (bridge_cfg2 &
- (PCI_EXP_LNKCTL2_ENTER_COMP |
- PCI_EXP_LNKCTL2_TX_MARGIN));
- pcie_capability_write_word(root,
- PCI_EXP_LNKCTL2,
- tmp16);
-
- pcie_capability_read_word(adev->pdev,
- PCI_EXP_LNKCTL2,
- &tmp16);
- tmp16 &= ~(PCI_EXP_LNKCTL2_ENTER_COMP |
- PCI_EXP_LNKCTL2_TX_MARGIN);
- tmp16 |= (gpu_cfg2 &
- (PCI_EXP_LNKCTL2_ENTER_COMP |
- PCI_EXP_LNKCTL2_TX_MARGIN));
- pcie_capability_write_word(adev->pdev,
- PCI_EXP_LNKCTL2,
- tmp16);
+ pcie_capability_clear_and_set_word(root, PCI_EXP_LNKCTL2,
+ PCI_EXP_LNKCTL2_ENTER_COMP |
+ PCI_EXP_LNKCTL2_TX_MARGIN,
+ bridge_cfg2 &
+ (PCI_EXP_LNKCTL2_ENTER_COMP |
+ PCI_EXP_LNKCTL2_TX_MARGIN));
+ pcie_capability_clear_and_set_word(adev->pdev, PCI_EXP_LNKCTL2,
+ PCI_EXP_LNKCTL2_ENTER_COMP |
+ PCI_EXP_LNKCTL2_TX_MARGIN,
+ gpu_cfg2 &
+ (PCI_EXP_LNKCTL2_ENTER_COMP |
+ PCI_EXP_LNKCTL2_TX_MARGIN));
tmp = RREG32_PCIE_PORT(PCIE_LC_CNTL4);
tmp &= ~LC_SET_QUIESCE;
@@ -2365,16 +2355,15 @@ static void si_pcie_gen3_enable(struct amdgpu_device *adev)
speed_cntl &= ~LC_FORCE_DIS_SW_SPEED_CHANGE;
WREG32_PCIE_PORT(PCIE_LC_SPEED_CNTL, speed_cntl);
- pcie_capability_read_word(adev->pdev, PCI_EXP_LNKCTL2, &tmp16);
- tmp16 &= ~PCI_EXP_LNKCTL2_TLS;
-
+ tmp16 = 0;
if (adev->pm.pcie_gen_mask & CAIL_PCIE_LINK_SPEED_SUPPORT_GEN3)
tmp16 |= PCI_EXP_LNKCTL2_TLS_8_0GT; /* gen3 */
else if (adev->pm.pcie_gen_mask & CAIL_PCIE_LINK_SPEED_SUPPORT_GEN2)
tmp16 |= PCI_EXP_LNKCTL2_TLS_5_0GT; /* gen2 */
else
tmp16 |= PCI_EXP_LNKCTL2_TLS_2_5GT; /* gen1 */
- pcie_capability_write_word(adev->pdev, PCI_EXP_LNKCTL2, tmp16);
+ pcie_capability_clear_and_set_word(adev->pdev, PCI_EXP_LNKCTL2,
+ PCI_EXP_LNKCTL2_TLS, tmp16);
speed_cntl = RREG32_PCIE_PORT(PCIE_LC_SPEED_CNTL);
speed_cntl |= LC_INITIATE_LINK_SPEED_CHANGE;
--
2.30.2
Add a thermal cooling driver to provide a path for accessing the PCIe
bandwidth controller through the usual thermal interfaces. A cooling
device is instantiated by the bwctrl service driver for each controllable
PCIe port.
On the thermal side, state 0 means no throttling, i.e., the maximum
supported PCIe speed.
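As a concrete example of the mapping (assuming a hypothetical port whose
max_bus_speed is 16 GT/s, making max_state 3):

	/* speed = (cdev->max_state - state) + PCIE_SPEED_2_5GT, so:
	 *   state 0 -> PCIE_SPEED_16_0GT (no throttling)
	 *   state 1 -> PCIE_SPEED_8_0GT
	 *   state 2 -> PCIE_SPEED_5_0GT
	 *   state 3 -> PCIE_SPEED_2_5GT (maximum throttling)
	 */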
Signed-off-by: Ilpo Järvinen <[email protected]>
Acked-by: Rafael J. Wysocki <[email protected]> # From the cooling device interface perspective
---
MAINTAINERS | 1 +
drivers/pci/pcie/bwctrl.c | 11 ++++
drivers/thermal/Kconfig | 10 +++
drivers/thermal/Makefile | 2 +
drivers/thermal/pcie_cooling.c | 107 +++++++++++++++++++++++++++++++++
include/linux/pci-bwctrl.h | 16 +++++
6 files changed, 147 insertions(+)
create mode 100644 drivers/thermal/pcie_cooling.c
diff --git a/MAINTAINERS b/MAINTAINERS
index 2e4ad074eaf3..40dd7c0b154d 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -16574,6 +16574,7 @@ M: Ilpo Järvinen <[email protected]>
L: [email protected]
S: Supported
F: drivers/pci/pcie/bwctrl.c
+F: drivers/thermal/pcie_cooling.c
F: include/linux/pci-bwctrl.h
PCIE DRIVER FOR AMAZON ANNAPURNA LABS
diff --git a/drivers/pci/pcie/bwctrl.c b/drivers/pci/pcie/bwctrl.c
index e3172d69476f..13c73546244e 100644
--- a/drivers/pci/pcie/bwctrl.c
+++ b/drivers/pci/pcie/bwctrl.c
@@ -34,9 +34,11 @@
/**
* struct bwctrl_service_data - PCIe Port Bandwidth Controller
* @set_speed_mutex: serializes link speed changes
+ * @cdev: thermal cooling device associated with the port
*/
struct bwctrl_service_data {
struct mutex set_speed_mutex;
+ struct thermal_cooling_device *cdev;
};
static bool bwctrl_valid_pcie_speed(enum pci_bus_speed speed)
@@ -253,8 +255,16 @@ static int pcie_bandwidth_notification_probe(struct pcie_device *srv)
pcie_enable_link_bandwidth_notification(port);
pci_info(port, "enabled with IRQ %d\n", srv->irq);
+ data->cdev = pcie_cooling_device_register(port, srv);
+ if (IS_ERR(data->cdev)) {
+ ret = PTR_ERR(data->cdev);
+ goto disable_notifications;
+ }
return 0;
+disable_notifications:
+ pcie_disable_link_bandwidth_notification(srv->port);
+ kfree(data);
free_irq:
free_irq(srv->irq, srv);
return ret;
@@ -264,6 +274,7 @@ static void pcie_bandwidth_notification_remove(struct pcie_device *srv)
{
struct bwctrl_service_data *data = get_service_data(srv);
+ pcie_cooling_device_unregister(data->cdev);
pcie_disable_link_bandwidth_notification(srv->port);
free_irq(srv->irq, srv);
mutex_destroy(&data->set_speed_mutex);
diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
index c81a00fbca7d..3a071396f1c6 100644
--- a/drivers/thermal/Kconfig
+++ b/drivers/thermal/Kconfig
@@ -219,6 +219,16 @@ config DEVFREQ_THERMAL
If you want this support, you should say Y here.
+config PCIE_THERMAL
+ bool "PCIe cooling support"
+ depends on PCIEPORTBUS
+ select PCIE_BW
+ help
+ This implements PCIe cooling mechanism through bandwidth reduction
+ for PCIe devices.
+
+ If you want this support, you should say Y here.
+
config THERMAL_EMULATION
bool "Thermal emulation mode support"
help
diff --git a/drivers/thermal/Makefile b/drivers/thermal/Makefile
index c934cab309ae..a0b25a2742b7 100644
--- a/drivers/thermal/Makefile
+++ b/drivers/thermal/Makefile
@@ -30,6 +30,8 @@ thermal_sys-$(CONFIG_CPU_IDLE_THERMAL) += cpuidle_cooling.o
# devfreq cooling
thermal_sys-$(CONFIG_DEVFREQ_THERMAL) += devfreq_cooling.o
+thermal_sys-$(CONFIG_PCIE_THERMAL) += pcie_cooling.o
+
obj-$(CONFIG_K3_THERMAL) += k3_bandgap.o k3_j72xx_bandgap.o
# platform thermal drivers
obj-y += broadcom/
diff --git a/drivers/thermal/pcie_cooling.c b/drivers/thermal/pcie_cooling.c
new file mode 100644
index 000000000000..c23b59dd0331
--- /dev/null
+++ b/drivers/thermal/pcie_cooling.c
@@ -0,0 +1,107 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * PCIe cooling device
+ *
+ * Copyright (C) 2023 Intel Corporation.
+ */
+
+#include <linux/build_bug.h>
+#include <linux/err.h>
+#include <linux/kernel.h>
+#include <linux/pci.h>
+#include <linux/pci-bwctrl.h>
+#include <linux/slab.h>
+#include <linux/string.h>
+#include <linux/thermal.h>
+
+#define COOLING_DEV_TYPE_PREFIX "PCIe_Port_Link_Speed_"
+
+struct pcie_cooling_device {
+ struct pci_dev *port;
+ struct pcie_device *pdev;
+};
+
+static int pcie_cooling_get_max_level(struct thermal_cooling_device *cdev, unsigned long *state)
+{
+ struct pcie_cooling_device *pcie_cdev = cdev->devdata;
+
+ /* cooling state 0 is same as the maximum PCIe speed */
+ *state = pcie_cdev->port->subordinate->max_bus_speed - PCIE_SPEED_2_5GT;
+
+ return 0;
+}
+
+static int pcie_cooling_get_cur_level(struct thermal_cooling_device *cdev, unsigned long *state)
+{
+ struct pcie_cooling_device *pcie_cdev = cdev->devdata;
+
+ /* cooling state 0 is same as the maximum PCIe speed */
+ *state = cdev->max_state -
+ (pcie_cdev->port->subordinate->cur_bus_speed - PCIE_SPEED_2_5GT);
+
+ return 0;
+}
+
+static int pcie_cooling_set_cur_level(struct thermal_cooling_device *cdev, unsigned long state)
+{
+ struct pcie_cooling_device *pcie_cdev = cdev->devdata;
+ enum pci_bus_speed speed;
+
+ /* cooling state 0 is same as the maximum PCIe speed */
+ speed = (cdev->max_state - state) + PCIE_SPEED_2_5GT;
+
+ return bwctrl_set_current_speed(pcie_cdev->pdev, speed);
+}
+
+static struct thermal_cooling_device_ops pcie_cooling_ops = {
+ .get_max_state = pcie_cooling_get_max_level,
+ .get_cur_state = pcie_cooling_get_cur_level,
+ .set_cur_state = pcie_cooling_set_cur_level,
+};
+
+struct thermal_cooling_device *pcie_cooling_device_register(struct pci_dev *port,
+ struct pcie_device *pdev)
+{
+ struct pcie_cooling_device *pcie_cdev;
+ struct thermal_cooling_device *cdev;
+ size_t name_len;
+ char *name;
+
+ pcie_cdev = kzalloc(sizeof(*pcie_cdev), GFP_KERNEL);
+ if (!pcie_cdev)
+ return ERR_PTR(-ENOMEM);
+
+ pcie_cdev->port = port;
+ pcie_cdev->pdev = pdev;
+
+ name_len = strlen(COOLING_DEV_TYPE_PREFIX) + strlen(pci_name(port)) + 1;
+ name = kzalloc(name_len, GFP_KERNEL);
+ if (!name) {
+ kfree(pcie_cdev);
+ return ERR_PTR(-ENOMEM);
+ }
+
+ snprintf(name, name_len, COOLING_DEV_TYPE_PREFIX "%s", pci_name(port));
+ cdev = thermal_cooling_device_register(name, pcie_cdev, &pcie_cooling_ops);
+ kfree(name);
+
+ return cdev;
+}
+
+void pcie_cooling_device_unregister(struct thermal_cooling_device *cdev)
+{
+ struct pcie_cooling_device *pcie_cdev = cdev->devdata;
+
+ thermal_cooling_device_unregister(cdev);
+ kfree(pcie_cdev);
+}
+
+/* For bus_speed <-> state arithmetic */
+static_assert(PCIE_SPEED_2_5GT + 1 == PCIE_SPEED_5_0GT);
+static_assert(PCIE_SPEED_5_0GT + 1 == PCIE_SPEED_8_0GT);
+static_assert(PCIE_SPEED_8_0GT + 1 == PCIE_SPEED_16_0GT);
+static_assert(PCIE_SPEED_16_0GT + 1 == PCIE_SPEED_32_0GT);
+static_assert(PCIE_SPEED_32_0GT + 1 == PCIE_SPEED_64_0GT);
+
+MODULE_AUTHOR("Ilpo Järvinen <[email protected]>");
+MODULE_DESCRIPTION("PCIe cooling driver");
diff --git a/include/linux/pci-bwctrl.h b/include/linux/pci-bwctrl.h
index 8eae09bd03b5..366445517b72 100644
--- a/include/linux/pci-bwctrl.h
+++ b/include/linux/pci-bwctrl.h
@@ -11,7 +11,23 @@
#include <linux/pci.h>
struct pcie_device;
+struct thermal_cooling_device;
int bwctrl_set_current_speed(struct pcie_device *srv, enum pci_bus_speed speed);
+#ifdef CONFIG_PCIE_THERMAL
+struct thermal_cooling_device *pcie_cooling_device_register(struct pci_dev *port,
+ struct pcie_device *pdev);
+void pcie_cooling_device_unregister(struct thermal_cooling_device *cdev);
+#else
+static inline struct thermal_cooling_device *pcie_cooling_device_register(struct pci_dev *port,
+ struct pcie_device *pdev)
+{
+ return NULL;
+}
+static inline void pcie_cooling_device_unregister(struct thermal_cooling_device *cdev)
+{
+}
+#endif
+
#endif
--
2.30.2
On Fri, Sep 29, 2023 at 02:57:21PM +0300, Ilpo Järvinen wrote:
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -16569,6 +16569,13 @@ F: include/linux/pci*
> F: include/uapi/linux/pci*
> F: lib/pci*
>
> +PCIE BANDWIDTH CONTROLLER
> +M: Ilpo Järvinen <[email protected]>
> +L: [email protected]
> +S: Supported
> +F: drivers/pci/pcie/bwctrl.c
> +F: include/linux/pci-bwctrl.h
> +
Maybe add this already in the preceding patch.
> --- a/drivers/pci/pcie/Kconfig
> +++ b/drivers/pci/pcie/Kconfig
> @@ -138,12 +138,13 @@ config PCIE_PTM
> is safe to enable even if you don't.
>
> config PCIE_BW
> - bool "PCI Express Bandwidth Change Notification"
> + bool "PCI Express Bandwidth Controller"
I'd fold this change into the preceding patch.
Deleting a line introduced by the preceding patch isn't great.
Never mind that the patch isn't a clean revert that way.
Your approach of making changes to the original version
of the bandwidth monitoring driver and calling out those
changes in the commit message seems fine to me.
> depends on PCIEPORTBUS
> help
> - This enables PCI Express Bandwidth Change Notification. If
> - you know link width or rate changes occur to correct unreliable
> - links, you may answer Y.
> + This enables PCI Express Bandwidth Controller. The Bandwidth
> + Controller allows controlling PCIe link speed and listens for Link
> + Speed Change Notifications. If you know link width or rate changes
> + occur to correct unreliable links, you may answer Y.
It would be neater if you could avoid deleting lines introduced by
the preceding patch. Ideally you'd add a new paragraph, thus only
add new lines but not delete any existing ones.
> --- a/drivers/pci/pcie/bwctrl.c
> +++ b/drivers/pci/pcie/bwctrl.c
> @@ -1,14 +1,16 @@
> // SPDX-License-Identifier: GPL-2.0+
> /*
> - * PCI Express Link Bandwidth Notification services driver
> + * PCIe bandwidth controller
> + *
Again, fold this change into the preceding patch please.
> * Author: Alexandru Gagniuc <[email protected]>
> *
> * Copyright (C) 2019, Dell Inc
> + * Copyright (C) 2023 Intel Corporation.
For extra neatness, drop the comma in the preceding patch and
the period in this patch.
> - * The PCIe Link Bandwidth Notification provides a way to notify the
> - * operating system when the link width or data rate changes. This
> - * capability is required for all root ports and downstream ports
> - * supporting links wider than x1 and/or multiple link speeds.
> + * The PCIe Bandwidth Controller provides a way to alter PCIe link speeds
> + * and notify the operating system when the link width or data rate changes.
> + * The notification capability is required for all Root Ports and Downstream
> + * Ports supporting links wider than x1 and/or multiple link speeds.
Again, adding a new paragraph and not deleting lines introduced by
the preceding patch would be much nicer.
> +#include <linux/bitops.h>
> +#include <linux/errno.h>
> +#include <linux/mutex.h>
> +#include <linux/pci.h>
> +#include <linux/pci-bwctrl.h>
> +#include <linux/types.h>
> +
> +#include <asm/rwonce.h>
> +
Which of these are necessary for functionality introduced by this
patch and which are necessary anyway and should be moved to the
preceding patch?
> +/**
> + * struct bwctrl_service_data - PCIe Port Bandwidth Controller
Code comment at the top of this file says "PCIe bandwidth controller",
use here as well for neatness.
> +static u16 speed2lnkctl2(enum pci_bus_speed speed)
> +{
> + static const u16 speed_conv[] = {
> + [PCIE_SPEED_2_5GT] = PCI_EXP_LNKCTL2_TLS_2_5GT,
> + [PCIE_SPEED_5_0GT] = PCI_EXP_LNKCTL2_TLS_5_0GT,
> + [PCIE_SPEED_8_0GT] = PCI_EXP_LNKCTL2_TLS_8_0GT,
> + [PCIE_SPEED_16_0GT] = PCI_EXP_LNKCTL2_TLS_16_0GT,
> + [PCIE_SPEED_32_0GT] = PCI_EXP_LNKCTL2_TLS_32_0GT,
> + [PCIE_SPEED_64_0GT] = PCI_EXP_LNKCTL2_TLS_64_0GT,
> + };
Looks like this could be a u8 array to save a little bit of memory.
Also, this seems to duplicate pcie_link_speed[] defined in probe.c.
> +static int bwctrl_select_speed(struct pcie_device *srv, enum pci_bus_speed *speed)
> +{
> + struct pci_bus *bus = srv->port->subordinate;
> + u8 speeds, dev_speeds;
> + int i;
> +
> + if (*speed > PCIE_LNKCAP2_SLS2SPEED(bus->pcie_bus_speeds))
> + return -EINVAL;
> +
> + dev_speeds = READ_ONCE(bus->pcie_dev_speeds);
Hm, why is the compiler barrier needed?
> + /* Only the lowest speed can be set when there are no devices */
> + if (!dev_speeds)
> + dev_speeds = PCI_EXP_LNKCAP2_SLS_2_5GB;
Maybe move this to patch [06/10], i.e. set dev->bus->pcie_dev_speeds to
PCI_EXP_LNKCAP2_SLS_2_5GB on removal (instead of 0).
> + speeds = bus->pcie_bus_speeds & dev_speeds;
> + i = BIT(fls(speeds));
> + while (i >= PCI_EXP_LNKCAP2_SLS_2_5GB) {
> + enum pci_bus_speed candidate;
> +
> + if (speeds & i) {
> + candidate = PCIE_LNKCAP2_SLS2SPEED(i);
> + if (candidate <= *speed) {
> + *speed = candidate;
> + return 0;
> + }
> + }
> + i >>= 1;
> + }
Can't we just do something like
supported_speeds = bus->pcie_bus_speeds & dev_speeds;
desired_speeds = GENMASK(pcie_link_speed[*speed], 1);
*speed = BIT(fls(supported_speeds & desired_speeds));
and thus avoid the loop altogether?
> +/**
> + * bwctrl_set_current_speed - Set downstream link speed for PCIe port
> + * @srv: PCIe port
> + * @speed: PCIe bus speed to set
> + *
> + * Attempts to set PCIe port link speed to @speed. As long as @speed is less
> + * than the maximum of what is supported by @srv, the speed is adjusted
> + * downwards to the best speed supported by both the port and device
> + * underneath it.
> + *
> + * Return:
> + * * 0 - on success
> + * * -EINVAL - @speed is higher than the maximum @srv supports
> + * * -ETIMEDOUT - changing link speed took too long
> + * * -EAGAIN - link speed was changed but @speed was not achieved
> + */
So passing a higher speed than what's supported by the Endpoint is fine
but passing a higher speed than what's supported by the Downstream Port
is not? Why the distinction? Maybe it would be more logical to just
pick the maximum of what either supports and use that in lieu of a
too-high speed?
> +int bwctrl_set_current_speed(struct pcie_device *srv, enum pci_bus_speed speed)
> +{
> + struct bwctrl_service_data *data = get_service_data(srv);
> + struct pci_dev *port = srv->port;
> + u16 link_status;
> + int ret;
> +
> + if (WARN_ON_ONCE(!bwctrl_valid_pcie_speed(speed)))
> + return -EINVAL;
> +
> + ret = bwctrl_select_speed(srv, &speed);
> + if (ret < 0)
> + return ret;
How about checking here if the selected speed is equivalent to what's
being used right now and bailing out if so? No need to retrain if
we're already at the desired speed.
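Something like this, perhaps (untested sketch):

	/* Nothing to do if the link already runs at the selected speed */
	if (port->subordinate->cur_bus_speed == speed)
		return 0;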
> @@ -91,16 +242,32 @@ static int pcie_bandwidth_notification_probe(struct pcie_device *srv)
> if (ret)
> return ret;
>
> + data = kzalloc(sizeof(*data), GFP_KERNEL);
> + if (!data) {
> + ret = -ENOMEM;
> + goto free_irq;
> + }
> + mutex_init(&data->set_speed_mutex);
> + set_service_data(srv, data);
> +
I think you should move that further up so that you allocate and populate
the data struct before requesting the IRQ. Otherwise if BIOS has already
enabled link bandwidth notification for some reason, the IRQ handler might
be invoked without the data struct being allocated.
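I.e., roughly like this (untested sketch; the request_irq() arguments are
illustrative placeholders for whatever the probe currently passes):

	data = kzalloc(sizeof(*data), GFP_KERNEL);
	if (!data)
		return -ENOMEM;
	mutex_init(&data->set_speed_mutex);
	set_service_data(srv, data);

	/* Only request the IRQ once the service data is in place */
	ret = request_irq(srv->irq, pcie_bw_notification_irq, IRQF_SHARED,
			  "PCIe bwctrl", srv);
	if (ret) {
		mutex_destroy(&data->set_speed_mutex);
		kfree(data);
		return ret;
	}

	pcie_enable_link_bandwidth_notification(port);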
> --- /dev/null
> +++ b/include/linux/pci-bwctrl.h
> @@ -0,0 +1,17 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * PCIe bandwidth controller
> + *
> + * Copyright (C) 2023 Intel Corporation.
Another trailing period I'd remove.
Thanks,
Lukas
On Fri, Sep 29, 2023 at 02:57:22PM +0300, Ilpo Järvinen wrote:
> @@ -253,8 +255,16 @@ static int pcie_bandwidth_notification_probe(struct pcie_device *srv)
> pcie_enable_link_bandwidth_notification(port);
> pci_info(port, "enabled with IRQ %d\n", srv->irq);
>
> + data->cdev = pcie_cooling_device_register(port, srv);
> + if (IS_ERR(data->cdev)) {
> + ret = PTR_ERR(data->cdev);
> + goto disable_notifications;
> + }
> return 0;
Now wait a minute, if you can't register the cooling device,
you still want to provide accurate link speeds to the user
in sysfs, right? At least that's what you promise in Kconfig.
> --- /dev/null
> +++ b/drivers/thermal/pcie_cooling.c
> @@ -0,0 +1,107 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * PCIe cooling device
> + *
> + * Copyright (C) 2023 Intel Corporation.
Another trailing period I'd remove.
I take it this patch (only) allows manual bandwidth throttling
through sysfs, right? And emergency throttling is introduced
separately on top of this?
Thanks,
Lukas
On Sat, 30 Dec 2023, Lukas Wunner wrote:
> On Fri, Sep 29, 2023 at 02:57:22PM +0300, Ilpo Järvinen wrote:
> > @@ -253,8 +255,16 @@ static int pcie_bandwidth_notification_probe(struct pcie_device *srv)
> > pcie_enable_link_bandwidth_notification(port);
> > pci_info(port, "enabled with IRQ %d\n", srv->irq);
> >
> > + data->cdev = pcie_cooling_device_register(port, srv);
> > + if (IS_ERR(data->cdev)) {
> > + ret = PTR_ERR(data->cdev);
> > + goto disable_notifications;
> > + }
> > return 0;
>
> Now wait a minute, if you can't register the cooling device,
> you still want to provide accurate link speeds to the user
> in sysfs, right? At least that's what you promise in Kconfig.
When the thermal side is not even configured, it returns NULL, which is
not an ERR pointer.
I guess I can change the behavior for the real ERR cases (I was a bit on
the borderline about what to do with those failures).
> > --- /dev/null
> > +++ b/drivers/thermal/pcie_cooling.c
> > @@ -0,0 +1,107 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/*
> > + * PCIe cooling device
> > + *
> > + * Copyright (C) 2023 Intel Corporation.
>
> Another trailing period I'd remove.
>
> I take it this patch (only) allows manual bandwidth throttling
> through sysfs, right? And emergency throttling is introduced
> separately on top of this?
Only sysfs throttling is introduced by this series; there's no emergency
throttling in it. Also, many things have been simplified in this series
because the more complex variants would only be justified by emergency
throttling and would just raise 'why is this and that being done'
questions (e.g., the critical section was enlarged as per your request,
whereas with emergency throttling it wouldn't make sense to wait until the
end of the link training before "emergency throttling" could attempt to
lower the link speed).
--
i.
On Sat, 30 Dec 2023, Lukas Wunner wrote:
> On Fri, Sep 29, 2023 at 02:57:21PM +0300, Ilpo Järvinen wrote:
>
>
> > --- a/drivers/pci/pcie/bwctrl.c
> > +static u16 speed2lnkctl2(enum pci_bus_speed speed)
> > +{
> > + static const u16 speed_conv[] = {
> > + [PCIE_SPEED_2_5GT] = PCI_EXP_LNKCTL2_TLS_2_5GT,
> > + [PCIE_SPEED_5_0GT] = PCI_EXP_LNKCTL2_TLS_5_0GT,
> > + [PCIE_SPEED_8_0GT] = PCI_EXP_LNKCTL2_TLS_8_0GT,
> > + [PCIE_SPEED_16_0GT] = PCI_EXP_LNKCTL2_TLS_16_0GT,
> > + [PCIE_SPEED_32_0GT] = PCI_EXP_LNKCTL2_TLS_32_0GT,
> > + [PCIE_SPEED_64_0GT] = PCI_EXP_LNKCTL2_TLS_64_0GT,
> > + };
>
> Looks like this could be a u8 array to save a little bit of memory.
>
> Also, this seems to duplicate pcie_link_speed[] defined in probe.c.
It's not the same. pcie_link_speed[] is indexed by a different thing
unless you're also suggesting I should do index minus a constant? There is
plenty of arithmetic possible when converting between the types, but the
existing conversion functions don't seem to take advantage of those
properties, so I've been a bit hesitant to add such assumptions myself.
I suppose I used u16 because the register is u16, but you're of course
correct that the values don't take up more than a u8.
> > +static int bwctrl_select_speed(struct pcie_device *srv, enum pci_bus_speed *speed)
> > +{
> > + struct pci_bus *bus = srv->port->subordinate;
> > + u8 speeds, dev_speeds;
> > + int i;
> > +
> > + if (*speed > PCIE_LNKCAP2_SLS2SPEED(bus->pcie_bus_speeds))
> > + return -EINVAL;
> > +
> > + dev_speeds = READ_ONCE(bus->pcie_dev_speeds);
>
> Hm, why is the compiler barrier needed?
It's probably overkill, but there could be a checker that finds this
read is not protected by anything while the value could get updated
concurrently (there's probably already such a checker, as I've seen
patches for data races found with some tool). I suppose the alternative
would be to mark the data race as harmless (because if endpoint removal
clears pcie_dev_speeds, bwctrl will be pretty moot). I just chose to use
READ_ONCE(), which prevents rereading the same value later in this
function, making the function behave consistently regardless of what
occurs in parallel to it on the endpoint side.
If the selected value cannot be set because the endpoint is no longer
there, the other parts of the code will detect that.
So what if I just add a comment on this line explaining the data race and
keep it as is?
> > + /* Only the lowest speed can be set when there are no devices */
> > + if (!dev_speeds)
> > + dev_speeds = PCI_EXP_LNKCAP2_SLS_2_5GB;
>
> Maybe move this to patch [06/10], i.e. set dev->bus->pcie_dev_speeds to
> PCI_EXP_LNKCAP2_SLS_2_5GB on removal (instead of 0).
Okay, I'll set it to 2_5GB on init and removal.
> > + speeds = bus->pcie_bus_speeds & dev_speeds;
> > + i = BIT(fls(speeds));
> > + while (i >= PCI_EXP_LNKCAP2_SLS_2_5GB) {
> > + enum pci_bus_speed candidate;
> > +
> > + if (speeds & i) {
> > + candidate = PCIE_LNKCAP2_SLS2SPEED(i);
> > + if (candidate <= *speed) {
> > + *speed = candidate;
> > + return 0;
> > + }
> > + }
> > + i >>= 1;
> > + }
>
> Can't we just do something like
>
> supported_speeds = bus->pcie_bus_speeds & dev_speeds;
> desired_speeds = GENMASK(pcie_link_speed[*speed], 1);
> *speed = BIT(fls(supported_speeds & desired_speeds));
>
> and thus avoid the loop altogether?
Yes, I can change to a loopless version.
I'll try to create functions for the speed format conversions, though,
rather than open coding them into the logic.
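Roughly along these lines, perhaps (untested sketch, helper names made up;
this relies on the PCIE_SPEED_* enum values being consecutive, which
pcie_cooling.c already static_assert()s):

	/* A single LNKCAP2 Supported Link Speeds bit to enum pci_bus_speed */
	static enum pci_bus_speed lnkcap2_sls_bit2speed(u8 sls_bit)
	{
		return PCIE_SPEED_2_5GT + fls(sls_bit) -
		       fls(PCI_EXP_LNKCAP2_SLS_2_5GB);
	}

	/* And the reverse: enum pci_bus_speed to the corresponding SLS bit */
	static u8 speed2lnkcap2_sls_bit(enum pci_bus_speed speed)
	{
		return PCI_EXP_LNKCAP2_SLS_2_5GB << (speed - PCIE_SPEED_2_5GT);
	}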
> > @@ -91,16 +242,32 @@ static int pcie_bandwidth_notification_probe(struct pcie_device *srv)
> > if (ret)
> > return ret;
> >
> > + data = kzalloc(sizeof(*data), GFP_KERNEL);
> > + if (!data) {
> > + ret = -ENOMEM;
> > + goto free_irq;
> > + }
> > + mutex_init(&data->set_speed_mutex);
> > + set_service_data(srv, data);
> > +
>
> I think you should move that further up so that you allocate and populate
> the data struct before requesting the IRQ. Otherwise if BIOS has already
> enabled link bandwidth notification for some reason, the IRQ handler might
> be invoked without the data struct being allocated.
Sure, I don't know why I missed that possibility.
--
i.
On Mon, Jan 01, 2024 at 08:12:43PM +0200, Ilpo Järvinen wrote:
> On Sat, 30 Dec 2023, Lukas Wunner wrote:
> > On Fri, Sep 29, 2023 at 02:57:21PM +0300, Ilpo Järvinen wrote:
> > > --- a/drivers/pci/pcie/bwctrl.c
> > > +static u16 speed2lnkctl2(enum pci_bus_speed speed)
> > > +{
> > > + static const u16 speed_conv[] = {
> > > + [PCIE_SPEED_2_5GT] = PCI_EXP_LNKCTL2_TLS_2_5GT,
> > > + [PCIE_SPEED_5_0GT] = PCI_EXP_LNKCTL2_TLS_5_0GT,
> > > + [PCIE_SPEED_8_0GT] = PCI_EXP_LNKCTL2_TLS_8_0GT,
> > > + [PCIE_SPEED_16_0GT] = PCI_EXP_LNKCTL2_TLS_16_0GT,
> > > + [PCIE_SPEED_32_0GT] = PCI_EXP_LNKCTL2_TLS_32_0GT,
> > > + [PCIE_SPEED_64_0GT] = PCI_EXP_LNKCTL2_TLS_64_0GT,
> > > + };
> >
> > Looks like this could be a u8 array to save a little bit of memory.
> >
> > Also, this seems to duplicate pcie_link_speed[] defined in probe.c.
>
> It's not the same. pcie_link_speed[] is indexed by a different thing
Ah, missed that. Those various conversion functions are confusing.
> > > + dev_speeds = READ_ONCE(bus->pcie_dev_speeds);
> >
> > Hm, why is the compiler barrier needed?
>
> It's probably overkill, but there could be a checker that finds this
> read is not protected by anything while the value could get updated
> concurrently
Why should it be updated? It's a static value cached on enumeration
and never changed AFAICS.
> If the selected value cannot be set because the endpoint is no longer
> there, the other parts of the code will detect that.
If the endpoint is hot-unplugged, the link is down, so retraining
the link will fail.
If the device is replaced concurrently with retraining, then you may
try to retrain to a speed which is too fast or too slow for the new
device. Maybe it's necessary to cope with that? Basically find the
pci_dev on the bus with the device/function id and check if the pci_dev
pointer still points to the same location. Another idea would be to
hook up bwctrl with pciehp so that pciehp tells bwctrl the device is
gone and any speed changes should be aborted. We've hooked up DPC
with pciehp in a similar way.
> So what if I just add a comment on this line explaining the data race
> and keep it as is?
I'd just drop the READ_ONCE() here if there's not a good reason for it.
Thanks,
Lukas