This series provides knobs which will enable users to
minimize memory consumption of mlx5 Functions (PF/VF/SF).
mlx5 exposes two new generic devlink resources for EQ size
configuration and uses devlink generic param max_macs.
Patches summary:
- Patch-1 Provides I/O EQ size resource which enables to save
up to 128KB.
- Patch-2 Provides event EQ size resource which enables to save up to
512KB.
- Patch-3 Clarify max_macs param.
- Patch-4 Provides max_macs param which enables to save up to 70KB
In total, this series can save up to 700KB per Function.
Shay Drory (4):
net/mlx5: Let user configure io_eq_size resource
net/mlx5: Let user configure event_eq_size resource
devlink: Clarifies max_macs generic devlink param
net/mlx5: Let user configure max_macs generic param
.../networking/devlink/devlink-params.rst | 6 +-
.../networking/devlink/devlink-resource.rst | 4 +
Documentation/networking/devlink/mlx5.rst | 4 +
.../net/ethernet/mellanox/mlx5/core/Makefile | 2 +-
.../net/ethernet/mellanox/mlx5/core/devlink.c | 67 ++++++++++++++++
.../net/ethernet/mellanox/mlx5/core/devlink.h | 12 +++
.../ethernet/mellanox/mlx5/core/devlink_res.c | 79 +++++++++++++++++++
drivers/net/ethernet/mellanox/mlx5/core/eq.c | 5 +-
.../net/ethernet/mellanox/mlx5/core/main.c | 21 +++++
include/linux/mlx5/driver.h | 4 -
include/linux/mlx5/eq.h | 1 -
include/linux/mlx5/mlx5_ifc.h | 2 +-
include/net/devlink.h | 2 +
13 files changed, 198 insertions(+), 11 deletions(-)
create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/devlink_res.c
--
2.21.3
Currently, max_macs is taking 70Kbytes of memory per function. This
size is not needed in all use cases, and is critical with large scale.
Hence, allow user to configure the number of max_macs.
For example, to reduce the number of max_macs to 1, execute::
$ devlink dev param set pci/0000:00:0b.0 name max_macs value 1 \
cmode driverinit
$ devlink dev reload pci/0000:00:0b.0
Signed-off-by: Shay Drory <[email protected]>
Reviewed-by: Moshe Shemesh <[email protected]>
Reviewed-by: Parav Pandit <[email protected]>
---
Documentation/networking/devlink/mlx5.rst | 4 ++
.../net/ethernet/mellanox/mlx5/core/devlink.c | 67 +++++++++++++++++++
.../net/ethernet/mellanox/mlx5/core/main.c | 18 +++++
include/linux/mlx5/mlx5_ifc.h | 2 +-
4 files changed, 90 insertions(+), 1 deletion(-)
diff --git a/Documentation/networking/devlink/mlx5.rst b/Documentation/networking/devlink/mlx5.rst
index 4e4b97f7971a..c44043bcae72 100644
--- a/Documentation/networking/devlink/mlx5.rst
+++ b/Documentation/networking/devlink/mlx5.rst
@@ -14,8 +14,12 @@ Parameters
* - Name
- Mode
+ - Validation
* - ``enable_roce``
- driverinit
+ * - ``max_macs``
+ - driverinit
+ - The range is between 1 and 2^31. Only power of 2 values are supported.
The ``mlx5`` driver also implements the following driver-specific
parameters.
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/devlink.c b/drivers/net/ethernet/mellanox/mlx5/core/devlink.c
index 1c98652b244a..7383b727f49e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/devlink.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/devlink.c
@@ -752,6 +752,66 @@ static void mlx5_devlink_auxdev_params_unregister(struct devlink *devlink)
mlx5_devlink_eth_param_unregister(devlink);
}
+static int mlx5_devlink_max_uc_list_validate(struct devlink *devlink, u32 id,
+ union devlink_param_value val,
+ struct netlink_ext_ack *extack)
+{
+ struct mlx5_core_dev *dev = devlink_priv(devlink);
+
+ if (val.vu32 == 0) {
+ NL_SET_ERR_MSG_MOD(extack, "max_macs value must be greater than 0");
+ return -EINVAL;
+ }
+
+ if (!is_power_of_2(val.vu32)) {
+ NL_SET_ERR_MSG_MOD(extack, "Only power of 2 values are supported for max_macs");
+ return -EINVAL;
+ }
+
+ if (ilog2(val.vu32) >
+ MLX5_CAP_GEN_MAX(dev, log_max_current_uc_list)) {
+ NL_SET_ERR_MSG_MOD(extack, "max_macs value is out of the supported range");
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
+static const struct devlink_param max_uc_list_param =
+ DEVLINK_PARAM_GENERIC(MAX_MACS, BIT(DEVLINK_PARAM_CMODE_DRIVERINIT),
+ NULL, NULL, mlx5_devlink_max_uc_list_validate);
+
+static int mlx5_devlink_max_uc_list_param_register(struct devlink *devlink)
+{
+ struct mlx5_core_dev *dev = devlink_priv(devlink);
+ union devlink_param_value value;
+ int err;
+
+ if (!MLX5_CAP_GEN_MAX(dev, log_max_current_uc_list_wr_supported))
+ return 0;
+
+ err = devlink_param_register(devlink, &max_uc_list_param);
+ if (err)
+ return err;
+
+ value.vu32 = 1 << MLX5_CAP_GEN(dev, log_max_current_uc_list);
+ devlink_param_driverinit_value_set(devlink,
+ DEVLINK_PARAM_GENERIC_ID_MAX_MACS,
+ value);
+ return 0;
+}
+
+static void
+mlx5_devlink_max_uc_list_param_unregister(struct devlink *devlink)
+{
+ struct mlx5_core_dev *dev = devlink_priv(devlink);
+
+ if (!MLX5_CAP_GEN(dev, log_max_current_uc_list_wr_supported))
+ return;
+
+ devlink_param_unregister(devlink, &max_uc_list_param);
+}
+
#define MLX5_TRAP_DROP(_id, _group_id) \
DEVLINK_TRAP_GENERIC(DROP, DROP, _id, \
DEVLINK_TRAP_GROUP_GENERIC_ID_##_group_id, \
@@ -815,11 +875,17 @@ int mlx5_devlink_register(struct devlink *devlink)
if (err)
goto traps_reg_err;
+ err = mlx5_devlink_max_uc_list_param_register(devlink);
+ if (err)
+ goto uc_list_reg_err;
+
if (!mlx5_core_is_mp_slave(dev))
devlink_set_features(devlink, DEVLINK_F_RELOAD);
return 0;
+uc_list_reg_err:
+ mlx5_devlink_traps_unregister(devlink);
traps_reg_err:
mlx5_devlink_auxdev_params_unregister(devlink);
auxdev_reg_err:
@@ -830,6 +896,7 @@ int mlx5_devlink_register(struct devlink *devlink)
void mlx5_devlink_unregister(struct devlink *devlink)
{
+ mlx5_devlink_max_uc_list_param_unregister(devlink);
mlx5_devlink_traps_unregister(devlink);
mlx5_devlink_auxdev_params_unregister(devlink);
devlink_params_unregister(devlink, mlx5_devlink_params,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index f55a89bd3736..a6819575854f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -484,10 +484,23 @@ static int handle_hca_cap_odp(struct mlx5_core_dev *dev, void *set_ctx)
return set_caps(dev, set_ctx, MLX5_SET_HCA_CAP_OP_MOD_ODP);
}
+static int max_uc_list_get_devlink_param(struct mlx5_core_dev *dev)
+{
+ struct devlink *devlink = priv_to_devlink(dev);
+ union devlink_param_value val;
+ int err;
+
+ err = devlink_param_driverinit_value_get(devlink,
+ DEVLINK_PARAM_GENERIC_ID_MAX_MACS,
+ &val);
+ return err ? err : val.vu32;
+}
+
static int handle_hca_cap(struct mlx5_core_dev *dev, void *set_ctx)
{
struct mlx5_profile *prof = &dev->profile;
void *set_hca_cap;
+ int max_uc_list;
int err;
err = mlx5_core_get_caps(dev, MLX5_CAP_GENERAL);
@@ -561,6 +574,11 @@ static int handle_hca_cap(struct mlx5_core_dev *dev, void *set_ctx)
if (MLX5_CAP_GEN(dev, roce_rw_supported))
MLX5_SET(cmd_hca_cap, set_hca_cap, roce, mlx5_is_roce_init_enabled(dev));
+ max_uc_list = max_uc_list_get_devlink_param(dev);
+ if (max_uc_list > 0)
+ MLX5_SET(cmd_hca_cap, set_hca_cap, log_max_current_uc_list,
+ ilog2(max_uc_list));
+
return set_caps(dev, set_ctx, MLX5_SET_HCA_CAP_OP_MOD_GENERAL_DEVICE);
}
diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h
index 3636df90899a..d3899fc33fd7 100644
--- a/include/linux/mlx5/mlx5_ifc.h
+++ b/include/linux/mlx5/mlx5_ifc.h
@@ -1621,7 +1621,7 @@ struct mlx5_ifc_cmd_hca_cap_bits {
u8 ext_stride_num_range[0x1];
u8 roce_rw_supported[0x1];
- u8 reserved_at_3a2[0x1];
+ u8 log_max_current_uc_list_wr_supported[0x1];
u8 log_max_stride_sz_rq[0x5];
u8 reserved_at_3a8[0x3];
u8 log_min_stride_sz_rq[0x5];
--
2.21.3
On Tue, 30 Nov 2021 17:07:02 +0200 Shay Drory wrote:
> - Patch-1 Provides I/O EQ size resource which enables to save
> up to 128KB.
> - Patch-2 Provides event EQ size resource which enables to save up to
> 512KB.
Why is something allocated in host memory a device resource? ????
Did you analyze if others may need this?
On 11/30/2021 21:39, Jakub Kicinski wrote:
> On Tue, 30 Nov 2021 17:07:02 +0200 Shay Drory wrote:
>> - Patch-1 Provides I/O EQ size resource which enables to save
>> up to 128KB.
>> - Patch-2 Provides event EQ size resource which enables to save up to
>> 512KB.
> Why is something allocated in host memory a device resource? ????
EQ resides in the host memory. It is RO for host driver, RW by device.
When interrupt is generated EQ entry is placed by device and read by driver.
It indicates about what event occurred such as CQE, async and more.
> Did you analyze if others may need this?
So far no feedback by other vendors.
The resources are implemented in generic way, if other vendors would
like to implement them.
On Wed, 1 Dec 2021 10:22:17 +0200 Shay Drory wrote:
> On 11/30/2021 21:39, Jakub Kicinski wrote:
> > On Tue, 30 Nov 2021 17:07:02 +0200 Shay Drory wrote:
> >> - Patch-1 Provides I/O EQ size resource which enables to save
> >> up to 128KB.
> >> - Patch-2 Provides event EQ size resource which enables to save up to
> >> 512KB.
> > Why is something allocated in host memory a device resource? ????
>
> EQ resides in the host memory. It is RO for host driver, RW by device.
> When interrupt is generated EQ entry is placed by device and read by driver.
> It indicates about what event occurred such as CQE, async and more.
I understand that. My point was the resource which is being consumed
here is _host_ memory. Is there precedent for configuring host memory
consumption via devlink resource?
I'd even question whether this belongs in devlink in the first place.
It is not global device config in any way. If devlink represents the
entire device it's rather strange to have a case where main instance
limits a size of some resource by VFs and other endpoints can still
choose whatever they want.
> > Did you analyze if others may need this?
>
> So far no feedback by other vendors.
> The resources are implemented in generic way, if other vendors would
> like to implement them.
Well, I was hoping you'd look around, but maybe that's too much to ask
of a vendor.
On Thu, 2021-12-02 at 09:31 -0800, Jakub Kicinski wrote:
> On Wed, 1 Dec 2021 10:22:17 +0200 Shay Drory wrote:
> > On 11/30/2021 21:39, Jakub Kicinski wrote:
> > > On Tue, 30 Nov 2021 17:07:02 +0200 Shay Drory wrote:
> > > > - Patch-1 Provides I/O EQ size resource which enables to save
> > > > up to 128KB.
> > > > - Patch-2 Provides event EQ size resource which enables to
> > > > save up to
> > > > 512KB.
> > > Why is something allocated in host memory a device resource? ????
> >
> > EQ resides in the host memory. It is RO for host driver, RW by
> > device.
> > When interrupt is generated EQ entry is placed by device and read
> > by driver.
> > It indicates about what event occurred such as CQE, async and more.
>
> I understand that. My point was the resource which is being consumed
> here is _host_ memory. Is there precedent for configuring host memory
> consumption via devlink resource?
>
it's a device resource size nonetheless, devlink resource API makes
total sense.
> I'd even question whether this belongs in devlink in the first place.
> It is not global device config in any way. If devlink represents the
> entire device it's rather strange to have a case where main instance
> limits a size of some resource by VFs and other endpoints can still
> choose whatever they want.
>
This resource is per function instance, we have devlink instance per
function, e.g. in the VM, there is a VF devlink instance the VM user
can use to control own VF resources. in the PF/Hypervisor, the only
devlink representation of the VF will be devlink port function (used
for other purposes)
for example:
A tenant can fine-tune a resource size tailored to their needs via the
VF's own devlink instance.
An admin can only control or restrict a max size of a resource for a
given port function ( the devlink instance that represents the VF in
the hypervisor). (note: this patchset is not about that)
> > > Did you analyze if others may need this?
> >
> > So far no feedback by other vendors.
> > The resources are implemented in generic way, if other vendors
> > would
> > like to implement them.
>
> Well, I was hoping you'd look around, but maybe that's too much to
> ask
> of a vendor.
We looked, eq is a common object among many other drivers.
and DEVLINK_PARAM_GENERIC_ID_MAX_MACS is already a devlink generic
param, and i am sure other vendors have limited macs per VF :) ..
so this applies to all vendors even if they don't advertise it.
On Thu, 2 Dec 2021 18:55:37 +0000 Saeed Mahameed wrote:
> On Thu, 2021-12-02 at 09:31 -0800, Jakub Kicinski wrote:
> > On Wed, 1 Dec 2021 10:22:17 +0200 Shay Drory wrote:
> > > EQ resides in the host memory. It is RO for host driver, RW by
> > > device.
> > > When interrupt is generated EQ entry is placed by device and read
> > > by driver.
> > > It indicates about what event occurred such as CQE, async and more.
> >
> > I understand that. My point was the resource which is being consumed
> > here is _host_ memory. Is there precedent for configuring host memory
> > consumption via devlink resource?
>
> it's a device resource size nonetheless, devlink resource API makes
> total sense.
I disagree. Devlink resources were originally written to partition
finite device resources. You're just sizing a queue here.
> > I'd even question whether this belongs in devlink in the first place.
> > It is not global device config in any way. If devlink represents the
> > entire device it's rather strange to have a case where main instance
> > limits a size of some resource by VFs and other endpoints can still
> > choose whatever they want.
>
> This resource is per function instance, we have devlink instance per
> function, e.g. in the VM, there is a VF devlink instance the VM user
> can use to control own VF resources. in the PF/Hypervisor, the only
> devlink representation of the VF will be devlink port function (used
> for other purposes)
>
> for example:
>
> A tenant can fine-tune a resource size tailored to their needs via the
> VF's own devlink instance.
Yeah, because it's a device resource. Tenant can consume their host
DRAM in any way they find suitable.
> An admin can only control or restrict a max size of a resource for a
> given port function ( the devlink instance that represents the VF in
> the hypervisor). (note: this patchset is not about that)
>
> > > So far no feedback by other vendors.
> > > The resources are implemented in generic way, if other vendors
> > > would
> > > like to implement them.
> >
> > Well, I was hoping you'd look around, but maybe that's too much to
> > ask of a vendor.
>
> We looked, eq is a common object among many other drivers.
> and DEVLINK_PARAM_GENERIC_ID_MAX_MACS is already a devlink generic
> param, and i am sure other vendors have limited macs per VF :) ..
> so this applies to all vendors even if they don't advertise it.
Yeah, if you're not willing to model the Event Queue as a queue using
params seems like a better idea than abusing resources.
Fri, Dec 03, 2021 at 02:28:03AM CET, [email protected] wrote:
>On Thu, 2 Dec 2021 18:55:37 +0000 Saeed Mahameed wrote:
>> On Thu, 2021-12-02 at 09:31 -0800, Jakub Kicinski wrote:
>> > On Wed, 1 Dec 2021 10:22:17 +0200 Shay Drory wrote:
>> > > EQ resides in the host memory. It is RO for host driver, RW by
>> > > device.
>> > > When interrupt is generated EQ entry is placed by device and read
>> > > by driver.
>> > > It indicates about what event occurred such as CQE, async and more.
>> >
>> > I understand that. My point was the resource which is being consumed
>> > here is _host_ memory. Is there precedent for configuring host memory
>> > consumption via devlink resource?
>>
>> it's a device resource size nonetheless, devlink resource API makes
>> total sense.
>
>I disagree. Devlink resources were originally written to partition
>finite device resources. You're just sizing a queue here.
>
>> > I'd even question whether this belongs in devlink in the first place.
>> > It is not global device config in any way. If devlink represents the
>> > entire device it's rather strange to have a case where main instance
>> > limits a size of some resource by VFs and other endpoints can still
>> > choose whatever they want.
>>
>> This resource is per function instance, we have devlink instance per
>> function, e.g. in the VM, there is a VF devlink instance the VM user
>> can use to control own VF resources. in the PF/Hypervisor, the only
>> devlink representation of the VF will be devlink port function (used
>> for other purposes)
>>
>> for example:
>>
>> A tenant can fine-tune a resource size tailored to their needs via the
>> VF's own devlink instance.
>
>Yeah, because it's a device resource. Tenant can consume their host
>DRAM in any way they find suitable.
>
>> An admin can only control or restrict a max size of a resource for a
>> given port function ( the devlink instance that represents the VF in
>> the hypervisor). (note: this patchset is not about that)
>>
>> > > So far no feedback by other vendors.
>> > > The resources are implemented in generic way, if other vendors
>> > > would
>> > > like to implement them.
>> >
>> > Well, I was hoping you'd look around, but maybe that's too much to
>> > ask of a vendor.
>>
>> We looked, eq is a common object among many other drivers.
>> and DEVLINK_PARAM_GENERIC_ID_MAX_MACS is already a devlink generic
>> param, and i am sure other vendors have limited macs per VF :) ..
>> so this applies to all vendors even if they don't advertise it.
>
>Yeah, if you're not willing to model the Event Queue as a queue using
>params seems like a better idea than abusing resources.
I think you are right. On second thought, param look like a better fit.