The last RFC in August 2022 contained a proposal for the UAPI of both
TSN standards which together form Frame Preemption (802.1Q and 802.3):
https://lore.kernel.org/netdev/[email protected]/
It wasn't clear at the time whether the 802.1Q portion of Frame Preemption
should be exposed via the tc qdisc (mqprio, taprio) or via some other
layer (perhaps also ethtool like the 802.3 portion, or dcbnl), even
though the options were discussed extensively, with pros and cons:
https://lore.kernel.org/netdev/[email protected]/
So the 802.3 portion got submitted separately and finally was accepted:
https://lore.kernel.org/netdev/[email protected]/
leaving the only remaining question: how do we expose the 802.1Q bits?
This series proposes that we use the Qdisc layer, through separate
(albeit very similar) UAPI in mqprio and taprio, and that both these
Qdiscs pass the information down to the offloading device driver through
the common mqprio offload structure (which taprio also passes).
An implementation is provided for the NXP LS1028A on-board Ethernet
endpoint (enetc). Previous versions also contained support for its
embedded switch (felix), but this needs more work and will be submitted
separately.
Changes in v4:
- removed felix driver support
Changes in v3:
- fixed build error caused by "default" switch case with no code
- reordered patches: bug fix first, driver changes all at the end
- changed links from patchwork to lore
- passed extack down to ndo_setup_tc() for mqprio and taprio, and made
use of it in ocelot
v2 at:
https://lore.kernel.org/netdev/[email protected]/
Changes in v2:
- add missing EXPORT_SYMBOL_GPL(ethtool_dev_mm_supported)
- slightly reword some commit messages
- move #include <linux/ethtool_netlink.h> to the respective patch in
mqprio
- remove self-evident comment "only for dump and offloading" in mqprio
v1 at:
https://lore.kernel.org/netdev/[email protected]/
Vladimir Oltean (9):
net: ethtool: create and export ethtool_dev_mm_supported()
net/sched: mqprio: simplify handling of nlattr portion of TCA_OPTIONS
net/sched: mqprio: add extack to mqprio_parse_nlattr()
net/sched: mqprio: add an extack message to mqprio_parse_opt()
net/sched: pass netlink extack to mqprio and taprio offload
net/sched: mqprio: allow per-TC user input of FP adminStatus
net/sched: taprio: allow per-TC user input of FP adminStatus
net: enetc: rename "mqprio" to "qopt"
net: enetc: add support for preemptible traffic classes
drivers/net/ethernet/freescale/enetc/enetc.c | 31 ++-
drivers/net/ethernet/freescale/enetc/enetc.h | 1 +
.../net/ethernet/freescale/enetc/enetc_hw.h | 4 +
include/linux/ethtool_netlink.h | 6 +
include/net/pkt_sched.h | 3 +
include/uapi/linux/pkt_sched.h | 17 ++
net/ethtool/mm.c | 23 +++
net/sched/sch_mqprio.c | 187 +++++++++++++++---
net/sched/sch_mqprio_lib.c | 14 ++
net/sched/sch_mqprio_lib.h | 2 +
net/sched/sch_taprio.c | 77 ++++++--
11 files changed, 323 insertions(+), 42 deletions(-)
--
2.34.1
Create a wrapper over __ethtool_dev_mm_supported() which also calls
ethnl_ops_begin() and ethnl_ops_complete(). It can be used by other code
layers, such as tc, to make sure that preemptible TCs are supported
(this is true if an underlying MAC Merge layer exists).
Signed-off-by: Vladimir Oltean <[email protected]>
Reviewed-by: Ferenc Fejes <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
---
v3->v4: none
v2->v3: none
v1->v2:
- don't touch net/sched/sch_mqprio.c in this patch
- add missing EXPORT_SYMBOL_GPL(ethtool_dev_mm_supported)
include/linux/ethtool_netlink.h | 6 ++++++
net/ethtool/mm.c | 23 +++++++++++++++++++++++
2 files changed, 29 insertions(+)
diff --git a/include/linux/ethtool_netlink.h b/include/linux/ethtool_netlink.h
index 17003b385756..fae0dfb9a9c8 100644
--- a/include/linux/ethtool_netlink.h
+++ b/include/linux/ethtool_netlink.h
@@ -39,6 +39,7 @@ void ethtool_aggregate_pause_stats(struct net_device *dev,
struct ethtool_pause_stats *pause_stats);
void ethtool_aggregate_rmon_stats(struct net_device *dev,
struct ethtool_rmon_stats *rmon_stats);
+bool ethtool_dev_mm_supported(struct net_device *dev);
#else
static inline int ethnl_cable_test_alloc(struct phy_device *phydev, u8 cmd)
@@ -112,5 +113,10 @@ ethtool_aggregate_rmon_stats(struct net_device *dev,
{
}
+static inline bool ethtool_dev_mm_supported(struct net_device *dev)
+{
+ return false;
+}
+
#endif /* IS_ENABLED(CONFIG_ETHTOOL_NETLINK) */
#endif /* _LINUX_ETHTOOL_NETLINK_H_ */
diff --git a/net/ethtool/mm.c b/net/ethtool/mm.c
index fce3cc2734f9..e00d7d5cea7e 100644
--- a/net/ethtool/mm.c
+++ b/net/ethtool/mm.c
@@ -249,3 +249,26 @@ bool __ethtool_dev_mm_supported(struct net_device *dev)
return !ret;
}
+
+bool ethtool_dev_mm_supported(struct net_device *dev)
+{
+ const struct ethtool_ops *ops = dev->ethtool_ops;
+ bool supported;
+ int ret;
+
+ ASSERT_RTNL();
+
+ if (!ops)
+ return false;
+
+ ret = ethnl_ops_begin(dev);
+ if (ret < 0)
+ return false;
+
+ supported = __ethtool_dev_mm_supported(dev);
+
+ ethnl_ops_complete(dev);
+
+ return supported;
+}
+EXPORT_SYMBOL_GPL(ethtool_dev_mm_supported);
--
2.34.1
In commit 4e8b86c06269 ("mqprio: Introduce new hardware offload mode and
shaper in mqprio"), the TCA_OPTIONS format of mqprio was extended to
contain a fixed portion (of size NLA_ALIGN(sizeof struct tc_mqprio_qopt))
and a variable portion of other nlattrs (in the TCA_MQPRIO_* type space)
following immediately afterwards.
In commit feb2cf3dcfb9 ("net/sched: mqprio: refactor nlattr parsing to a
separate function"), we've moved the nlattr handling to a smaller
function, but yet, a small parse_attr() still remains, and the larger
mqprio_parse_nlattr() still does not have access to the beginning, and
the length, of the TCA_OPTIONS region containing these other nlattrs.
In a future change, the mqprio qdisc will need to iterate through this
nlattr region to discover other attributes, so eliminate parse_attr()
and add 2 variables in mqprio_parse_nlattr() which hold the beginning
and the length of the nlattr range.
We avoid the need to memset when nlattr_opt_len has insufficient length
by pre-initializing the table "tb".
Signed-off-by: Vladimir Oltean <[email protected]>
Reviewed-by: Ferenc Fejes <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
---
v1->v4: none
net/sched/sch_mqprio.c | 32 +++++++++++++-------------------
1 file changed, 13 insertions(+), 19 deletions(-)
diff --git a/net/sched/sch_mqprio.c b/net/sched/sch_mqprio.c
index 48ed87b91086..94093971da5e 100644
--- a/net/sched/sch_mqprio.c
+++ b/net/sched/sch_mqprio.c
@@ -146,32 +146,26 @@ static const struct nla_policy mqprio_policy[TCA_MQPRIO_MAX + 1] = {
[TCA_MQPRIO_MAX_RATE64] = { .type = NLA_NESTED },
};
-static int parse_attr(struct nlattr *tb[], int maxtype, struct nlattr *nla,
- const struct nla_policy *policy, int len)
-{
- int nested_len = nla_len(nla) - NLA_ALIGN(len);
-
- if (nested_len >= nla_attr_size(0))
- return nla_parse_deprecated(tb, maxtype,
- nla_data(nla) + NLA_ALIGN(len),
- nested_len, policy, NULL);
-
- memset(tb, 0, sizeof(struct nlattr *) * (maxtype + 1));
- return 0;
-}
-
+/* Parse the other netlink attributes that represent the payload of
+ * TCA_OPTIONS, which are appended right after struct tc_mqprio_qopt.
+ */
static int mqprio_parse_nlattr(struct Qdisc *sch, struct tc_mqprio_qopt *qopt,
struct nlattr *opt)
{
+ struct nlattr *nlattr_opt = nla_data(opt) + NLA_ALIGN(sizeof(*qopt));
+ int nlattr_opt_len = nla_len(opt) - NLA_ALIGN(sizeof(*qopt));
struct mqprio_sched *priv = qdisc_priv(sch);
- struct nlattr *tb[TCA_MQPRIO_MAX + 1];
+ struct nlattr *tb[TCA_MQPRIO_MAX + 1] = {};
struct nlattr *attr;
int i, rem, err;
- err = parse_attr(tb, TCA_MQPRIO_MAX, opt, mqprio_policy,
- sizeof(*qopt));
- if (err < 0)
- return err;
+ if (nlattr_opt_len >= nla_attr_size(0)) {
+ err = nla_parse_deprecated(tb, TCA_MQPRIO_MAX, nlattr_opt,
+ nlattr_opt_len, mqprio_policy,
+ NULL);
+ if (err < 0)
+ return err;
+ }
if (!qopt->hw)
return -EINVAL;
--
2.34.1
Netlink attribute parsing in mqprio is a minesweeper game, with many
options having the possibility of being passed incorrectly and the user
being none the wiser.
Try to make errors less sour by giving user space some information
regarding what went wrong.
Signed-off-by: Vladimir Oltean <[email protected]>
Reviewed-by: Ferenc Fejes <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
---
v1->v4: none
net/sched/sch_mqprio.c | 30 +++++++++++++++++++++++-------
1 file changed, 23 insertions(+), 7 deletions(-)
diff --git a/net/sched/sch_mqprio.c b/net/sched/sch_mqprio.c
index 94093971da5e..5a9261c38b95 100644
--- a/net/sched/sch_mqprio.c
+++ b/net/sched/sch_mqprio.c
@@ -150,7 +150,8 @@ static const struct nla_policy mqprio_policy[TCA_MQPRIO_MAX + 1] = {
* TCA_OPTIONS, which are appended right after struct tc_mqprio_qopt.
*/
static int mqprio_parse_nlattr(struct Qdisc *sch, struct tc_mqprio_qopt *qopt,
- struct nlattr *opt)
+ struct nlattr *opt,
+ struct netlink_ext_ack *extack)
{
struct nlattr *nlattr_opt = nla_data(opt) + NLA_ALIGN(sizeof(*qopt));
int nlattr_opt_len = nla_len(opt) - NLA_ALIGN(sizeof(*qopt));
@@ -167,8 +168,11 @@ static int mqprio_parse_nlattr(struct Qdisc *sch, struct tc_mqprio_qopt *qopt,
return err;
}
- if (!qopt->hw)
+ if (!qopt->hw) {
+ NL_SET_ERR_MSG(extack,
+ "mqprio TCA_OPTIONS can only contain netlink attributes in hardware mode");
return -EINVAL;
+ }
if (tb[TCA_MQPRIO_MODE]) {
priv->flags |= TC_MQPRIO_F_MODE;
@@ -181,13 +185,19 @@ static int mqprio_parse_nlattr(struct Qdisc *sch, struct tc_mqprio_qopt *qopt,
}
if (tb[TCA_MQPRIO_MIN_RATE64]) {
- if (priv->shaper != TC_MQPRIO_SHAPER_BW_RATE)
+ if (priv->shaper != TC_MQPRIO_SHAPER_BW_RATE) {
+ NL_SET_ERR_MSG_ATTR(extack, tb[TCA_MQPRIO_MIN_RATE64],
+ "min_rate accepted only when shaper is in bw_rlimit mode");
return -EINVAL;
+ }
i = 0;
nla_for_each_nested(attr, tb[TCA_MQPRIO_MIN_RATE64],
rem) {
- if (nla_type(attr) != TCA_MQPRIO_MIN_RATE64)
+ if (nla_type(attr) != TCA_MQPRIO_MIN_RATE64) {
+ NL_SET_ERR_MSG_ATTR(extack, attr,
+ "Attribute type expected to be TCA_MQPRIO_MIN_RATE64");
return -EINVAL;
+ }
if (i >= qopt->num_tc)
break;
priv->min_rate[i] = *(u64 *)nla_data(attr);
@@ -197,13 +207,19 @@ static int mqprio_parse_nlattr(struct Qdisc *sch, struct tc_mqprio_qopt *qopt,
}
if (tb[TCA_MQPRIO_MAX_RATE64]) {
- if (priv->shaper != TC_MQPRIO_SHAPER_BW_RATE)
+ if (priv->shaper != TC_MQPRIO_SHAPER_BW_RATE) {
+ NL_SET_ERR_MSG_ATTR(extack, tb[TCA_MQPRIO_MAX_RATE64],
+ "max_rate accepted only when shaper is in bw_rlimit mode");
return -EINVAL;
+ }
i = 0;
nla_for_each_nested(attr, tb[TCA_MQPRIO_MAX_RATE64],
rem) {
- if (nla_type(attr) != TCA_MQPRIO_MAX_RATE64)
+ if (nla_type(attr) != TCA_MQPRIO_MAX_RATE64) {
+ NL_SET_ERR_MSG_ATTR(extack, attr,
+ "Attribute type expected to be TCA_MQPRIO_MAX_RATE64");
return -EINVAL;
+ }
if (i >= qopt->num_tc)
break;
priv->max_rate[i] = *(u64 *)nla_data(attr);
@@ -252,7 +268,7 @@ static int mqprio_init(struct Qdisc *sch, struct nlattr *opt,
len = nla_len(opt) - NLA_ALIGN(sizeof(*qopt));
if (len > 0) {
- err = mqprio_parse_nlattr(sch, qopt, opt);
+ err = mqprio_parse_nlattr(sch, qopt, opt, extack);
if (err)
return err;
}
--
2.34.1
With the multiplexed ndo_setup_tc() model which lacks a first-class
struct netlink_ext_ack * argument, the only way to pass the netlink
extended ACK message down to the device driver is to embed it within the
offload structure.
Do this for struct tc_mqprio_qopt_offload and struct tc_taprio_qopt_offload.
Since struct tc_taprio_qopt_offload also contains a tc_mqprio_qopt_offload
structure, and since device drivers might effectively reuse their mqprio
implementation for the mqprio portion of taprio, we make taprio set the
extack in both offload structures to point at the same netlink extack
message.
In fact, the taprio handling is a bit more tricky, for 2 reasons.
First is because the offload structure has a longer lifetime than the
extack structure. The driver is supposed to populate the extack
synchronously from ndo_setup_tc() and leave it alone afterwards.
To not have any use-after-free surprises, we zero out the extack pointer
when we leave taprio_enable_offload().
The second reason is because taprio does overwrite the extack message on
ndo_setup_tc() error. We need to switch to the weak form of setting an
extack message, which preserves a potential message set by the driver.
Signed-off-by: Vladimir Oltean <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
---
v3->v4: none
v2->v3: patch is new
include/net/pkt_sched.h | 2 ++
net/sched/sch_mqprio.c | 5 ++++-
net/sched/sch_taprio.c | 12 ++++++++++--
3 files changed, 16 insertions(+), 3 deletions(-)
diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
index bb0bd69fb655..b43ed4733455 100644
--- a/include/net/pkt_sched.h
+++ b/include/net/pkt_sched.h
@@ -166,6 +166,7 @@ struct tc_mqprio_caps {
struct tc_mqprio_qopt_offload {
/* struct tc_mqprio_qopt must always be the first element */
struct tc_mqprio_qopt qopt;
+ struct netlink_ext_ack *extack;
u16 mode;
u16 shaper;
u32 flags;
@@ -193,6 +194,7 @@ struct tc_taprio_sched_entry {
struct tc_taprio_qopt_offload {
struct tc_mqprio_qopt_offload mqprio;
+ struct netlink_ext_ack *extack;
u8 enable;
ktime_t base_time;
u64 cycle_time;
diff --git a/net/sched/sch_mqprio.c b/net/sched/sch_mqprio.c
index 9ee5a9a9b9e9..5287ff60b3f9 100644
--- a/net/sched/sch_mqprio.c
+++ b/net/sched/sch_mqprio.c
@@ -33,9 +33,12 @@ static int mqprio_enable_offload(struct Qdisc *sch,
const struct tc_mqprio_qopt *qopt,
struct netlink_ext_ack *extack)
{
- struct tc_mqprio_qopt_offload mqprio = {.qopt = *qopt};
struct mqprio_sched *priv = qdisc_priv(sch);
struct net_device *dev = qdisc_dev(sch);
+ struct tc_mqprio_qopt_offload mqprio = {
+ .qopt = *qopt,
+ .extack = extack,
+ };
int err, i;
switch (priv->mode) {
diff --git a/net/sched/sch_taprio.c b/net/sched/sch_taprio.c
index 1f469861eae3..cbad43019172 100644
--- a/net/sched/sch_taprio.c
+++ b/net/sched/sch_taprio.c
@@ -1520,7 +1520,9 @@ static int taprio_enable_offload(struct net_device *dev,
return -ENOMEM;
}
offload->enable = 1;
+ offload->extack = extack;
mqprio_qopt_reconstruct(dev, &offload->mqprio.qopt);
+ offload->mqprio.extack = extack;
taprio_sched_to_offload(dev, sched, offload, &caps);
for (tc = 0; tc < TC_MAX_QUEUE; tc++)
@@ -1528,14 +1530,20 @@ static int taprio_enable_offload(struct net_device *dev,
err = ops->ndo_setup_tc(dev, TC_SETUP_QDISC_TAPRIO, offload);
if (err < 0) {
- NL_SET_ERR_MSG(extack,
- "Device failed to setup taprio offload");
+ NL_SET_ERR_MSG_WEAK(extack,
+ "Device failed to setup taprio offload");
goto done;
}
q->offloaded = true;
done:
+ /* The offload structure may linger around via a reference taken by the
+ * device driver, so clear up the netlink extack pointer so that the
+ * driver isn't tempted to dereference data which stopped being valid
+ */
+ offload->extack = NULL;
+ offload->mqprio.extack = NULL;
taprio_offload_free(offload);
return err;
--
2.34.1
IEEE 802.1Q-2018 clause 6.7.2 Frame preemption specifies that each
packet priority can be assigned to a "frame preemption status" value of
either "express" or "preemptible". Express priorities are transmitted by
the local device through the eMAC, and preemptible priorities through
the pMAC (the concepts of eMAC and pMAC come from the 802.3 MAC Merge
layer).
The FP adminStatus is defined per packet priority, but 802.1Q clause
12.30.1.1.1 framePreemptionAdminStatus also says that:
| Priorities that all map to the same traffic class should be
| constrained to use the same value of preemption status.
It is impossible to ignore the cognitive dissonance in the standard
here, because it practically means that the FP adminStatus only takes
distinct values per traffic class, even though it is defined per
priority.
I can see no valid use case which is prevented by having the kernel take
the FP adminStatus as input per traffic class (what we do here).
In addition, this also enforces the above constraint by construction.
User space network managers which wish to expose FP adminStatus per
priority are free to do so; they must only observe the prio_tc_map of
the netdev (which presumably is also under their control, when
constructing the mqprio netlink attributes).
The reason for configuring frame preemption as a property of the Qdisc
layer is that the information about "preemptible TCs" is closest to the
place which handles the num_tc and prio_tc_map of the netdev. If the
UAPI would have been any other layer, it would be unclear what to do
with the FP information when num_tc collapses to 0. A key assumption is
that only mqprio/taprio change the num_tc and prio_tc_map of the netdev.
Not sure if that's a great assumption to make.
Having FP in tc-mqprio can be seen as an implementation of the use case
defined in 802.1Q Annex S.2 "Preemption used in isolation". There will
be a separate implementation of FP in tc-taprio, for the other use
cases.
Signed-off-by: Vladimir Oltean <[email protected]>
Reviewed-by: Ferenc Fejes <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
---
v3->v4: none
v2->v3: none
v1->v2:
- slightly reword commit message
- move #include <linux/ethtool_netlink.h> to this patch
- remove self-evident comment "only for dump and offloading"
include/net/pkt_sched.h | 1 +
include/uapi/linux/pkt_sched.h | 16 +++++
net/sched/sch_mqprio.c | 127 ++++++++++++++++++++++++++++++++-
net/sched/sch_mqprio_lib.c | 14 ++++
net/sched/sch_mqprio_lib.h | 2 +
5 files changed, 159 insertions(+), 1 deletion(-)
diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
index b43ed4733455..f436688b6efc 100644
--- a/include/net/pkt_sched.h
+++ b/include/net/pkt_sched.h
@@ -172,6 +172,7 @@ struct tc_mqprio_qopt_offload {
u32 flags;
u64 min_rate[TC_QOPT_MAX_QUEUE];
u64 max_rate[TC_QOPT_MAX_QUEUE];
+ unsigned long preemptible_tcs;
};
struct tc_taprio_caps {
diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index 000eec106856..b8d29be91b62 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -719,6 +719,11 @@ enum {
#define __TC_MQPRIO_SHAPER_MAX (__TC_MQPRIO_SHAPER_MAX - 1)
+enum {
+ TC_FP_EXPRESS = 1,
+ TC_FP_PREEMPTIBLE = 2,
+};
+
struct tc_mqprio_qopt {
__u8 num_tc;
__u8 prio_tc_map[TC_QOPT_BITMASK + 1];
@@ -732,12 +737,23 @@ struct tc_mqprio_qopt {
#define TC_MQPRIO_F_MIN_RATE 0x4
#define TC_MQPRIO_F_MAX_RATE 0x8
+enum {
+ TCA_MQPRIO_TC_ENTRY_UNSPEC,
+ TCA_MQPRIO_TC_ENTRY_INDEX, /* u32 */
+ TCA_MQPRIO_TC_ENTRY_FP, /* u32 */
+
+ /* add new constants above here */
+ __TCA_MQPRIO_TC_ENTRY_CNT,
+ TCA_MQPRIO_TC_ENTRY_MAX = (__TCA_MQPRIO_TC_ENTRY_CNT - 1)
+};
+
enum {
TCA_MQPRIO_UNSPEC,
TCA_MQPRIO_MODE,
TCA_MQPRIO_SHAPER,
TCA_MQPRIO_MIN_RATE64,
TCA_MQPRIO_MAX_RATE64,
+ TCA_MQPRIO_TC_ENTRY,
__TCA_MQPRIO_MAX,
};
diff --git a/net/sched/sch_mqprio.c b/net/sched/sch_mqprio.c
index 5287ff60b3f9..bc158a7fd6ba 100644
--- a/net/sched/sch_mqprio.c
+++ b/net/sched/sch_mqprio.c
@@ -5,6 +5,7 @@
* Copyright (c) 2010 John Fastabend <[email protected]>
*/
+#include <linux/ethtool_netlink.h>
#include <linux/types.h>
#include <linux/slab.h>
#include <linux/kernel.h>
@@ -27,6 +28,7 @@ struct mqprio_sched {
u32 flags;
u64 min_rate[TC_QOPT_MAX_QUEUE];
u64 max_rate[TC_QOPT_MAX_QUEUE];
+ u32 fp[TC_QOPT_MAX_QUEUE];
};
static int mqprio_enable_offload(struct Qdisc *sch,
@@ -63,6 +65,8 @@ static int mqprio_enable_offload(struct Qdisc *sch,
return -EINVAL;
}
+ mqprio_fp_to_offload(priv->fp, &mqprio);
+
err = dev->netdev_ops->ndo_setup_tc(dev, TC_SETUP_QDISC_MQPRIO,
&mqprio);
if (err)
@@ -145,13 +149,94 @@ static int mqprio_parse_opt(struct net_device *dev, struct tc_mqprio_qopt *qopt,
return 0;
}
+static const struct
+nla_policy mqprio_tc_entry_policy[TCA_MQPRIO_TC_ENTRY_MAX + 1] = {
+ [TCA_MQPRIO_TC_ENTRY_INDEX] = NLA_POLICY_MAX(NLA_U32,
+ TC_QOPT_MAX_QUEUE),
+ [TCA_MQPRIO_TC_ENTRY_FP] = NLA_POLICY_RANGE(NLA_U32,
+ TC_FP_EXPRESS,
+ TC_FP_PREEMPTIBLE),
+};
+
static const struct nla_policy mqprio_policy[TCA_MQPRIO_MAX + 1] = {
[TCA_MQPRIO_MODE] = { .len = sizeof(u16) },
[TCA_MQPRIO_SHAPER] = { .len = sizeof(u16) },
[TCA_MQPRIO_MIN_RATE64] = { .type = NLA_NESTED },
[TCA_MQPRIO_MAX_RATE64] = { .type = NLA_NESTED },
+ [TCA_MQPRIO_TC_ENTRY] = { .type = NLA_NESTED },
};
+static int mqprio_parse_tc_entry(u32 fp[TC_QOPT_MAX_QUEUE],
+ struct nlattr *opt,
+ unsigned long *seen_tcs,
+ struct netlink_ext_ack *extack)
+{
+ struct nlattr *tb[TCA_MQPRIO_TC_ENTRY_MAX + 1] = { };
+ int err, tc;
+
+ err = nla_parse_nested(tb, TCA_MQPRIO_TC_ENTRY_MAX, opt,
+ mqprio_tc_entry_policy, extack);
+ if (err < 0)
+ return err;
+
+ if (!tb[TCA_MQPRIO_TC_ENTRY_INDEX]) {
+ NL_SET_ERR_MSG(extack, "TC entry index missing");
+ return -EINVAL;
+ }
+
+ tc = nla_get_u32(tb[TCA_MQPRIO_TC_ENTRY_INDEX]);
+ if (*seen_tcs & BIT(tc)) {
+ NL_SET_ERR_MSG(extack, "Duplicate tc entry");
+ return -EINVAL;
+ }
+
+ *seen_tcs |= BIT(tc);
+
+ if (tb[TCA_MQPRIO_TC_ENTRY_FP])
+ fp[tc] = nla_get_u32(tb[TCA_MQPRIO_TC_ENTRY_FP]);
+
+ return 0;
+}
+
+static int mqprio_parse_tc_entries(struct Qdisc *sch, struct nlattr *nlattr_opt,
+ int nlattr_opt_len,
+ struct netlink_ext_ack *extack)
+{
+ struct mqprio_sched *priv = qdisc_priv(sch);
+ struct net_device *dev = qdisc_dev(sch);
+ bool have_preemption = false;
+ unsigned long seen_tcs = 0;
+ u32 fp[TC_QOPT_MAX_QUEUE];
+ struct nlattr *n;
+ int tc, rem;
+ int err = 0;
+
+ for (tc = 0; tc < TC_QOPT_MAX_QUEUE; tc++)
+ fp[tc] = priv->fp[tc];
+
+ nla_for_each_attr(n, nlattr_opt, nlattr_opt_len, rem) {
+ if (nla_type(n) != TCA_MQPRIO_TC_ENTRY)
+ continue;
+
+ err = mqprio_parse_tc_entry(fp, n, &seen_tcs, extack);
+ if (err)
+ goto out;
+ }
+
+ for (tc = 0; tc < TC_QOPT_MAX_QUEUE; tc++) {
+ priv->fp[tc] = fp[tc];
+ if (fp[tc] == TC_FP_PREEMPTIBLE)
+ have_preemption = true;
+ }
+
+ if (have_preemption && !ethtool_dev_mm_supported(dev)) {
+ NL_SET_ERR_MSG(extack, "Device does not support preemption");
+ return -EOPNOTSUPP;
+ }
+out:
+ return err;
+}
+
/* Parse the other netlink attributes that represent the payload of
* TCA_OPTIONS, which are appended right after struct tc_mqprio_qopt.
*/
@@ -234,6 +319,13 @@ static int mqprio_parse_nlattr(struct Qdisc *sch, struct tc_mqprio_qopt *qopt,
priv->flags |= TC_MQPRIO_F_MAX_RATE;
}
+ if (tb[TCA_MQPRIO_TC_ENTRY]) {
+ err = mqprio_parse_tc_entries(sch, nlattr_opt, nlattr_opt_len,
+ extack);
+ if (err)
+ return err;
+ }
+
return 0;
}
@@ -247,7 +339,7 @@ static int mqprio_init(struct Qdisc *sch, struct nlattr *opt,
int i, err = -EOPNOTSUPP;
struct tc_mqprio_qopt *qopt = NULL;
struct tc_mqprio_caps caps;
- int len;
+ int len, tc;
BUILD_BUG_ON(TC_MAX_QUEUE != TC_QOPT_MAX_QUEUE);
BUILD_BUG_ON(TC_BITMASK != TC_QOPT_BITMASK);
@@ -265,6 +357,9 @@ static int mqprio_init(struct Qdisc *sch, struct nlattr *opt,
if (!opt || nla_len(opt) < sizeof(*qopt))
return -EINVAL;
+ for (tc = 0; tc < TC_QOPT_MAX_QUEUE; tc++)
+ priv->fp[tc] = TC_FP_EXPRESS;
+
qdisc_offload_query_caps(dev, TC_SETUP_QDISC_MQPRIO,
&caps, sizeof(caps));
@@ -415,6 +510,33 @@ static int dump_rates(struct mqprio_sched *priv,
return -1;
}
+static int mqprio_dump_tc_entries(struct mqprio_sched *priv,
+ struct sk_buff *skb)
+{
+ struct nlattr *n;
+ int tc;
+
+ for (tc = 0; tc < TC_QOPT_MAX_QUEUE; tc++) {
+ n = nla_nest_start(skb, TCA_MQPRIO_TC_ENTRY);
+ if (!n)
+ return -EMSGSIZE;
+
+ if (nla_put_u32(skb, TCA_MQPRIO_TC_ENTRY_INDEX, tc))
+ goto nla_put_failure;
+
+ if (nla_put_u32(skb, TCA_MQPRIO_TC_ENTRY_FP, priv->fp[tc]))
+ goto nla_put_failure;
+
+ nla_nest_end(skb, n);
+ }
+
+ return 0;
+
+nla_put_failure:
+ nla_nest_cancel(skb, n);
+ return -EMSGSIZE;
+}
+
static int mqprio_dump(struct Qdisc *sch, struct sk_buff *skb)
{
struct net_device *dev = qdisc_dev(sch);
@@ -465,6 +587,9 @@ static int mqprio_dump(struct Qdisc *sch, struct sk_buff *skb)
(dump_rates(priv, &opt, skb) != 0))
goto nla_put_failure;
+ if (mqprio_dump_tc_entries(priv, skb))
+ goto nla_put_failure;
+
return nla_nest_end(skb, nla);
nla_put_failure:
nlmsg_trim(skb, nla);
diff --git a/net/sched/sch_mqprio_lib.c b/net/sched/sch_mqprio_lib.c
index c58a533b8ec5..83b3793c4012 100644
--- a/net/sched/sch_mqprio_lib.c
+++ b/net/sched/sch_mqprio_lib.c
@@ -114,4 +114,18 @@ void mqprio_qopt_reconstruct(struct net_device *dev, struct tc_mqprio_qopt *qopt
}
EXPORT_SYMBOL_GPL(mqprio_qopt_reconstruct);
+void mqprio_fp_to_offload(u32 fp[TC_QOPT_MAX_QUEUE],
+ struct tc_mqprio_qopt_offload *mqprio)
+{
+ unsigned long preemptible_tcs = 0;
+ int tc;
+
+ for (tc = 0; tc < TC_QOPT_MAX_QUEUE; tc++)
+ if (fp[tc] == TC_FP_PREEMPTIBLE)
+ preemptible_tcs |= BIT(tc);
+
+ mqprio->preemptible_tcs = preemptible_tcs;
+}
+EXPORT_SYMBOL_GPL(mqprio_fp_to_offload);
+
MODULE_LICENSE("GPL");
diff --git a/net/sched/sch_mqprio_lib.h b/net/sched/sch_mqprio_lib.h
index 63f725ab8761..079f597072e3 100644
--- a/net/sched/sch_mqprio_lib.h
+++ b/net/sched/sch_mqprio_lib.h
@@ -14,5 +14,7 @@ int mqprio_validate_qopt(struct net_device *dev, struct tc_mqprio_qopt *qopt,
struct netlink_ext_ack *extack);
void mqprio_qopt_reconstruct(struct net_device *dev,
struct tc_mqprio_qopt *qopt);
+void mqprio_fp_to_offload(u32 fp[TC_QOPT_MAX_QUEUE],
+ struct tc_mqprio_qopt_offload *mqprio);
#endif
--
2.34.1
Ferenc reports that a combination of poor iproute2 defaults and obscure
cases where the kernel returns -EINVAL make it difficult to understand
what is wrong with this command:
$ ip link add veth0 numtxqueues 8 numrxqueues 8 type veth peer name veth1
$ tc qdisc add dev veth0 root mqprio num_tc 8 map 0 1 2 3 4 5 6 7 \
queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7
RTNETLINK answers: Invalid argument
Hopefully with this patch, the cause is clearer:
Error: Device does not support hardware offload.
The kernel was (and still is) rejecting this because iproute2 defaults
to "hw 1" if this command line option is not specified.
Link: https://lore.kernel.org/netdev/[email protected]/
Signed-off-by: Vladimir Oltean <[email protected]>
Reviewed-by: Ferenc Fejes <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
---
v3->v4: none
v2->v3: change link from patchwork to lore
v1->v2: slightly reword last paragraph of commit message
net/sched/sch_mqprio.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/net/sched/sch_mqprio.c b/net/sched/sch_mqprio.c
index 5a9261c38b95..9ee5a9a9b9e9 100644
--- a/net/sched/sch_mqprio.c
+++ b/net/sched/sch_mqprio.c
@@ -133,8 +133,11 @@ static int mqprio_parse_opt(struct net_device *dev, struct tc_mqprio_qopt *qopt,
/* If ndo_setup_tc is not present then hardware doesn't support offload
* and we should return an error.
*/
- if (qopt->hw && !dev->netdev_ops->ndo_setup_tc)
+ if (qopt->hw && !dev->netdev_ops->ndo_setup_tc) {
+ NL_SET_ERR_MSG(extack,
+ "Device does not support hardware offload");
return -EINVAL;
+ }
return 0;
}
--
2.34.1
PFs which support the MAC Merge layer also have a set of 8 registers
called "Port traffic class N frame preemption register (PTC0FPR - PTC7FPR)".
Through these, a traffic class (group of TX rings of same dequeue
priority) can be mapped to the eMAC or to the pMAC.
There's nothing particularly spectacular here. We should probably only
commit the preemptible TCs to hardware once the MAC Merge layer became
active, but unlike Felix, we don't have an IRQ that notifies us of that.
We'd have to sleep for up to verifyTime (127 ms) to wait for a
resolution coming from the verification state machine; not only from the
ndo_setup_tc() code path, but also from enetc_mm_link_state_update().
Since it's relatively complicated and has a relatively small benefit,
I'm not doing it.
Signed-off-by: Vladimir Oltean <[email protected]>
Reviewed-by: Ferenc Fejes <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
---
v1->v4: none
drivers/net/ethernet/freescale/enetc/enetc.c | 22 +++++++++++++++++++
drivers/net/ethernet/freescale/enetc/enetc.h | 1 +
.../net/ethernet/freescale/enetc/enetc_hw.h | 4 ++++
3 files changed, 27 insertions(+)
diff --git a/drivers/net/ethernet/freescale/enetc/enetc.c b/drivers/net/ethernet/freescale/enetc/enetc.c
index e0207b01ddd6..41c194c1672d 100644
--- a/drivers/net/ethernet/freescale/enetc/enetc.c
+++ b/drivers/net/ethernet/freescale/enetc/enetc.c
@@ -25,6 +25,24 @@ void enetc_port_mac_wr(struct enetc_si *si, u32 reg, u32 val)
}
EXPORT_SYMBOL_GPL(enetc_port_mac_wr);
+void enetc_set_ptcfpr(struct enetc_hw *hw, unsigned long preemptible_tcs)
+{
+ u32 val;
+ int tc;
+
+ for (tc = 0; tc < 8; tc++) {
+ val = enetc_port_rd(hw, ENETC_PTCFPR(tc));
+
+ if (preemptible_tcs & BIT(tc))
+ val |= ENETC_PTCFPR_FPE;
+ else
+ val &= ~ENETC_PTCFPR_FPE;
+
+ enetc_port_wr(hw, ENETC_PTCFPR(tc), val);
+ }
+}
+EXPORT_SYMBOL_GPL(enetc_set_ptcfpr);
+
static int enetc_num_stack_tx_queues(struct enetc_ndev_priv *priv)
{
int num_tx_rings = priv->num_tx_rings;
@@ -2640,6 +2658,8 @@ static void enetc_reset_tc_mqprio(struct net_device *ndev)
}
enetc_debug_tx_ring_prios(priv);
+
+ enetc_set_ptcfpr(hw, 0);
}
int enetc_setup_tc_mqprio(struct net_device *ndev, void *type_data)
@@ -2694,6 +2714,8 @@ int enetc_setup_tc_mqprio(struct net_device *ndev, void *type_data)
enetc_debug_tx_ring_prios(priv);
+ enetc_set_ptcfpr(hw, mqprio->preemptible_tcs);
+
return 0;
err_reset_tc:
diff --git a/drivers/net/ethernet/freescale/enetc/enetc.h b/drivers/net/ethernet/freescale/enetc/enetc.h
index 8010f31cd10d..143078a9ef16 100644
--- a/drivers/net/ethernet/freescale/enetc/enetc.h
+++ b/drivers/net/ethernet/freescale/enetc/enetc.h
@@ -486,6 +486,7 @@ static inline void enetc_cbd_free_data_mem(struct enetc_si *si, int size,
void enetc_reset_ptcmsdur(struct enetc_hw *hw);
void enetc_set_ptcmsdur(struct enetc_hw *hw, u32 *queue_max_sdu);
+void enetc_set_ptcfpr(struct enetc_hw *hw, unsigned long preemptible_tcs);
#ifdef CONFIG_FSL_ENETC_QOS
int enetc_qos_query_caps(struct net_device *ndev, void *type_data);
diff --git a/drivers/net/ethernet/freescale/enetc/enetc_hw.h b/drivers/net/ethernet/freescale/enetc/enetc_hw.h
index de2e0ee8cdcb..36bb2d6d5658 100644
--- a/drivers/net/ethernet/freescale/enetc/enetc_hw.h
+++ b/drivers/net/ethernet/freescale/enetc/enetc_hw.h
@@ -965,6 +965,10 @@ static inline u32 enetc_usecs_to_cycles(u32 usecs)
return (u32)div_u64(usecs * ENETC_CLK, 1000000ULL);
}
+/* Port traffic class frame preemption register */
+#define ENETC_PTCFPR(n) (0x1910 + (n) * 4) /* n = [0 ..7] */
+#define ENETC_PTCFPR_FPE BIT(31)
+
/* port time gating control register */
#define ENETC_PTGCR 0x11a00
#define ENETC_PTGCR_TGE BIT(31)
--
2.34.1
This is a duplication of the FP adminStatus logic introduced for
tc-mqprio. Offloading is done through the tc_mqprio_qopt_offload
structure embedded within tc_taprio_qopt_offload. So practically, if a
device driver is written to treat the mqprio portion of taprio just like
standalone mqprio, it gets unified handling of frame preemption.
I would have reused more code with taprio, but this is mostly netlink
attribute parsing, which is hard to transform into generic code without
having something that stinks as a result. We have the same variables
with the same semantics, just different nlattr type values
(TCA_MQPRIO_TC_ENTRY=5 vs TCA_TAPRIO_ATTR_TC_ENTRY=12;
TCA_MQPRIO_TC_ENTRY_FP=2 vs TCA_TAPRIO_TC_ENTRY_FP=3, etc) and
consequently, different policies for the nest.
Every time nla_parse_nested() is called, an on-stack table "tb" of
nlattr pointers is allocated statically, up to the maximum understood
nlattr type. That array size is hardcoded as a constant, but when
transforming this into a common parsing function, it would become either
a VLA (which the Linux kernel rightfully doesn't like) or a call to the
allocator.
Having FP adminStatus in tc-taprio can be seen as addressing the 802.1Q
Annex S.3 "Scheduling and preemption used in combination, no HOLD/RELEASE"
and S.4 "Scheduling and preemption used in combination with HOLD/RELEASE"
use cases. HOLD and RELEASE events are emitted towards the underlying
MAC Merge layer when the schedule hits a Set-And-Hold-MAC or a
Set-And-Release-MAC gate operation. So within the tc-taprio UAPI space,
one can distinguish between the 2 use cases by choosing whether to use
the TC_TAPRIO_CMD_SET_AND_HOLD and TC_TAPRIO_CMD_SET_AND_RELEASE gate
operations within the schedule, or just TC_TAPRIO_CMD_SET_GATES.
A small part of the change is dedicated to refactoring the max_sdu
nlattr parsing to put all logic under the "if" that tests for presence
of that nlattr.
Signed-off-by: Vladimir Oltean <[email protected]>
Reviewed-by: Ferenc Fejes <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
---
v3->v4: none
v2->v3: none
v1->v2: slightly reword commit message
include/uapi/linux/pkt_sched.h | 1 +
net/sched/sch_taprio.c | 65 +++++++++++++++++++++++++++-------
2 files changed, 53 insertions(+), 13 deletions(-)
diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index b8d29be91b62..51a7addc56c6 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -1252,6 +1252,7 @@ enum {
TCA_TAPRIO_TC_ENTRY_UNSPEC,
TCA_TAPRIO_TC_ENTRY_INDEX, /* u32 */
TCA_TAPRIO_TC_ENTRY_MAX_SDU, /* u32 */
+ TCA_TAPRIO_TC_ENTRY_FP, /* u32 */
/* add new constants above here */
__TCA_TAPRIO_TC_ENTRY_CNT,
diff --git a/net/sched/sch_taprio.c b/net/sched/sch_taprio.c
index cbad43019172..76db9a10ef50 100644
--- a/net/sched/sch_taprio.c
+++ b/net/sched/sch_taprio.c
@@ -7,6 +7,7 @@
*/
#include <linux/ethtool.h>
+#include <linux/ethtool_netlink.h>
#include <linux/types.h>
#include <linux/slab.h>
#include <linux/kernel.h>
@@ -96,6 +97,7 @@ struct taprio_sched {
struct list_head taprio_list;
int cur_txq[TC_MAX_QUEUE];
u32 max_sdu[TC_MAX_QUEUE]; /* save info from the user */
+ u32 fp[TC_QOPT_MAX_QUEUE]; /* only for dump and offloading */
u32 txtime_delay;
};
@@ -1002,6 +1004,9 @@ static const struct nla_policy entry_policy[TCA_TAPRIO_SCHED_ENTRY_MAX + 1] = {
static const struct nla_policy taprio_tc_policy[TCA_TAPRIO_TC_ENTRY_MAX + 1] = {
[TCA_TAPRIO_TC_ENTRY_INDEX] = { .type = NLA_U32 },
[TCA_TAPRIO_TC_ENTRY_MAX_SDU] = { .type = NLA_U32 },
+ [TCA_TAPRIO_TC_ENTRY_FP] = NLA_POLICY_RANGE(NLA_U32,
+ TC_FP_EXPRESS,
+ TC_FP_PREEMPTIBLE),
};
static const struct nla_policy taprio_policy[TCA_TAPRIO_ATTR_MAX + 1] = {
@@ -1524,6 +1529,7 @@ static int taprio_enable_offload(struct net_device *dev,
mqprio_qopt_reconstruct(dev, &offload->mqprio.qopt);
offload->mqprio.extack = extack;
taprio_sched_to_offload(dev, sched, offload, &caps);
+ mqprio_fp_to_offload(q->fp, &offload->mqprio);
for (tc = 0; tc < TC_MAX_QUEUE; tc++)
offload->max_sdu[tc] = q->max_sdu[tc];
@@ -1671,13 +1677,14 @@ static int taprio_parse_clockid(struct Qdisc *sch, struct nlattr **tb,
static int taprio_parse_tc_entry(struct Qdisc *sch,
struct nlattr *opt,
u32 max_sdu[TC_QOPT_MAX_QUEUE],
+ u32 fp[TC_QOPT_MAX_QUEUE],
unsigned long *seen_tcs,
struct netlink_ext_ack *extack)
{
struct nlattr *tb[TCA_TAPRIO_TC_ENTRY_MAX + 1] = { };
struct net_device *dev = qdisc_dev(sch);
- u32 val = 0;
int err, tc;
+ u32 val;
err = nla_parse_nested(tb, TCA_TAPRIO_TC_ENTRY_MAX, opt,
taprio_tc_policy, extack);
@@ -1702,15 +1709,18 @@ static int taprio_parse_tc_entry(struct Qdisc *sch,
*seen_tcs |= BIT(tc);
- if (tb[TCA_TAPRIO_TC_ENTRY_MAX_SDU])
+ if (tb[TCA_TAPRIO_TC_ENTRY_MAX_SDU]) {
val = nla_get_u32(tb[TCA_TAPRIO_TC_ENTRY_MAX_SDU]);
+ if (val > dev->max_mtu) {
+ NL_SET_ERR_MSG_MOD(extack, "TC max SDU exceeds device max MTU");
+ return -ERANGE;
+ }
- if (val > dev->max_mtu) {
- NL_SET_ERR_MSG_MOD(extack, "TC max SDU exceeds device max MTU");
- return -ERANGE;
+ max_sdu[tc] = val;
}
- max_sdu[tc] = val;
+ if (tb[TCA_TAPRIO_TC_ENTRY_FP])
+ fp[tc] = nla_get_u32(tb[TCA_TAPRIO_TC_ENTRY_FP]);
return 0;
}
@@ -1720,29 +1730,51 @@ static int taprio_parse_tc_entries(struct Qdisc *sch,
struct netlink_ext_ack *extack)
{
struct taprio_sched *q = qdisc_priv(sch);
+ struct net_device *dev = qdisc_dev(sch);
u32 max_sdu[TC_QOPT_MAX_QUEUE];
+ bool have_preemption = false;
unsigned long seen_tcs = 0;
+ u32 fp[TC_QOPT_MAX_QUEUE];
struct nlattr *n;
int tc, rem;
int err = 0;
- for (tc = 0; tc < TC_QOPT_MAX_QUEUE; tc++)
+ for (tc = 0; tc < TC_QOPT_MAX_QUEUE; tc++) {
max_sdu[tc] = q->max_sdu[tc];
+ fp[tc] = q->fp[tc];
+ }
nla_for_each_nested(n, opt, rem) {
if (nla_type(n) != TCA_TAPRIO_ATTR_TC_ENTRY)
continue;
- err = taprio_parse_tc_entry(sch, n, max_sdu, &seen_tcs,
+ err = taprio_parse_tc_entry(sch, n, max_sdu, fp, &seen_tcs,
extack);
if (err)
- goto out;
+ return err;
}
- for (tc = 0; tc < TC_QOPT_MAX_QUEUE; tc++)
+ for (tc = 0; tc < TC_QOPT_MAX_QUEUE; tc++) {
q->max_sdu[tc] = max_sdu[tc];
+ q->fp[tc] = fp[tc];
+ if (fp[tc] != TC_FP_EXPRESS)
+ have_preemption = true;
+ }
+
+ if (have_preemption) {
+ if (!FULL_OFFLOAD_IS_ENABLED(q->flags)) {
+ NL_SET_ERR_MSG(extack,
+ "Preemption only supported with full offload");
+ return -EOPNOTSUPP;
+ }
+
+ if (!ethtool_dev_mm_supported(dev)) {
+ NL_SET_ERR_MSG(extack,
+ "Device does not support preemption");
+ return -EOPNOTSUPP;
+ }
+ }
-out:
return err;
}
@@ -2023,7 +2055,7 @@ static int taprio_init(struct Qdisc *sch, struct nlattr *opt,
{
struct taprio_sched *q = qdisc_priv(sch);
struct net_device *dev = qdisc_dev(sch);
- int i;
+ int i, tc;
spin_lock_init(&q->current_entry_lock);
@@ -2080,6 +2112,9 @@ static int taprio_init(struct Qdisc *sch, struct nlattr *opt,
q->qdiscs[i] = qdisc;
}
+ for (tc = 0; tc < TC_QOPT_MAX_QUEUE; tc++)
+ q->fp[tc] = TC_FP_EXPRESS;
+
taprio_detect_broken_mqprio(q);
return taprio_change(sch, opt, extack);
@@ -2223,6 +2258,7 @@ static int dump_schedule(struct sk_buff *msg,
}
static int taprio_dump_tc_entries(struct sk_buff *skb,
+ struct taprio_sched *q,
struct sched_gate_list *sched)
{
struct nlattr *n;
@@ -2240,6 +2276,9 @@ static int taprio_dump_tc_entries(struct sk_buff *skb,
sched->max_sdu[tc]))
goto nla_put_failure;
+ if (nla_put_u32(skb, TCA_TAPRIO_TC_ENTRY_FP, q->fp[tc]))
+ goto nla_put_failure;
+
nla_nest_end(skb, n);
}
@@ -2281,7 +2320,7 @@ static int taprio_dump(struct Qdisc *sch, struct sk_buff *skb)
nla_put_u32(skb, TCA_TAPRIO_ATTR_TXTIME_DELAY, q->txtime_delay))
goto options_error;
- if (oper && taprio_dump_tc_entries(skb, oper))
+ if (oper && taprio_dump_tc_entries(skb, q, oper))
goto options_error;
if (oper && dump_schedule(skb, oper))
--
2.34.1
To gain access to the larger encapsulating structure which has the type
tc_mqprio_qopt_offload, rename just the "qopt" field as "qopt".
Signed-off-by: Vladimir Oltean <[email protected]>
Reviewed-by: Ferenc Fejes <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
---
v1->v4: none
drivers/net/ethernet/freescale/enetc/enetc.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/drivers/net/ethernet/freescale/enetc/enetc.c b/drivers/net/ethernet/freescale/enetc/enetc.c
index 2fc712b24d12..e0207b01ddd6 100644
--- a/drivers/net/ethernet/freescale/enetc/enetc.c
+++ b/drivers/net/ethernet/freescale/enetc/enetc.c
@@ -2644,12 +2644,13 @@ static void enetc_reset_tc_mqprio(struct net_device *ndev)
int enetc_setup_tc_mqprio(struct net_device *ndev, void *type_data)
{
+ struct tc_mqprio_qopt_offload *mqprio = type_data;
struct enetc_ndev_priv *priv = netdev_priv(ndev);
- struct tc_mqprio_qopt *mqprio = type_data;
+ struct tc_mqprio_qopt *qopt = &mqprio->qopt;
struct enetc_hw *hw = &priv->si->hw;
int num_stack_tx_queues = 0;
- u8 num_tc = mqprio->num_tc;
struct enetc_bdr *tx_ring;
+ u8 num_tc = qopt->num_tc;
int offset, count;
int err, tc, q;
@@ -2663,8 +2664,8 @@ int enetc_setup_tc_mqprio(struct net_device *ndev, void *type_data)
return err;
for (tc = 0; tc < num_tc; tc++) {
- offset = mqprio->offset[tc];
- count = mqprio->count[tc];
+ offset = qopt->offset[tc];
+ count = qopt->count[tc];
num_stack_tx_queues += count;
err = netdev_set_tc_queue(ndev, tc, count, offset);
--
2.34.1
On Mon, Apr 03, 2023 at 01:34:31PM +0300, Vladimir Oltean wrote:
> This series proposes that we use the Qdisc layer, through separate
> (albeit very similar) UAPI in mqprio and taprio, and that both these
> Qdiscs pass the information down to the offloading device driver through
> the common mqprio offload structure (which taprio also passes).
For those interested, the iproute2 companion patch set is available here:
https://patchwork.kernel.org/project/netdevbpf/cover/[email protected]/
On Mon, Apr 03, 2023 at 01:34:31PM +0300, Vladimir Oltean wrote:
> This series proposes that we use the Qdisc layer, through separate
> (albeit very similar) UAPI in mqprio and taprio, and that both these
> Qdiscs pass the information down to the offloading device driver through
> the common mqprio offload structure (which taprio also passes).
>
> An implementation is provided for the NXP LS1028A on-board Ethernet
> endpoint (enetc). Previous versions also contained support for its
> embedded switch (felix), but this needs more work and will be submitted
> separately.
+Claudiu. Sorry, it wasn't intentional. I removed the DSA maintainers
and the Felix driver maintainers, forgetting that Claudiu is a maintainer
for both Felix and ENETC, and thus, his refcount should stay 1 :)
On another note, this patch set just got superseded in patchwork:
https://patchwork.kernel.org/project/netdevbpf/cover/[email protected]/
after I submitted an iproute2 patch set with the same name:
https://patchwork.kernel.org/project/netdevbpf/cover/[email protected]/
I think there's a namespacing problem in patchwork's series detection
algorithm ("net-next" is not "iproute2-next", and so, it is valid to
have both in flight) but I don't know where to look to fix that.
Jakub, could you perhaps help, please?
On Mon, 3 Apr 2023 14:04:58 +0300 Vladimir Oltean wrote:
> On another note, this patch set just got superseded in patchwork:
> https://patchwork.kernel.org/project/netdevbpf/cover/[email protected]/
> after I submitted an iproute2 patch set with the same name:
> https://patchwork.kernel.org/project/netdevbpf/cover/[email protected]/
>
> I think there's a namespacing problem in patchwork's series detection
> algorithm ("net-next" is not "iproute2-next", and so, it is valid to
> have both in flight) but I don't know where to look to fix that.
> Jakub, could you perhaps help, please?
I revived the series. I'm a bit weary about asking Konstantin to make
the pw-bot compare tree tags because people change trees all the time
(especially no tree -> net-next / net) and he would have to filter out
the version.. It's gonna get wobbly. Let's see if the problem gets more
common.
On Mon, Apr 03, 2023 at 02:32:29PM -0700, Jakub Kicinski wrote:
> On Mon, 3 Apr 2023 14:04:58 +0300 Vladimir Oltean wrote:
> > On another note, this patch set just got superseded in patchwork:
> > https://patchwork.kernel.org/project/netdevbpf/cover/[email protected]/
> > after I submitted an iproute2 patch set with the same name:
> > https://patchwork.kernel.org/project/netdevbpf/cover/[email protected]/
> >
> > I think there's a namespacing problem in patchwork's series detection
> > algorithm ("net-next" is not "iproute2-next", and so, it is valid to
> > have both in flight) but I don't know where to look to fix that.
> > Jakub, could you perhaps help, please?
>
> I revived the series. I'm a bit weary about asking Konstantin to make
> the pw-bot compare tree tags because people change trees all the time
> (especially no tree -> net-next / net) and he would have to filter out
> the version.. It's gonna get wobbly. Let's see if the problem gets more
> common.
Thanks. Was it supposed to change state? Because it's still "superseded".
Let's wait for a few more days before merging, anyway, just in case there
are any other comments from the more TSN-oriented folks. I'm still not
completely happy with the UAPI duplication in 2 qdiscs, and no indication
that the duplication would stop at 2. For example, if I understand tc-etf
correctly, it would be possible to see bands as traffic classes, and so,
talk about preemptible bands and end up adding UAPI for those too.
On Tue, 4 Apr 2023 02:43:39 +0300 Vladimir Oltean wrote:
> > I revived the series. I'm a bit weary about asking Konstantin to make
> > the pw-bot compare tree tags because people change trees all the time
> > (especially no tree -> net-next / net) and he would have to filter out
> > the version.. It's gonna get wobbly. Let's see if the problem gets more
> > common.
>
> Thanks. Was it supposed to change state? Because it's still "superseded".
Argh, the bot keeps rescanning and re-marking it as Superseded :(
We have a backup patch tracking method of ... what's unread in
my inbox. So we should be able to rely on that.
On Mon, 3 Apr 2023 13:34:37 +0300 Vladimir Oltean wrote:
> +static int mqprio_parse_tc_entry(u32 fp[TC_QOPT_MAX_QUEUE],
> + struct nlattr *opt,
> + unsigned long *seen_tcs,
> + struct netlink_ext_ack *extack)
> +{
> + struct nlattr *tb[TCA_MQPRIO_TC_ENTRY_MAX + 1] = { };
nit: no need to clear it nla_parse*() zeros the memory
> + int err, tc;
> +
> + err = nla_parse_nested(tb, TCA_MQPRIO_TC_ENTRY_MAX, opt,
> + mqprio_tc_entry_policy, extack);
> + if (err < 0)
> + return err;
> +
> + if (!tb[TCA_MQPRIO_TC_ENTRY_INDEX]) {
> + NL_SET_ERR_MSG(extack, "TC entry index missing");
Are you not using NL_REQ_ATTR_CHECK() because iproute can't actually
parse the result? :(
> + return -EINVAL;
> + }
> +
> + tc = nla_get_u32(tb[TCA_MQPRIO_TC_ENTRY_INDEX]);
> + if (*seen_tcs & BIT(tc)) {
> + NL_SET_ERR_MSG(extack, "Duplicate tc entry");
set attr in extack?
minor heads up - I'll take the trivial cleanup patch from Pedro
so make sure you rebase:
https://lore.kernel.org/all/[email protected]/
On Mon, Apr 3, 2023 at 6:35 AM Vladimir Oltean <[email protected]> wrote:
>
> In commit 4e8b86c06269 ("mqprio: Introduce new hardware offload mode and
> shaper in mqprio"), the TCA_OPTIONS format of mqprio was extended to
> contain a fixed portion (of size NLA_ALIGN(sizeof struct tc_mqprio_qopt))
> and a variable portion of other nlattrs (in the TCA_MQPRIO_* type space)
> following immediately afterwards.
>
> In commit feb2cf3dcfb9 ("net/sched: mqprio: refactor nlattr parsing to a
> separate function"), we've moved the nlattr handling to a smaller
> function, but yet, a small parse_attr() still remains, and the larger
> mqprio_parse_nlattr() still does not have access to the beginning, and
> the length, of the TCA_OPTIONS region containing these other nlattrs.
>
> In a future change, the mqprio qdisc will need to iterate through this
> nlattr region to discover other attributes, so eliminate parse_attr()
> and add 2 variables in mqprio_parse_nlattr() which hold the beginning
> and the length of the nlattr range.
>
> We avoid the need to memset when nlattr_opt_len has insufficient length
> by pre-initializing the table "tb".
>
> Signed-off-by: Vladimir Oltean <[email protected]>
> Reviewed-by: Ferenc Fejes <[email protected]>
> Reviewed-by: Simon Horman <[email protected]>
Acked-by: Jamal Hadi Salim <[email protected]>
cheers,
jamal
> ---
> v1->v4: none
>
> net/sched/sch_mqprio.c | 32 +++++++++++++-------------------
> 1 file changed, 13 insertions(+), 19 deletions(-)
>
> diff --git a/net/sched/sch_mqprio.c b/net/sched/sch_mqprio.c
> index 48ed87b91086..94093971da5e 100644
> --- a/net/sched/sch_mqprio.c
> +++ b/net/sched/sch_mqprio.c
> @@ -146,32 +146,26 @@ static const struct nla_policy mqprio_policy[TCA_MQPRIO_MAX + 1] = {
> [TCA_MQPRIO_MAX_RATE64] = { .type = NLA_NESTED },
> };
>
> -static int parse_attr(struct nlattr *tb[], int maxtype, struct nlattr *nla,
> - const struct nla_policy *policy, int len)
> -{
> - int nested_len = nla_len(nla) - NLA_ALIGN(len);
> -
> - if (nested_len >= nla_attr_size(0))
> - return nla_parse_deprecated(tb, maxtype,
> - nla_data(nla) + NLA_ALIGN(len),
> - nested_len, policy, NULL);
> -
> - memset(tb, 0, sizeof(struct nlattr *) * (maxtype + 1));
> - return 0;
> -}
> -
> +/* Parse the other netlink attributes that represent the payload of
> + * TCA_OPTIONS, which are appended right after struct tc_mqprio_qopt.
> + */
> static int mqprio_parse_nlattr(struct Qdisc *sch, struct tc_mqprio_qopt *qopt,
> struct nlattr *opt)
> {
> + struct nlattr *nlattr_opt = nla_data(opt) + NLA_ALIGN(sizeof(*qopt));
> + int nlattr_opt_len = nla_len(opt) - NLA_ALIGN(sizeof(*qopt));
> struct mqprio_sched *priv = qdisc_priv(sch);
> - struct nlattr *tb[TCA_MQPRIO_MAX + 1];
> + struct nlattr *tb[TCA_MQPRIO_MAX + 1] = {};
> struct nlattr *attr;
> int i, rem, err;
>
> - err = parse_attr(tb, TCA_MQPRIO_MAX, opt, mqprio_policy,
> - sizeof(*qopt));
> - if (err < 0)
> - return err;
> + if (nlattr_opt_len >= nla_attr_size(0)) {
> + err = nla_parse_deprecated(tb, TCA_MQPRIO_MAX, nlattr_opt,
> + nlattr_opt_len, mqprio_policy,
> + NULL);
> + if (err < 0)
> + return err;
> + }
>
> if (!qopt->hw)
> return -EINVAL;
> --
> 2.34.1
>
On Mon, Apr 3, 2023 at 6:35 AM Vladimir Oltean <[email protected]> wrote:
>
> Netlink attribute parsing in mqprio is a minesweeper game, with many
> options having the possibility of being passed incorrectly and the user
> being none the wiser.
>
> Try to make errors less sour by giving user space some information
> regarding what went wrong.
>
> Signed-off-by: Vladimir Oltean <[email protected]>
> Reviewed-by: Ferenc Fejes <[email protected]>
> Reviewed-by: Simon Horman <[email protected]>
Acked-by: Jamal Hadi Salim <[email protected]>
cheers,
jamal
> ---
> v1->v4: none
>
> net/sched/sch_mqprio.c | 30 +++++++++++++++++++++++-------
> 1 file changed, 23 insertions(+), 7 deletions(-)
>
> diff --git a/net/sched/sch_mqprio.c b/net/sched/sch_mqprio.c
> index 94093971da5e..5a9261c38b95 100644
> --- a/net/sched/sch_mqprio.c
> +++ b/net/sched/sch_mqprio.c
> @@ -150,7 +150,8 @@ static const struct nla_policy mqprio_policy[TCA_MQPRIO_MAX + 1] = {
> * TCA_OPTIONS, which are appended right after struct tc_mqprio_qopt.
> */
> static int mqprio_parse_nlattr(struct Qdisc *sch, struct tc_mqprio_qopt *qopt,
> - struct nlattr *opt)
> + struct nlattr *opt,
> + struct netlink_ext_ack *extack)
> {
> struct nlattr *nlattr_opt = nla_data(opt) + NLA_ALIGN(sizeof(*qopt));
> int nlattr_opt_len = nla_len(opt) - NLA_ALIGN(sizeof(*qopt));
> @@ -167,8 +168,11 @@ static int mqprio_parse_nlattr(struct Qdisc *sch, struct tc_mqprio_qopt *qopt,
> return err;
> }
>
> - if (!qopt->hw)
> + if (!qopt->hw) {
> + NL_SET_ERR_MSG(extack,
> + "mqprio TCA_OPTIONS can only contain netlink attributes in hardware mode");
> return -EINVAL;
> + }
>
> if (tb[TCA_MQPRIO_MODE]) {
> priv->flags |= TC_MQPRIO_F_MODE;
> @@ -181,13 +185,19 @@ static int mqprio_parse_nlattr(struct Qdisc *sch, struct tc_mqprio_qopt *qopt,
> }
>
> if (tb[TCA_MQPRIO_MIN_RATE64]) {
> - if (priv->shaper != TC_MQPRIO_SHAPER_BW_RATE)
> + if (priv->shaper != TC_MQPRIO_SHAPER_BW_RATE) {
> + NL_SET_ERR_MSG_ATTR(extack, tb[TCA_MQPRIO_MIN_RATE64],
> + "min_rate accepted only when shaper is in bw_rlimit mode");
> return -EINVAL;
> + }
> i = 0;
> nla_for_each_nested(attr, tb[TCA_MQPRIO_MIN_RATE64],
> rem) {
> - if (nla_type(attr) != TCA_MQPRIO_MIN_RATE64)
> + if (nla_type(attr) != TCA_MQPRIO_MIN_RATE64) {
> + NL_SET_ERR_MSG_ATTR(extack, attr,
> + "Attribute type expected to be TCA_MQPRIO_MIN_RATE64");
> return -EINVAL;
> + }
> if (i >= qopt->num_tc)
> break;
> priv->min_rate[i] = *(u64 *)nla_data(attr);
> @@ -197,13 +207,19 @@ static int mqprio_parse_nlattr(struct Qdisc *sch, struct tc_mqprio_qopt *qopt,
> }
>
> if (tb[TCA_MQPRIO_MAX_RATE64]) {
> - if (priv->shaper != TC_MQPRIO_SHAPER_BW_RATE)
> + if (priv->shaper != TC_MQPRIO_SHAPER_BW_RATE) {
> + NL_SET_ERR_MSG_ATTR(extack, tb[TCA_MQPRIO_MAX_RATE64],
> + "max_rate accepted only when shaper is in bw_rlimit mode");
> return -EINVAL;
> + }
> i = 0;
> nla_for_each_nested(attr, tb[TCA_MQPRIO_MAX_RATE64],
> rem) {
> - if (nla_type(attr) != TCA_MQPRIO_MAX_RATE64)
> + if (nla_type(attr) != TCA_MQPRIO_MAX_RATE64) {
> + NL_SET_ERR_MSG_ATTR(extack, attr,
> + "Attribute type expected to be TCA_MQPRIO_MAX_RATE64");
> return -EINVAL;
> + }
> if (i >= qopt->num_tc)
> break;
> priv->max_rate[i] = *(u64 *)nla_data(attr);
> @@ -252,7 +268,7 @@ static int mqprio_init(struct Qdisc *sch, struct nlattr *opt,
>
> len = nla_len(opt) - NLA_ALIGN(sizeof(*qopt));
> if (len > 0) {
> - err = mqprio_parse_nlattr(sch, qopt, opt);
> + err = mqprio_parse_nlattr(sch, qopt, opt, extack);
> if (err)
> return err;
> }
> --
> 2.34.1
>
On Mon, Apr 3, 2023 at 6:35 AM Vladimir Oltean <[email protected]> wrote:
>
> With the multiplexed ndo_setup_tc() model which lacks a first-class
> struct netlink_ext_ack * argument, the only way to pass the netlink
> extended ACK message down to the device driver is to embed it within the
> offload structure.
>
> Do this for struct tc_mqprio_qopt_offload and struct tc_taprio_qopt_offload.
>
> Since struct tc_taprio_qopt_offload also contains a tc_mqprio_qopt_offload
> structure, and since device drivers might effectively reuse their mqprio
> implementation for the mqprio portion of taprio, we make taprio set the
> extack in both offload structures to point at the same netlink extack
> message.
>
> In fact, the taprio handling is a bit more tricky, for 2 reasons.
>
> First is because the offload structure has a longer lifetime than the
> extack structure. The driver is supposed to populate the extack
> synchronously from ndo_setup_tc() and leave it alone afterwards.
> To not have any use-after-free surprises, we zero out the extack pointer
> when we leave taprio_enable_offload().
>
> The second reason is because taprio does overwrite the extack message on
> ndo_setup_tc() error. We need to switch to the weak form of setting an
> extack message, which preserves a potential message set by the driver.
>
> Signed-off-by: Vladimir Oltean <[email protected]>
> Reviewed-by: Simon Horman <[email protected]>
Acked-by: Jamal Hadi Salim <[email protected]>
cheers,
jamal
> ---
> v3->v4: none
> v2->v3: patch is new
>
> include/net/pkt_sched.h | 2 ++
> net/sched/sch_mqprio.c | 5 ++++-
> net/sched/sch_taprio.c | 12 ++++++++++--
> 3 files changed, 16 insertions(+), 3 deletions(-)
>
> diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
> index bb0bd69fb655..b43ed4733455 100644
> --- a/include/net/pkt_sched.h
> +++ b/include/net/pkt_sched.h
> @@ -166,6 +166,7 @@ struct tc_mqprio_caps {
> struct tc_mqprio_qopt_offload {
> /* struct tc_mqprio_qopt must always be the first element */
> struct tc_mqprio_qopt qopt;
> + struct netlink_ext_ack *extack;
> u16 mode;
> u16 shaper;
> u32 flags;
> @@ -193,6 +194,7 @@ struct tc_taprio_sched_entry {
>
> struct tc_taprio_qopt_offload {
> struct tc_mqprio_qopt_offload mqprio;
> + struct netlink_ext_ack *extack;
> u8 enable;
> ktime_t base_time;
> u64 cycle_time;
> diff --git a/net/sched/sch_mqprio.c b/net/sched/sch_mqprio.c
> index 9ee5a9a9b9e9..5287ff60b3f9 100644
> --- a/net/sched/sch_mqprio.c
> +++ b/net/sched/sch_mqprio.c
> @@ -33,9 +33,12 @@ static int mqprio_enable_offload(struct Qdisc *sch,
> const struct tc_mqprio_qopt *qopt,
> struct netlink_ext_ack *extack)
> {
> - struct tc_mqprio_qopt_offload mqprio = {.qopt = *qopt};
> struct mqprio_sched *priv = qdisc_priv(sch);
> struct net_device *dev = qdisc_dev(sch);
> + struct tc_mqprio_qopt_offload mqprio = {
> + .qopt = *qopt,
> + .extack = extack,
> + };
> int err, i;
>
> switch (priv->mode) {
> diff --git a/net/sched/sch_taprio.c b/net/sched/sch_taprio.c
> index 1f469861eae3..cbad43019172 100644
> --- a/net/sched/sch_taprio.c
> +++ b/net/sched/sch_taprio.c
> @@ -1520,7 +1520,9 @@ static int taprio_enable_offload(struct net_device *dev,
> return -ENOMEM;
> }
> offload->enable = 1;
> + offload->extack = extack;
> mqprio_qopt_reconstruct(dev, &offload->mqprio.qopt);
> + offload->mqprio.extack = extack;
> taprio_sched_to_offload(dev, sched, offload, &caps);
>
> for (tc = 0; tc < TC_MAX_QUEUE; tc++)
> @@ -1528,14 +1530,20 @@ static int taprio_enable_offload(struct net_device *dev,
>
> err = ops->ndo_setup_tc(dev, TC_SETUP_QDISC_TAPRIO, offload);
> if (err < 0) {
> - NL_SET_ERR_MSG(extack,
> - "Device failed to setup taprio offload");
> + NL_SET_ERR_MSG_WEAK(extack,
> + "Device failed to setup taprio offload");
> goto done;
> }
>
> q->offloaded = true;
>
> done:
> + /* The offload structure may linger around via a reference taken by the
> + * device driver, so clear up the netlink extack pointer so that the
> + * driver isn't tempted to dereference data which stopped being valid
> + */
> + offload->extack = NULL;
> + offload->mqprio.extack = NULL;
> taprio_offload_free(offload);
>
> return err;
> --
> 2.34.1
>
On Mon, Apr 3, 2023 at 6:35 AM Vladimir Oltean <[email protected]> wrote:
>
> Ferenc reports that a combination of poor iproute2 defaults and obscure
> cases where the kernel returns -EINVAL make it difficult to understand
> what is wrong with this command:
>
> $ ip link add veth0 numtxqueues 8 numrxqueues 8 type veth peer name veth1
> $ tc qdisc add dev veth0 root mqprio num_tc 8 map 0 1 2 3 4 5 6 7 \
> queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7
> RTNETLINK answers: Invalid argument
>
> Hopefully with this patch, the cause is clearer:
>
> Error: Device does not support hardware offload.
>
> The kernel was (and still is) rejecting this because iproute2 defaults
> to "hw 1" if this command line option is not specified.
>
> Link: https://lore.kernel.org/netdev/[email protected]/
> Signed-off-by: Vladimir Oltean <[email protected]>
> Reviewed-by: Ferenc Fejes <[email protected]>
> Reviewed-by: Simon Horman <[email protected]>
Acked-by: Jamal Hadi Salim <[email protected]>
cheers,
jamal
> ---
> v3->v4: none
> v2->v3: change link from patchwork to lore
> v1->v2: slightly reword last paragraph of commit message
>
> net/sched/sch_mqprio.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/net/sched/sch_mqprio.c b/net/sched/sch_mqprio.c
> index 5a9261c38b95..9ee5a9a9b9e9 100644
> --- a/net/sched/sch_mqprio.c
> +++ b/net/sched/sch_mqprio.c
> @@ -133,8 +133,11 @@ static int mqprio_parse_opt(struct net_device *dev, struct tc_mqprio_qopt *qopt,
> /* If ndo_setup_tc is not present then hardware doesn't support offload
> * and we should return an error.
> */
> - if (qopt->hw && !dev->netdev_ops->ndo_setup_tc)
> + if (qopt->hw && !dev->netdev_ops->ndo_setup_tc) {
> + NL_SET_ERR_MSG(extack,
> + "Device does not support hardware offload");
> return -EINVAL;
> + }
>
> return 0;
> }
> --
> 2.34.1
>
On Mon, Apr 3, 2023 at 6:35 AM Vladimir Oltean <[email protected]> wrote:
>
> IEEE 802.1Q-2018 clause 6.7.2 Frame preemption specifies that each
> packet priority can be assigned to a "frame preemption status" value of
> either "express" or "preemptible". Express priorities are transmitted by
> the local device through the eMAC, and preemptible priorities through
> the pMAC (the concepts of eMAC and pMAC come from the 802.3 MAC Merge
> layer).
>
> The FP adminStatus is defined per packet priority, but 802.1Q clause
> 12.30.1.1.1 framePreemptionAdminStatus also says that:
>
> | Priorities that all map to the same traffic class should be
> | constrained to use the same value of preemption status.
>
> It is impossible to ignore the cognitive dissonance in the standard
> here, because it practically means that the FP adminStatus only takes
> distinct values per traffic class, even though it is defined per
> priority.
>
> I can see no valid use case which is prevented by having the kernel take
> the FP adminStatus as input per traffic class (what we do here).
> In addition, this also enforces the above constraint by construction.
> User space network managers which wish to expose FP adminStatus per
> priority are free to do so; they must only observe the prio_tc_map of
> the netdev (which presumably is also under their control, when
> constructing the mqprio netlink attributes).
>
> The reason for configuring frame preemption as a property of the Qdisc
> layer is that the information about "preemptible TCs" is closest to the
> place which handles the num_tc and prio_tc_map of the netdev. If the
> UAPI would have been any other layer, it would be unclear what to do
> with the FP information when num_tc collapses to 0. A key assumption is
> that only mqprio/taprio change the num_tc and prio_tc_map of the netdev.
> Not sure if that's a great assumption to make.
>
> Having FP in tc-mqprio can be seen as an implementation of the use case
> defined in 802.1Q Annex S.2 "Preemption used in isolation". There will
> be a separate implementation of FP in tc-taprio, for the other use
> cases.
>
> Signed-off-by: Vladimir Oltean <[email protected]>
> Reviewed-by: Ferenc Fejes <[email protected]>
> Reviewed-by: Simon Horman <[email protected]>
> ---
> v3->v4: none
> v2->v3: none
> v1->v2:
> - slightly reword commit message
> - move #include <linux/ethtool_netlink.h> to this patch
> - remove self-evident comment "only for dump and offloading"
>
> include/net/pkt_sched.h | 1 +
> include/uapi/linux/pkt_sched.h | 16 +++++
> net/sched/sch_mqprio.c | 127 ++++++++++++++++++++++++++++++++-
> net/sched/sch_mqprio_lib.c | 14 ++++
> net/sched/sch_mqprio_lib.h | 2 +
> 5 files changed, 159 insertions(+), 1 deletion(-)
>
> diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
> index b43ed4733455..f436688b6efc 100644
> --- a/include/net/pkt_sched.h
> +++ b/include/net/pkt_sched.h
> @@ -172,6 +172,7 @@ struct tc_mqprio_qopt_offload {
> u32 flags;
> u64 min_rate[TC_QOPT_MAX_QUEUE];
> u64 max_rate[TC_QOPT_MAX_QUEUE];
> + unsigned long preemptible_tcs;
> };
>
> struct tc_taprio_caps {
> diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
> index 000eec106856..b8d29be91b62 100644
> --- a/include/uapi/linux/pkt_sched.h
> +++ b/include/uapi/linux/pkt_sched.h
> @@ -719,6 +719,11 @@ enum {
>
> #define __TC_MQPRIO_SHAPER_MAX (__TC_MQPRIO_SHAPER_MAX - 1)
>
> +enum {
> + TC_FP_EXPRESS = 1,
> + TC_FP_PREEMPTIBLE = 2,
> +};
Suggestion: Add a MAX to this enum (as is traditionally done)..
> +
> struct tc_mqprio_qopt {
> __u8 num_tc;
> __u8 prio_tc_map[TC_QOPT_BITMASK + 1];
> @@ -732,12 +737,23 @@ struct tc_mqprio_qopt {
> #define TC_MQPRIO_F_MIN_RATE 0x4
> #define TC_MQPRIO_F_MAX_RATE 0x8
>
> +enum {
> + TCA_MQPRIO_TC_ENTRY_UNSPEC,
> + TCA_MQPRIO_TC_ENTRY_INDEX, /* u32 */
> + TCA_MQPRIO_TC_ENTRY_FP, /* u32 */
> +
> + /* add new constants above here */
> + __TCA_MQPRIO_TC_ENTRY_CNT,
> + TCA_MQPRIO_TC_ENTRY_MAX = (__TCA_MQPRIO_TC_ENTRY_CNT - 1)
> +};
> +
> enum {
> TCA_MQPRIO_UNSPEC,
> TCA_MQPRIO_MODE,
> TCA_MQPRIO_SHAPER,
> TCA_MQPRIO_MIN_RATE64,
> TCA_MQPRIO_MAX_RATE64,
> + TCA_MQPRIO_TC_ENTRY,
> __TCA_MQPRIO_MAX,
> };
>
> diff --git a/net/sched/sch_mqprio.c b/net/sched/sch_mqprio.c
> index 5287ff60b3f9..bc158a7fd6ba 100644
> --- a/net/sched/sch_mqprio.c
> +++ b/net/sched/sch_mqprio.c
> @@ -5,6 +5,7 @@
> * Copyright (c) 2010 John Fastabend <[email protected]>
> */
>
> +#include <linux/ethtool_netlink.h>
> #include <linux/types.h>
> #include <linux/slab.h>
> #include <linux/kernel.h>
> @@ -27,6 +28,7 @@ struct mqprio_sched {
> u32 flags;
> u64 min_rate[TC_QOPT_MAX_QUEUE];
> u64 max_rate[TC_QOPT_MAX_QUEUE];
> + u32 fp[TC_QOPT_MAX_QUEUE];
> };
>
> static int mqprio_enable_offload(struct Qdisc *sch,
> @@ -63,6 +65,8 @@ static int mqprio_enable_offload(struct Qdisc *sch,
> return -EINVAL;
> }
>
> + mqprio_fp_to_offload(priv->fp, &mqprio);
> +
> err = dev->netdev_ops->ndo_setup_tc(dev, TC_SETUP_QDISC_MQPRIO,
> &mqprio);
> if (err)
> @@ -145,13 +149,94 @@ static int mqprio_parse_opt(struct net_device *dev, struct tc_mqprio_qopt *qopt,
> return 0;
> }
>
> +static const struct
> +nla_policy mqprio_tc_entry_policy[TCA_MQPRIO_TC_ENTRY_MAX + 1] = {
> + [TCA_MQPRIO_TC_ENTRY_INDEX] = NLA_POLICY_MAX(NLA_U32,
> + TC_QOPT_MAX_QUEUE),
And use it here...
Out of curiosity, could you have more that 16 queues in this spec? I
noticed 802.1p mentioned somewhere (typically 3 bits).
Lead up question: if the max is 16 then can preemptible_tcs for example be u32?
cheers,
jamal
> + [TCA_MQPRIO_TC_ENTRY_FP] = NLA_POLICY_RANGE(NLA_U32,
> + TC_FP_EXPRESS,
> + TC_FP_PREEMPTIBLE),
> +};
> +
> static const struct nla_policy mqprio_policy[TCA_MQPRIO_MAX + 1] = {
> [TCA_MQPRIO_MODE] = { .len = sizeof(u16) },
> [TCA_MQPRIO_SHAPER] = { .len = sizeof(u16) },
> [TCA_MQPRIO_MIN_RATE64] = { .type = NLA_NESTED },
> [TCA_MQPRIO_MAX_RATE64] = { .type = NLA_NESTED },
> + [TCA_MQPRIO_TC_ENTRY] = { .type = NLA_NESTED },
> };
>
> +static int mqprio_parse_tc_entry(u32 fp[TC_QOPT_MAX_QUEUE],
> + struct nlattr *opt,
> + unsigned long *seen_tcs,
> + struct netlink_ext_ack *extack)
> +{
> + struct nlattr *tb[TCA_MQPRIO_TC_ENTRY_MAX + 1] = { };
> + int err, tc;
> +
> + err = nla_parse_nested(tb, TCA_MQPRIO_TC_ENTRY_MAX, opt,
> + mqprio_tc_entry_policy, extack);
> + if (err < 0)
> + return err;
> +
> + if (!tb[TCA_MQPRIO_TC_ENTRY_INDEX]) {
> + NL_SET_ERR_MSG(extack, "TC entry index missing");
> + return -EINVAL;
> + }
> +
> + tc = nla_get_u32(tb[TCA_MQPRIO_TC_ENTRY_INDEX]);
> + if (*seen_tcs & BIT(tc)) {
> + NL_SET_ERR_MSG(extack, "Duplicate tc entry");
> + return -EINVAL;
> + }
> +
> + *seen_tcs |= BIT(tc);
> +
> + if (tb[TCA_MQPRIO_TC_ENTRY_FP])
> + fp[tc] = nla_get_u32(tb[TCA_MQPRIO_TC_ENTRY_FP]);
> +
> + return 0;
> +}
> +
> +static int mqprio_parse_tc_entries(struct Qdisc *sch, struct nlattr *nlattr_opt,
> + int nlattr_opt_len,
> + struct netlink_ext_ack *extack)
> +{
> + struct mqprio_sched *priv = qdisc_priv(sch);
> + struct net_device *dev = qdisc_dev(sch);
> + bool have_preemption = false;
> + unsigned long seen_tcs = 0;
> + u32 fp[TC_QOPT_MAX_QUEUE];
> + struct nlattr *n;
> + int tc, rem;
> + int err = 0;
> +
> + for (tc = 0; tc < TC_QOPT_MAX_QUEUE; tc++)
> + fp[tc] = priv->fp[tc];
> +
> + nla_for_each_attr(n, nlattr_opt, nlattr_opt_len, rem) {
> + if (nla_type(n) != TCA_MQPRIO_TC_ENTRY)
> + continue;
> +
> + err = mqprio_parse_tc_entry(fp, n, &seen_tcs, extack);
> + if (err)
> + goto out;
> + }
> +
> + for (tc = 0; tc < TC_QOPT_MAX_QUEUE; tc++) {
> + priv->fp[tc] = fp[tc];
> + if (fp[tc] == TC_FP_PREEMPTIBLE)
> + have_preemption = true;
> + }
> +
> + if (have_preemption && !ethtool_dev_mm_supported(dev)) {
> + NL_SET_ERR_MSG(extack, "Device does not support preemption");
> + return -EOPNOTSUPP;
> + }
> +out:
> + return err;
> +}
> +
> /* Parse the other netlink attributes that represent the payload of
> * TCA_OPTIONS, which are appended right after struct tc_mqprio_qopt.
> */
> @@ -234,6 +319,13 @@ static int mqprio_parse_nlattr(struct Qdisc *sch, struct tc_mqprio_qopt *qopt,
> priv->flags |= TC_MQPRIO_F_MAX_RATE;
> }
>
> + if (tb[TCA_MQPRIO_TC_ENTRY]) {
> + err = mqprio_parse_tc_entries(sch, nlattr_opt, nlattr_opt_len,
> + extack);
> + if (err)
> + return err;
> + }
> +
> return 0;
> }
>
> @@ -247,7 +339,7 @@ static int mqprio_init(struct Qdisc *sch, struct nlattr *opt,
> int i, err = -EOPNOTSUPP;
> struct tc_mqprio_qopt *qopt = NULL;
> struct tc_mqprio_caps caps;
> - int len;
> + int len, tc;
>
> BUILD_BUG_ON(TC_MAX_QUEUE != TC_QOPT_MAX_QUEUE);
> BUILD_BUG_ON(TC_BITMASK != TC_QOPT_BITMASK);
> @@ -265,6 +357,9 @@ static int mqprio_init(struct Qdisc *sch, struct nlattr *opt,
> if (!opt || nla_len(opt) < sizeof(*qopt))
> return -EINVAL;
>
> + for (tc = 0; tc < TC_QOPT_MAX_QUEUE; tc++)
> + priv->fp[tc] = TC_FP_EXPRESS;
> +
> qdisc_offload_query_caps(dev, TC_SETUP_QDISC_MQPRIO,
> &caps, sizeof(caps));
>
> @@ -415,6 +510,33 @@ static int dump_rates(struct mqprio_sched *priv,
> return -1;
> }
>
> +static int mqprio_dump_tc_entries(struct mqprio_sched *priv,
> + struct sk_buff *skb)
> +{
> + struct nlattr *n;
> + int tc;
> +
> + for (tc = 0; tc < TC_QOPT_MAX_QUEUE; tc++) {
> + n = nla_nest_start(skb, TCA_MQPRIO_TC_ENTRY);
> + if (!n)
> + return -EMSGSIZE;
> +
> + if (nla_put_u32(skb, TCA_MQPRIO_TC_ENTRY_INDEX, tc))
> + goto nla_put_failure;
> +
> + if (nla_put_u32(skb, TCA_MQPRIO_TC_ENTRY_FP, priv->fp[tc]))
> + goto nla_put_failure;
> +
> + nla_nest_end(skb, n);
> + }
> +
> + return 0;
> +
> +nla_put_failure:
> + nla_nest_cancel(skb, n);
> + return -EMSGSIZE;
> +}
> +
> static int mqprio_dump(struct Qdisc *sch, struct sk_buff *skb)
> {
> struct net_device *dev = qdisc_dev(sch);
> @@ -465,6 +587,9 @@ static int mqprio_dump(struct Qdisc *sch, struct sk_buff *skb)
> (dump_rates(priv, &opt, skb) != 0))
> goto nla_put_failure;
>
> + if (mqprio_dump_tc_entries(priv, skb))
> + goto nla_put_failure;
> +
> return nla_nest_end(skb, nla);
> nla_put_failure:
> nlmsg_trim(skb, nla);
> diff --git a/net/sched/sch_mqprio_lib.c b/net/sched/sch_mqprio_lib.c
> index c58a533b8ec5..83b3793c4012 100644
> --- a/net/sched/sch_mqprio_lib.c
> +++ b/net/sched/sch_mqprio_lib.c
> @@ -114,4 +114,18 @@ void mqprio_qopt_reconstruct(struct net_device *dev, struct tc_mqprio_qopt *qopt
> }
> EXPORT_SYMBOL_GPL(mqprio_qopt_reconstruct);
>
> +void mqprio_fp_to_offload(u32 fp[TC_QOPT_MAX_QUEUE],
> + struct tc_mqprio_qopt_offload *mqprio)
> +{
> + unsigned long preemptible_tcs = 0;
> + int tc;
> +
> + for (tc = 0; tc < TC_QOPT_MAX_QUEUE; tc++)
> + if (fp[tc] == TC_FP_PREEMPTIBLE)
> + preemptible_tcs |= BIT(tc);
> +
> + mqprio->preemptible_tcs = preemptible_tcs;
> +}
> +EXPORT_SYMBOL_GPL(mqprio_fp_to_offload);
> +
> MODULE_LICENSE("GPL");
> diff --git a/net/sched/sch_mqprio_lib.h b/net/sched/sch_mqprio_lib.h
> index 63f725ab8761..079f597072e3 100644
> --- a/net/sched/sch_mqprio_lib.h
> +++ b/net/sched/sch_mqprio_lib.h
> @@ -14,5 +14,7 @@ int mqprio_validate_qopt(struct net_device *dev, struct tc_mqprio_qopt *qopt,
> struct netlink_ext_ack *extack);
> void mqprio_qopt_reconstruct(struct net_device *dev,
> struct tc_mqprio_qopt *qopt);
> +void mqprio_fp_to_offload(u32 fp[TC_QOPT_MAX_QUEUE],
> + struct tc_mqprio_qopt_offload *mqprio);
>
> #endif
> --
> 2.34.1
>
On Mon, Apr 3, 2023 at 6:35 AM Vladimir Oltean <[email protected]> wrote:
>
> This is a duplication of the FP adminStatus logic introduced for
> tc-mqprio. Offloading is done through the tc_mqprio_qopt_offload
> structure embedded within tc_taprio_qopt_offload. So practically, if a
> device driver is written to treat the mqprio portion of taprio just like
> standalone mqprio, it gets unified handling of frame preemption.
>
> I would have reused more code with taprio, but this is mostly netlink
> attribute parsing, which is hard to transform into generic code without
> having something that stinks as a result. We have the same variables
> with the same semantics, just different nlattr type values
> (TCA_MQPRIO_TC_ENTRY=5 vs TCA_TAPRIO_ATTR_TC_ENTRY=12;
> TCA_MQPRIO_TC_ENTRY_FP=2 vs TCA_TAPRIO_TC_ENTRY_FP=3, etc) and
> consequently, different policies for the nest.
>
> Every time nla_parse_nested() is called, an on-stack table "tb" of
> nlattr pointers is allocated statically, up to the maximum understood
> nlattr type. That array size is hardcoded as a constant, but when
> transforming this into a common parsing function, it would become either
> a VLA (which the Linux kernel rightfully doesn't like) or a call to the
> allocator.
>
> Having FP adminStatus in tc-taprio can be seen as addressing the 802.1Q
> Annex S.3 "Scheduling and preemption used in combination, no HOLD/RELEASE"
> and S.4 "Scheduling and preemption used in combination with HOLD/RELEASE"
> use cases. HOLD and RELEASE events are emitted towards the underlying
> MAC Merge layer when the schedule hits a Set-And-Hold-MAC or a
> Set-And-Release-MAC gate operation. So within the tc-taprio UAPI space,
> one can distinguish between the 2 use cases by choosing whether to use
> the TC_TAPRIO_CMD_SET_AND_HOLD and TC_TAPRIO_CMD_SET_AND_RELEASE gate
> operations within the schedule, or just TC_TAPRIO_CMD_SET_GATES.
>
> A small part of the change is dedicated to refactoring the max_sdu
> nlattr parsing to put all logic under the "if" that tests for presence
> of that nlattr.
>
> Signed-off-by: Vladimir Oltean <[email protected]>
> Reviewed-by: Ferenc Fejes <[email protected]>
> Reviewed-by: Simon Horman <[email protected]>
> ---
> v3->v4: none
> v2->v3: none
> v1->v2: slightly reword commit message
>
> include/uapi/linux/pkt_sched.h | 1 +
> net/sched/sch_taprio.c | 65 +++++++++++++++++++++++++++-------
> 2 files changed, 53 insertions(+), 13 deletions(-)
>
> diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
> index b8d29be91b62..51a7addc56c6 100644
> --- a/include/uapi/linux/pkt_sched.h
> +++ b/include/uapi/linux/pkt_sched.h
> @@ -1252,6 +1252,7 @@ enum {
> TCA_TAPRIO_TC_ENTRY_UNSPEC,
> TCA_TAPRIO_TC_ENTRY_INDEX, /* u32 */
> TCA_TAPRIO_TC_ENTRY_MAX_SDU, /* u32 */
> + TCA_TAPRIO_TC_ENTRY_FP, /* u32 */
>
> /* add new constants above here */
> __TCA_TAPRIO_TC_ENTRY_CNT,
> diff --git a/net/sched/sch_taprio.c b/net/sched/sch_taprio.c
> index cbad43019172..76db9a10ef50 100644
> --- a/net/sched/sch_taprio.c
> +++ b/net/sched/sch_taprio.c
> @@ -7,6 +7,7 @@
> */
>
> #include <linux/ethtool.h>
> +#include <linux/ethtool_netlink.h>
> #include <linux/types.h>
> #include <linux/slab.h>
> #include <linux/kernel.h>
> @@ -96,6 +97,7 @@ struct taprio_sched {
> struct list_head taprio_list;
> int cur_txq[TC_MAX_QUEUE];
> u32 max_sdu[TC_MAX_QUEUE]; /* save info from the user */
> + u32 fp[TC_QOPT_MAX_QUEUE]; /* only for dump and offloading */
> u32 txtime_delay;
> };
>
> @@ -1002,6 +1004,9 @@ static const struct nla_policy entry_policy[TCA_TAPRIO_SCHED_ENTRY_MAX + 1] = {
> static const struct nla_policy taprio_tc_policy[TCA_TAPRIO_TC_ENTRY_MAX + 1] = {
> [TCA_TAPRIO_TC_ENTRY_INDEX] = { .type = NLA_U32 },
> [TCA_TAPRIO_TC_ENTRY_MAX_SDU] = { .type = NLA_U32 },
> + [TCA_TAPRIO_TC_ENTRY_FP] = NLA_POLICY_RANGE(NLA_U32,
> + TC_FP_EXPRESS,
> + TC_FP_PREEMPTIBLE),
Same comment applies as in patch 6 here..
cheers,
jamal
>
> static const struct nla_policy taprio_policy[TCA_TAPRIO_ATTR_MAX + 1] = {
> @@ -1524,6 +1529,7 @@ static int taprio_enable_offload(struct net_device *dev,
> mqprio_qopt_reconstruct(dev, &offload->mqprio.qopt);
> offload->mqprio.extack = extack;
> taprio_sched_to_offload(dev, sched, offload, &caps);
> + mqprio_fp_to_offload(q->fp, &offload->mqprio);
>
> for (tc = 0; tc < TC_MAX_QUEUE; tc++)
> offload->max_sdu[tc] = q->max_sdu[tc];
> @@ -1671,13 +1677,14 @@ static int taprio_parse_clockid(struct Qdisc *sch, struct nlattr **tb,
> static int taprio_parse_tc_entry(struct Qdisc *sch,
> struct nlattr *opt,
> u32 max_sdu[TC_QOPT_MAX_QUEUE],
> + u32 fp[TC_QOPT_MAX_QUEUE],
> unsigned long *seen_tcs,
> struct netlink_ext_ack *extack)
> {
> struct nlattr *tb[TCA_TAPRIO_TC_ENTRY_MAX + 1] = { };
> struct net_device *dev = qdisc_dev(sch);
> - u32 val = 0;
> int err, tc;
> + u32 val;
>
> err = nla_parse_nested(tb, TCA_TAPRIO_TC_ENTRY_MAX, opt,
> taprio_tc_policy, extack);
> @@ -1702,15 +1709,18 @@ static int taprio_parse_tc_entry(struct Qdisc *sch,
>
> *seen_tcs |= BIT(tc);
>
> - if (tb[TCA_TAPRIO_TC_ENTRY_MAX_SDU])
> + if (tb[TCA_TAPRIO_TC_ENTRY_MAX_SDU]) {
> val = nla_get_u32(tb[TCA_TAPRIO_TC_ENTRY_MAX_SDU]);
> + if (val > dev->max_mtu) {
> + NL_SET_ERR_MSG_MOD(extack, "TC max SDU exceeds device max MTU");
> + return -ERANGE;
> + }
>
> - if (val > dev->max_mtu) {
> - NL_SET_ERR_MSG_MOD(extack, "TC max SDU exceeds device max MTU");
> - return -ERANGE;
> + max_sdu[tc] = val;
> }
>
> - max_sdu[tc] = val;
> + if (tb[TCA_TAPRIO_TC_ENTRY_FP])
> + fp[tc] = nla_get_u32(tb[TCA_TAPRIO_TC_ENTRY_FP]);
>
> return 0;
> }
> @@ -1720,29 +1730,51 @@ static int taprio_parse_tc_entries(struct Qdisc *sch,
> struct netlink_ext_ack *extack)
> {
> struct taprio_sched *q = qdisc_priv(sch);
> + struct net_device *dev = qdisc_dev(sch);
> u32 max_sdu[TC_QOPT_MAX_QUEUE];
> + bool have_preemption = false;
> unsigned long seen_tcs = 0;
> + u32 fp[TC_QOPT_MAX_QUEUE];
> struct nlattr *n;
> int tc, rem;
> int err = 0;
>
> - for (tc = 0; tc < TC_QOPT_MAX_QUEUE; tc++)
> + for (tc = 0; tc < TC_QOPT_MAX_QUEUE; tc++) {
> max_sdu[tc] = q->max_sdu[tc];
> + fp[tc] = q->fp[tc];
> + }
>
> nla_for_each_nested(n, opt, rem) {
> if (nla_type(n) != TCA_TAPRIO_ATTR_TC_ENTRY)
> continue;
>
> - err = taprio_parse_tc_entry(sch, n, max_sdu, &seen_tcs,
> + err = taprio_parse_tc_entry(sch, n, max_sdu, fp, &seen_tcs,
> extack);
> if (err)
> - goto out;
> + return err;
> }
>
> - for (tc = 0; tc < TC_QOPT_MAX_QUEUE; tc++)
> + for (tc = 0; tc < TC_QOPT_MAX_QUEUE; tc++) {
> q->max_sdu[tc] = max_sdu[tc];
> + q->fp[tc] = fp[tc];
> + if (fp[tc] != TC_FP_EXPRESS)
> + have_preemption = true;
> + }
> +
> + if (have_preemption) {
> + if (!FULL_OFFLOAD_IS_ENABLED(q->flags)) {
> + NL_SET_ERR_MSG(extack,
> + "Preemption only supported with full offload");
> + return -EOPNOTSUPP;
> + }
> +
> + if (!ethtool_dev_mm_supported(dev)) {
> + NL_SET_ERR_MSG(extack,
> + "Device does not support preemption");
> + return -EOPNOTSUPP;
> + }
> + }
>
> -out:
> return err;
> }
>
> @@ -2023,7 +2055,7 @@ static int taprio_init(struct Qdisc *sch, struct nlattr *opt,
> {
> struct taprio_sched *q = qdisc_priv(sch);
> struct net_device *dev = qdisc_dev(sch);
> - int i;
> + int i, tc;
>
> spin_lock_init(&q->current_entry_lock);
>
> @@ -2080,6 +2112,9 @@ static int taprio_init(struct Qdisc *sch, struct nlattr *opt,
> q->qdiscs[i] = qdisc;
> }
>
> + for (tc = 0; tc < TC_QOPT_MAX_QUEUE; tc++)
> + q->fp[tc] = TC_FP_EXPRESS;
> +
> taprio_detect_broken_mqprio(q);
>
> return taprio_change(sch, opt, extack);
> @@ -2223,6 +2258,7 @@ static int dump_schedule(struct sk_buff *msg,
> }
>
> static int taprio_dump_tc_entries(struct sk_buff *skb,
> + struct taprio_sched *q,
> struct sched_gate_list *sched)
> {
> struct nlattr *n;
> @@ -2240,6 +2276,9 @@ static int taprio_dump_tc_entries(struct sk_buff *skb,
> sched->max_sdu[tc]))
> goto nla_put_failure;
>
> + if (nla_put_u32(skb, TCA_TAPRIO_TC_ENTRY_FP, q->fp[tc]))
> + goto nla_put_failure;
> +
> nla_nest_end(skb, n);
> }
>
> @@ -2281,7 +2320,7 @@ static int taprio_dump(struct Qdisc *sch, struct sk_buff *skb)
> nla_put_u32(skb, TCA_TAPRIO_ATTR_TXTIME_DELAY, q->txtime_delay))
> goto options_error;
>
> - if (oper && taprio_dump_tc_entries(skb, oper))
> + if (oper && taprio_dump_tc_entries(skb, q, oper))
> goto options_error;
>
> if (oper && dump_schedule(skb, oper))
> --
> 2.34.1
>
On Fri, Apr 07, 2023 at 12:22:26PM -0400, Jamal Hadi Salim wrote:
> > +enum {
> > + TC_FP_EXPRESS = 1,
> > + TC_FP_PREEMPTIBLE = 2,
> > +};
>
> Suggestion: Add a MAX to this enum (as is traditionally done)..
Max what? This doesn't count anything, it just expresses whether the
quality of one traffic class, from the Frame Preemption standard's
perspective, is express or preemptible...
> > @@ -145,13 +149,94 @@ static int mqprio_parse_opt(struct net_device *dev, struct tc_mqprio_qopt *qopt,
> > return 0;
> > }
> >
> > +static const struct
> > +nla_policy mqprio_tc_entry_policy[TCA_MQPRIO_TC_ENTRY_MAX + 1] = {
> > + [TCA_MQPRIO_TC_ENTRY_INDEX] = NLA_POLICY_MAX(NLA_U32,
> > + TC_QOPT_MAX_QUEUE),
>
> And use it here...
Where? Above or below the comment? I think you mean below (for the
policy of TCA_MQPRIO_TC_ENTRY_FP)?
> Out of curiosity, could you have more that 16 queues in this spec? I
> noticed 802.1p mentioned somewhere (typically 3 bits).
"This spec" is IEEE 802.1Q :) It doesn't say how many "queues" (struct
netdev_queue) there are, and this UAPI doesn't work with those, either.
The standard defines 8 priority values, groupable in (potentially fewer)
traffic classes. Linux liked to extend the number of traffic classes to
16 (and the skb->priority values are arbitrarily large IIUC) and this is
where that number 16 came from. The number of 16 traffic classes still
allows for more than 16 TXQs though.
> Lead up question: if the max is 16 then can preemptible_tcs for example be u32?
I don't understand this question, sorry. preemptible_tcs is declared as
"unsigned long", which IIUC is at least 32-bit.
>
> > + [TCA_MQPRIO_TC_ENTRY_FP] = NLA_POLICY_RANGE(NLA_U32,
> > + TC_FP_EXPRESS,
> > + TC_FP_PREEMPTIBLE),
On Fri, Apr 7, 2023 at 12:41 PM Vladimir Oltean <[email protected]> wrote:
>
> On Fri, Apr 07, 2023 at 12:22:26PM -0400, Jamal Hadi Salim wrote:
> > > +enum {
> > > + TC_FP_EXPRESS = 1,
> > > + TC_FP_PREEMPTIBLE = 2,
> > > +};
> >
> > Suggestion: Add a MAX to this enum (as is traditionally done)..
>
> Max what? This doesn't count anything, it just expresses whether the
> quality of one traffic class, from the Frame Preemption standard's
> perspective, is express or preemptible...
>
> > > @@ -145,13 +149,94 @@ static int mqprio_parse_opt(struct net_device *dev, struct tc_mqprio_qopt *qopt,
> > > return 0;
> > > }
> > >
> > > +static const struct
> > > +nla_policy mqprio_tc_entry_policy[TCA_MQPRIO_TC_ENTRY_MAX + 1] = {
> > > + [TCA_MQPRIO_TC_ENTRY_INDEX] = NLA_POLICY_MAX(NLA_U32,
> > > + TC_QOPT_MAX_QUEUE),
> >
> > And use it here...
>
> Where? Above or below the comment? I think you mean below (for the
> policy of TCA_MQPRIO_TC_ENTRY_FP)?
>
That was what I meant. I misread that code thinking it was a nested
TLV range check. If it is only going to be those two specific values,
I understand - but then wondering why you
need a u32; wouldnt a u8 be sufficient? The only reason you would need
a MAX is if it is possible that new values greater than
TC_FP_PREEMPTIBLE showing up in the future.
> > Out of curiosity, could you have more that 16 queues in this spec? I
> > noticed 802.1p mentioned somewhere (typically 3 bits).
>
> "This spec" is IEEE 802.1Q :) It doesn't say how many "queues" (struct
> netdev_queue) there are, and this UAPI doesn't work with those, either.
> The standard defines 8 priority values, groupable in (potentially fewer)
> traffic classes. Linux liked to extend the number of traffic classes to
> 16 (and the skb->priority values are arbitrarily large IIUC) and this is
> where that number 16 came from. The number of 16 traffic classes still
> allows for more than 16 TXQs though.
>
> > Lead up question: if the max is 16 then can preemptible_tcs for example be u32?
>
> I don't understand this question, sorry. preemptible_tcs is declared as
> "unsigned long", which IIUC is at least 32-bit.
I meant: if you only had 16 possible values, meaning 16 bits are
sufficient, (although i may be misunderstanding the goal of those
bits) why not be explicit and use the proper type/size?
cheers,
jamal
> >
> > > + [TCA_MQPRIO_TC_ENTRY_FP] = NLA_POLICY_RANGE(NLA_U32,
> > > + TC_FP_EXPRESS,
> > > + TC_FP_PREEMPTIBLE),
On Fri, Apr 07, 2023 at 02:49:01PM -0400, Jamal Hadi Salim wrote:
> On Fri, Apr 7, 2023 at 12:41 PM Vladimir Oltean <[email protected]> wrote:
> > On Fri, Apr 07, 2023 at 12:22:26PM -0400, Jamal Hadi Salim wrote:
> > > > +enum {
> > > > + TC_FP_EXPRESS = 1,
> > > > + TC_FP_PREEMPTIBLE = 2,
> > > > +};
> > >
> > > Suggestion: Add a MAX to this enum (as is traditionally done)..
> >
> > Max what? This doesn't count anything, it just expresses whether the
> > quality of one traffic class, from the Frame Preemption standard's
> > perspective, is express or preemptible...
> >
> > > > @@ -145,13 +149,94 @@ static int mqprio_parse_opt(struct net_device *dev, struct tc_mqprio_qopt *qopt,
> > > > return 0;
> > > > }
> > > >
> > > > +static const struct
> > > > +nla_policy mqprio_tc_entry_policy[TCA_MQPRIO_TC_ENTRY_MAX + 1] = {
> > > > + [TCA_MQPRIO_TC_ENTRY_INDEX] = NLA_POLICY_MAX(NLA_U32,
> > > > + TC_QOPT_MAX_QUEUE),
> > >
> > > And use it here...
> >
> > Where? Above or below the comment? I think you mean below (for the
> > policy of TCA_MQPRIO_TC_ENTRY_FP)?
>
> That was what I meant. I misread that code thinking it was a nested
> TLV range check. If it is only going to be those two specific values,
> I understand - but then wondering why you need a u32; wouldnt a u8 be
> sufficient?
I believe netlink isn't exactly optimized for passing small values; the
netlink attributes are going to be aligned to NLA_ALIGNTO (4) anyway,
so it's not like this is going to save space or something. Also, there's
a policy restricting the maximum, so arbitrarily large values cannot be
passed now, but could be passed later if needed. I did not see any good
enough reason to close that door.
> The only reason you would need a MAX is if it is possible that new
> values greater than TC_FP_PREEMPTIBLE showing up in the future.
Even if someone wants to add TC_FP_KINDA_PREEMPTIBLE = 3 and
TC_FP_PREEMPTIBLE_WITH_STRIPES = 4 in the future, I'm still not sure how
a MAX definition exported by the kernel is going to help them?
I mean, about the only thing that it would avoid is that I wouldn't be
changing the policy definition, but that's rather minor and doesn't
justify exporting something to UAPI? The changed MAX value is only a
property of the kernel headers that the application is compiled with -
it doesn't give the capability of the running kernel.
To see whether TC_FP_PREEMPTIBLE_WITH_STRIPES is supported, the
application would have to try it and see if it fails. Which is also the
case right now with TC_FP_PREEMPTIBLE.
> > > Lead up question: if the max is 16 then can preemptible_tcs for example be u32?
> >
> > I don't understand this question, sorry. preemptible_tcs is declared as
> > "unsigned long", which IIUC is at least 32-bit.
>
> I meant: if you only had 16 possible values, meaning 16 bits are
> sufficient, (although i may be misunderstanding the goal of those
> bits) why not be explicit and use the proper type/size?
If you think it's valuable to change the type of preemptible_tcs from
unsigned long to u16 and that's a more "proper" type, I can do so.
On Fri, Apr 7, 2023 at 3:31 PM Vladimir Oltean <[email protected]> wrote:
>
> On Fri, Apr 07, 2023 at 02:49:01PM -0400, Jamal Hadi Salim wrote:
> > On Fri, Apr 7, 2023 at 12:41 PM Vladimir Oltean <[email protected]> wrote:
> > > On Fri, Apr 07, 2023 at 12:22:26PM -0400, Jamal Hadi Salim wrote:
> > > > > +enum {
> > > > > + TC_FP_EXPRESS = 1,
> > > > > + TC_FP_PREEMPTIBLE = 2,
> > > > > +};
> > > >
> > > > Suggestion: Add a MAX to this enum (as is traditionally done)..
> > >
> > > Max what? This doesn't count anything, it just expresses whether the
> > > quality of one traffic class, from the Frame Preemption standard's
> > > perspective, is express or preemptible...
> > >
> > > > > @@ -145,13 +149,94 @@ static int mqprio_parse_opt(struct net_device *dev, struct tc_mqprio_qopt *qopt,
> > > > > return 0;
> > > > > }
> > > > >
> > > > > +static const struct
> > > > > +nla_policy mqprio_tc_entry_policy[TCA_MQPRIO_TC_ENTRY_MAX + 1] = {
> > > > > + [TCA_MQPRIO_TC_ENTRY_INDEX] = NLA_POLICY_MAX(NLA_U32,
> > > > > + TC_QOPT_MAX_QUEUE),
> > > >
> > > > And use it here...
> > >
> > > Where? Above or below the comment? I think you mean below (for the
> > > policy of TCA_MQPRIO_TC_ENTRY_FP)?
> >
> > That was what I meant. I misread that code thinking it was a nested
> > TLV range check. If it is only going to be those two specific values,
> > I understand - but then wondering why you need a u32; wouldnt a u8 be
> > sufficient?
>
> I believe netlink isn't exactly optimized for passing small values; the
> netlink attributes are going to be aligned to NLA_ALIGNTO (4) anyway,
> so it's not like this is going to save space or something. Also, there's
> a policy restricting the maximum, so arbitrarily large values cannot be
> passed now, but could be passed later if needed. I did not see any good
> enough reason to close that door.
>
> > The only reason you would need a MAX is if it is possible that new
> > values greater than TC_FP_PREEMPTIBLE showing up in the future.
>
> Even if someone wants to add TC_FP_KINDA_PREEMPTIBLE = 3 and
> TC_FP_PREEMPTIBLE_WITH_STRIPES = 4 in the future, I'm still not sure how
> a MAX definition exported by the kernel is going to help them?
>
> I mean, about the only thing that it would avoid is that I wouldn't be
> changing the policy definition, but that's rather minor and doesn't
> justify exporting something to UAPI?
Yes, it is minor (and usually minor things generate the most emails;->).
I may be misunderstanding what you mean by "doesnt justify exporting
something to UAPI" - those definitions are part of uapi and are
already
being exported.
> The changed MAX value is only a
> property of the kernel headers that the application is compiled with -
> it doesn't give the capability of the running kernel.
>
Maybe I am missing something or not communicating effectively. What i
am suggesting is something very trivial:
enum {
TC_FP_EXPRESS = 1,
TC_FP_PREEMPTIBLE = 2,
TC_FP_MAX
};
[TCA_MQPRIO_TC_ENTRY_FP] = NLA_POLICY_RANGE(NLA_U32,
TC_FP_EXPRESS,
TC_FP_MAX),
And in a newer revision with TC_FP_PREEMPTIBLE_WITH_STRIPES:
enum {
TC_FP_EXPRESS = 1,
TC_FP_PREEMPTIBLE = 2,
TC_FP_PREEMPTIBLE_WITH_STRIPES = 3,
TC_FP_MAX
};
etc.
Saves you one line in a patch for when TC_FP_PREEMPTIBLE_WITH_STRIPES shows up.
> To see whether TC_FP_PREEMPTIBLE_WITH_STRIPES is supported, the
> application would have to try it and see if it fails. Which is also the
> case right now with TC_FP_PREEMPTIBLE.
You may be referring to the combination of iproute2/kernel.
In all cases, NLA_POLICY_RANGE will take care of rejecting something
out of bound.
> > > > Lead up question: if the max is 16 then can preemptible_tcs for example be u32?
> > >
> > > I don't understand this question, sorry. preemptible_tcs is declared as
> > > "unsigned long", which IIUC is at least 32-bit.
> >
> > I meant: if you only had 16 possible values, meaning 16 bits are
> > sufficient, (although i may be misunderstanding the goal of those
> > bits) why not be explicit and use the proper type/size?
>
> If you think it's valuable to change the type of preemptible_tcs from
> unsigned long to u16 and that's a more "proper" type, I can do so.
No, no, it is a matter of taste and opinion. You may have noticed,
trivial stuff like this gets the most comments and reviews normally(we
just spent like 4-5 emails on this?). Poteto/potato: IOW, if i was to
do it i would have used a u16 or u32 because i feel it would be more
readable. I would have used NLA_U8 because i felt it is more fitting
and i would have used a max value because it would save me one line in
a patch in the future. I think weve spent enough electrons on this - I
defer to you.
cheers,
jamal
On Fri, Apr 07, 2023 at 05:40:20PM -0400, Jamal Hadi Salim wrote:
> Yes, it is minor (and usually minor things generate the most emails;->).
> I may be misunderstanding what you mean by "doesnt justify exporting
> something to UAPI" - those definitions are part of uapi and are
> already being exported.
In my proposed patch set there isn't any TC_FP_MAX. I'm saying it
doesn't help user space, and so, it just pollutes the name space of C
programs with no good reason.
> > The changed MAX value is only a
> > property of the kernel headers that the application is compiled with -
> > it doesn't give the capability of the running kernel.
>
> Maybe I am missing something or not communicating effectively. What i
> am suggesting is something very trivial:
>
> enum {
> TC_FP_EXPRESS = 1,
> TC_FP_PREEMPTIBLE = 2,
> TC_FP_MAX
> };
>
> [TCA_MQPRIO_TC_ENTRY_FP] = NLA_POLICY_RANGE(NLA_U32,
> TC_FP_EXPRESS,
> TC_FP_MAX),
>
> And in a newer revision with TC_FP_PREEMPTIBLE_WITH_STRIPES:
>
> enum {
> TC_FP_EXPRESS = 1,
> TC_FP_PREEMPTIBLE = 2,
> TC_FP_PREEMPTIBLE_WITH_STRIPES = 3,
> TC_FP_MAX
> };
> etc.
>
> Saves you one line in a patch for when TC_FP_PREEMPTIBLE_WITH_STRIPES shows up.
Right, and I don't think that saving me one line in a kernel patch is a
good enough reason to add TC_FP_MAX to include/uapi/, when I can't find
a reason why user space would be interested in TC_FP_MAX anyway.
> > If you think it's valuable to change the type of preemptible_tcs from
> > unsigned long to u16 and that's a more "proper" type, I can do so.
>
> No, no, it is a matter of taste and opinion. You may have noticed,
> trivial stuff like this gets the most comments and reviews normally(we
> just spent like 4-5 emails on this?). Poteto/potato: IOW, if i was to
> do it i would have used a u16 or u32 because i feel it would be more
> readable. I would have used NLA_U8 because i felt it is more fitting
> and i would have used a max value because it would save me one line in
> a patch in the future. I think weve spent enough electrons on this - I
> defer to you.
Ok, I won't change preemptible_tcs from unsigned long to u32.
Things like for_each_set_bit() take unsigned long, and so, I got used
to using that consistently for small bitfield types.
If there's a second opinion stating that I should prefer the smallest
netlink attribute type that fits the estimated data, then I'll transition
from NLA_U32 to NLA_U8. Otherwise, I won't :) since I would need to
change iproute2 too, and I'd have to re-test more thoroughly to make
sure I don't introduce stupid bugs.
On Sat, 8 Apr 2023 00:52:52 +0300 Vladimir Oltean wrote:
> On Fri, Apr 07, 2023 at 05:40:20PM -0400, Jamal Hadi Salim wrote:
> > Yes, it is minor (and usually minor things generate the most emails;->).
> > I may be misunderstanding what you mean by "doesnt justify exporting
> > something to UAPI" - those definitions are part of uapi and are
> > already being exported.
>
> In my proposed patch set there isn't any TC_FP_MAX. I'm saying it
> doesn't help user space, and so, it just pollutes the name space of C
> programs with no good reason.
+1 we tend to sprinkle MAX and UNSPEC into every enum
> > No, no, it is a matter of taste and opinion. You may have noticed,
> > trivial stuff like this gets the most comments and reviews normally(we
> > just spent like 4-5 emails on this?). Poteto/potato: IOW, if i was to
> > do it i would have used a u16 or u32 because i feel it would be more
> > readable. I would have used NLA_U8 because i felt it is more fitting
> > and i would have used a max value because it would save me one line in
> > a patch in the future. I think weve spent enough electrons on this - I
> > defer to you.
>
> Ok, I won't change preemptible_tcs from unsigned long to u32.
> Things like for_each_set_bit() take unsigned long, and so, I got used
> to using that consistently for small bitfield types.
>
> If there's a second opinion stating that I should prefer the smallest
> netlink attribute type that fits the estimated data, then I'll transition
> from NLA_U32 to NLA_U8. Otherwise, I won't :) since I would need to
> change iproute2 too, and I'd have to re-test more thoroughly to make
> sure I don't introduce stupid bugs.
And here also agreed. We should have a patchwork check for new uses of
NLA_*{8,16} if you ask me :S NLA_FLAG or NLA_U32, anything in between
needs a strong justification. Until Alex L posts the variable size
ints, then NLA_FLAG or NLA_UINT ;)
On Wed, Apr 05, 2023 at 06:12:34PM -0700, Jakub Kicinski wrote:
> On Mon, 3 Apr 2023 13:34:37 +0300 Vladimir Oltean wrote:
> > +static int mqprio_parse_tc_entry(u32 fp[TC_QOPT_MAX_QUEUE],
> > + struct nlattr *opt,
> > + unsigned long *seen_tcs,
> > + struct netlink_ext_ack *extack)
> > +{
> > + struct nlattr *tb[TCA_MQPRIO_TC_ENTRY_MAX + 1] = { };
>
> nit: no need to clear it nla_parse*() zeros the memory
ok.
> > + int err, tc;
> > +
> > + err = nla_parse_nested(tb, TCA_MQPRIO_TC_ENTRY_MAX, opt,
> > + mqprio_tc_entry_policy, extack);
> > + if (err < 0)
> > + return err;
> > +
> > + if (!tb[TCA_MQPRIO_TC_ENTRY_INDEX]) {
> > + NL_SET_ERR_MSG(extack, "TC entry index missing");
>
> Are you not using NL_REQ_ATTR_CHECK() because iproute can't actually
> parse the result? :(
I could use it though.. let's assume that iproute2 is "reference code"
and gets the nlattr structure right. Thus, the NLMSGERR_ATTR_MISS_NEST
would be of more interest for custom user programs.
Speaking of which, is there any reference example of how to use
NLMSGERR_ATTR_MISS_NEST? My search came up empty handed:
https://github.com/search?p=1&q=NLMSGERR_ATTR_MISS_NEST&type=Code
I usually steal from hostap's error_handler(), but it looks like it
hasn't gotten that advanced yet as to re-parse the netlink message to
understand the reason why it got rejected.
>
> > + return -EINVAL;
> > + }
> > +
> > + tc = nla_get_u32(tb[TCA_MQPRIO_TC_ENTRY_INDEX]);
> > + if (*seen_tcs & BIT(tc)) {
> > + NL_SET_ERR_MSG(extack, "Duplicate tc entry");
>
> set attr in extack?
ok
> minor heads up - I'll take the trivial cleanup patch from Pedro
> so make sure you rebase:
> https://lore.kernel.org/all/[email protected]/
ok
On Tue, 11 Apr 2023 20:01:51 +0300 Vladimir Oltean wrote:
> > > + int err, tc;
> > > +
> > > + err = nla_parse_nested(tb, TCA_MQPRIO_TC_ENTRY_MAX, opt,
> > > + mqprio_tc_entry_policy, extack);
> > > + if (err < 0)
> > > + return err;
> > > +
> > > + if (!tb[TCA_MQPRIO_TC_ENTRY_INDEX]) {
> > > + NL_SET_ERR_MSG(extack, "TC entry index missing");
> >
> > Are you not using NL_REQ_ATTR_CHECK() because iproute can't actually
> > parse the result? :(
>
> I could use it though.. let's assume that iproute2 is "reference code"
> and gets the nlattr structure right. Thus, the NLMSGERR_ATTR_MISS_NEST
> would be of more interest for custom user programs.
>
> Speaking of which, is there any reference example of how to use
> NLMSGERR_ATTR_MISS_NEST? My search came up empty handed:
> https://github.com/search?p=1&q=NLMSGERR_ATTR_MISS_NEST&type=Code
Only YNL to my knowledge, basic support in the in-tree Python:
https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/tree/tools/net/ynl/lib/ynl.py#n198
And fuller support in the user space hacks which never made it out
of my GitHub:
https://github.com/kuba-moo/linux/blob/ynl-user-c-wip/tools/net/ynl/lib/ynl.c#L163
> I usually steal from hostap's error_handler(), but it looks like it
> hasn't gotten that advanced yet as to re-parse the netlink message to
> understand the reason why it got rejected.