LinuxLists.cc - [PATCH net-next v5] sched: Add dualpi2 qdisc

2019-08-22 10:39:36

by Tilmans, Olivier (Nokia - BE/Antwerp)

Subject: [PATCH net-next v5] sched: Add dualpi2 qdisc

From: Olga Albisser <[email protected]>

DualPI2 provides L4S-type low latency & loss to traffic that uses a
scalable congestion controller (e.g. TCP-Prague, DCTCP) without
degrading the performance of 'classic' traffic (e.g. Reno,
Cubic etc.). It is intended to be the reference implementation of the
IETF's DualQ Coupled AQM.

The qdisc provides two queues called low latency and classic. It
classifies packets based on the ECN field in the IP headers. By
default it directs non-ECN and ECT(0) into the classic queue and
ECT(1) and CE into the low latency queue, as per the IETF spec.
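
The classification rule above can be sketched in userspace C as follows.
This is only an illustration under stated assumptions: is_l4s() and the
ECN_* names are hypothetical and are not the identifiers used in the patch,
which works on skbs and stores the result in skb->cb.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* ECN codepoints: the two low-order bits of the IPv4 TOS / IPv6
 * traffic class byte (RFC 3168).
 */
enum {
	ECN_NOT_ECT = 0x0,
	ECN_ECT_1   = 0x1,
	ECN_ECT_0   = 0x2,
	ECN_CE      = 0x3,
	ECN_MASK    = 0x3,
};

/* Hypothetical helper: true if the packet belongs in the low latency
 * queue, i.e., if any ECN bit selected by l4s_mask is set.
 */
static bool is_l4s(uint8_t dsfield, uint8_t l4s_mask)
{
	/* The mask may only select ECN bits. */
	assert((l4s_mask & ~ECN_MASK) == 0);
	return ((dsfield & ECN_MASK) & l4s_mask) != 0;
}
```

With a mask of ECN_ECT_1, both ECT(1) (0b01) and CE (0b11) match, while
Not-ECT (0b00) and ECT(0) (0b10) do not, which is the default split
described above.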

Each queue runs its own AQM:
* The classic AQM is called PI2, which is similar to the PIE AQM but
more responsive and simpler. Classic traffic requires a decent
target queue (default 15ms for Internet deployment) to fully
utilize the link and to avoid high drop rates.
* The low latency AQM is, by default, a very shallow ECN marking
threshold (1ms) similar to that used for DCTCP.
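
As a rough sketch of the low latency step AQM, assuming a time-based
threshold and nanosecond timestamps: a packet is CE-marked when its sojourn
time exceeds the threshold. step_should_mark() is a hypothetical name chosen
for illustration; the patch performs this check inline at dequeue.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NSEC_PER_MSEC 1000000ULL

/* Hypothetical helper: decide whether a packet leaving the low latency
 * queue should be CE-marked by the step AQM.
 */
static bool step_should_mark(uint64_t enqueue_ns, uint64_t now_ns,
			     uint64_t thresh_ns)
{
	/* Timestamps are assumed to come from a monotonic clock. */
	assert(now_ns >= enqueue_ns);
	return (now_ns - enqueue_ns) > thresh_ns;
}
```

With the default 1ms threshold, a packet that waited exactly 1ms is not
marked, while one that waited 1ms plus one nanosecond is.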

The DualQ isolates the low queuing delay of the Low Latency queue
from the larger delay of the 'Classic' queue. However, from a
bandwidth perspective, flows in either queue will share out the link
capacity as if there was just a single queue. This bandwidth pooling
effect is achieved by coupling together the drop and ECN-marking
probabilities of the two AQMs.

The PI2 AQM has two main parameters in addition to its target delay.
All the defaults are suitable for any Internet setting, but it can
be reconfigured for a Data Centre setting. The integral gain factor
alpha is used to slowly correct any persistent standing queue error
from the target delay, while the proportional gain factor beta is
used to quickly compensate for queue changes (growth or shrinkage).
Alpha and beta can either be given directly as parameters, or be
calculated by tc from alternative typical and maximum RTT parameters.
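
The role of the two gains can be illustrated with a floating-point sketch of
one PI update step. This is an assumption-laden simplification: the qdisc
itself uses scaled 32-bit integer arithmetic, and pi_update() is a
hypothetical name. Delays are in seconds, gains in Hz, and the probability
is clamped to [0, 1].

```c
#include <assert.h>

/* Hypothetical helper: one update of the base probability.
 * alpha acts on the standing error from the target delay (integral term),
 * beta acts on the change in delay since the last update (proportional
 * term).
 */
static double pi_update(double prob, double qdelay, double qdelay_old,
			double target, double alpha, double beta)
{
	double p;

	assert(alpha >= 0.0 && beta >= 0.0);
	p = prob + alpha * (qdelay - target) + beta * (qdelay - qdelay_old);
	if (p < 0.0)
		return 0.0;
	if (p > 1.0)
		return 1.0;
	return p;
}
```

For example, with a queue sitting exactly at the target and not changing,
the probability is left untouched; a queue above the target pushes it up,
and an empty queue pulls it back towards 0.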

Internally, the output of a linear Proportional Integral (PI)
controller is used for both queues. This output is squared to
calculate the drop or ECN-marking probability of the classic queue.
This counterbalances the square-root rate equation of Reno/Cubic,
which is the trick that balances flow rates across the queues. For
the ECN-marking probability of the low latency queue, the output of
the base AQM is multiplied by a coupling factor. This determines the
balance between the flow rates in each queue. The default setting
makes the flow rates roughly equal, which should be generally
applicable.
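
A minimal floating-point sketch of how the two probabilities derive from the
single PI output p, assuming p is in [0, 1] and k is the coupling factor.
The function names are hypothetical, and the qdisc does not compute p * p
explicitly (it compares two random numbers against p instead).

```c
#include <assert.h>

/* Hypothetical helper: drop/mark probability applied to the classic queue. */
static double classic_prob(double p)
{
	assert(p >= 0.0 && p <= 1.0);
	return p * p;
}

/* Hypothetical helper: ECN-marking probability applied to the low latency
 * queue, capped at 100%.
 */
static double l4s_prob(double p, unsigned int k)
{
	double pl;

	assert(p >= 0.0 && p <= 1.0);
	pl = p * k;
	return pl > 1.0 ? 1.0 : pl;
}
```

With p = 0.25 and k = 2, the classic queue sees a 6.25% drop/mark
probability while the low latency queue sees a 50% marking probability.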

If the DualPI2 AQM detects overload (due to excessive non-responsive
traffic in either queue), it will switch to signaling congestion
solely using drop, irrespective of the ECN field. Alternatively, it
can be configured to limit the drop probability and let the queue
grow and eventually overflow (like tail-drop).

Additional details can be found in the draft:
https://www.ietf.org/id/draft-ietf-tsvwg-aqm-dualq-coupled

Signed-off-by: Olga Albisser <[email protected]>
Signed-off-by: Koen De Schepper <[email protected]>
Signed-off-by: Olivier Tilmans <[email protected]>
Signed-off-by: Bob Briscoe <[email protected]>
Signed-off-by: Henrik Steen <[email protected]>
---

Notes:
Changelog:
* v4 -> v5
- Fix do_div() usage in calculate_probability() to preserve sign
* v3 -> v4
- Replaced license boilerplate with SPDX identifier
- Fix missing pskb_may_pull() calls when accessing ECN bits
- Move timestamp computation at enqueue to happen after drop check
- Use NMI-safe time keeping function, i.e., ktime_get_ns()
- Switched from deprecated PSCHED_NS2TICKS/... to raw nanoseconds clocks
- Validate netlink parameters properly (ranges, error reporting)
- Expanded the statistics tracked/reported to better reflect the behavior of
both queues
- Simplified the qdisc structure:
o Reworked classification logic to only depend on an ECN mask
o Renamed most parameters to better reflect their usage
o Removed unused/experimental features (e.g., TS-FIFO)
o Restructured the skb->cb
o Extracted helper functions
- Fix compilation issues for ARM
- Updated default parameter values to latest IETF ID
- Fix the step AQM being applied on empty queues, causing excess marking on
slower links
* v2 -> v3
- Fix compilation issues
- Replaced the classic queue starvation protection from time-shifted FIFO
to WRR, as it gives better results (e.g., prevents leaking burst in the C
queue to the L queue)
* v1 -> v2
- Store enqueue timestamp in skb->cb to avoid conflict with EDT

include/uapi/linux/pkt_sched.h | 33 ++
net/sched/Kconfig | 22 +-
net/sched/Makefile | 1 +
net/sched/sch_dualpi2.c | 746 +++++++++++++++++++++++++++++++++
4 files changed, 801 insertions(+), 1 deletion(-)
create mode 100644 net/sched/sch_dualpi2.c

diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index 18f185299f47..e2ad4a8d2059 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -1180,4 +1180,37 @@ enum {

#define TCA_TAPRIO_ATTR_MAX (__TCA_TAPRIO_ATTR_MAX - 1)

+/* DUALPI2 */
+enum {
+ TCA_DUALPI2_UNSPEC,
+ TCA_DUALPI2_LIMIT, /* Packets */
+ TCA_DUALPI2_TARGET, /* us */
+ TCA_DUALPI2_TUPDATE, /* us */
+ TCA_DUALPI2_ALPHA, /* Hz scaled up by 256 */
+ TCA_DUALPI2_BETA, /* Hz scaled up by 256 */
+ TCA_DUALPI2_STEP_THRESH, /* Packets or us */
+ TCA_DUALPI2_STEP_PACKETS, /* Whether STEP_THRESH is in packets */
+ TCA_DUALPI2_COUPLING, /* Coupling factor between queues */
+ TCA_DUALPI2_DROP_OVERLOAD, /* Whether to drop on overload */
+ TCA_DUALPI2_DROP_EARLY, /* Whether to drop on enqueue */
+ TCA_DUALPI2_C_PROTECTION, /* Percentage */
+ TCA_DUALPI2_ECN_MASK, /* L4S queue classification mask */
+ TCA_DUALPI2_PAD,
+ __TCA_DUALPI2_MAX
+};
+
+#define TCA_DUALPI2_MAX (__TCA_DUALPI2_MAX - 1)
+
+struct tc_dualpi2_xstats {
+ __u32 prob; /* current probability */
+ __u32 delay_c; /* current delay in C queue */
+ __u32 delay_l; /* current delay in L queue */
+ __s32 credit; /* current c_protection credit */
+ __u32 packets_in_c; /* number of packets enqueued in C queue */
+ __u32 packets_in_l; /* number of packets enqueued in L queue */
+ __u32 maxq; /* maximum queue size */
+ __u32 ecn_mark; /* packets marked with ECN */
+ __u32 step_marks; /* ECN marks due to the step AQM */
+};
+
#endif
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index afd2ba157a13..f9340c18c3a2 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -409,6 +409,26 @@ config NET_SCH_PLUG
To compile this code as a module, choose M here: the
module will be called sch_plug.

+config NET_SCH_DUALPI2
+ tristate "Dual Queue Proportional Integral Controller Improved with a Square (DUALPI2) scheduler"
+ help
+ Say Y here if you want to use the DualPI2 AQM.
+ This is a combination of the DUALQ Coupled-AQM with a PI2 base-AQM.
+ The PI2 AQM is in turn both an extension and a simplification of the
+ PIE AQM. PI2 makes several PIE heuristics unnecessary, while being
+ able to control scalable congestion controls like DCTCP and
+ TCP-Prague. With PI2, Reno/Cubic can be used in parallel with
+ DCTCP, maintaining window fairness. DUALQ provides latency separation
+ between low latency DCTCP flows and Reno/Cubic flows that need a
+ bigger queue.
+ For more information, please see
+ https://www.ietf.org/id/draft-ietf-tsvwg-aqm-dualq-coupled
+
+ To compile this code as a module, choose M here: the module
+ will be called sch_dualpi2.
+
+ If unsure, say N.
+
menuconfig NET_SCH_DEFAULT
bool "Allow override default queue discipline"
---help---
@@ -418,7 +438,7 @@ menuconfig NET_SCH_DEFAULT
of pfifo_fast will be used. Many distributions already set
the default value via /proc/sys/net/core/default_qdisc.

- If unsure, say N.
+

if NET_SCH_DEFAULT

diff --git a/net/sched/Makefile b/net/sched/Makefile
index 415d1e1f237e..8e3bd4459eb4 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -61,6 +61,7 @@ obj-$(CONFIG_NET_SCH_PIE) += sch_pie.o
obj-$(CONFIG_NET_SCH_CBS) += sch_cbs.o
obj-$(CONFIG_NET_SCH_ETF) += sch_etf.o
obj-$(CONFIG_NET_SCH_TAPRIO) += sch_taprio.o
+obj-$(CONFIG_NET_SCH_DUALPI2) += sch_dualpi2.o

obj-$(CONFIG_NET_CLS_U32) += cls_u32.o
obj-$(CONFIG_NET_CLS_ROUTE4) += cls_route.o
diff --git a/net/sched/sch_dualpi2.c b/net/sched/sch_dualpi2.c
new file mode 100644
index 000000000000..c6c851499d35
--- /dev/null
+++ b/net/sched/sch_dualpi2.c
@@ -0,0 +1,746 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2019 Nokia.
+ *
+ * Author: Koen De Schepper <[email protected]>
+ * Author: Olga Albisser <[email protected]>
+ * Author: Henrik Steen <[email protected]>
+ * Author: Olivier Tilmans <[email protected]>
+ *
+ * DualPI Improved with a Square (dualpi2):
+ * Supports scalable congestion controls (e.g., DCTCP)
+ * Supports coupled dual-queue with PI2
+ * Supports L4S ECN identifier
+ *
+ * References:
+ * draft-ietf-tsvwg-aqm-dualq-coupled:
+ * http://tools.ietf.org/html/draft-ietf-tsvwg-aqm-dualq-coupled-08
+ * De Schepper, Koen, et al. "PI 2: A linearized AQM for both classic and
+ * scalable TCP." in proc. ACM CoNEXT'16, 2016.
+ */
+
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/errno.h>
+#include <linux/skbuff.h>
+#include <net/pkt_sched.h>
+#include <net/inet_ecn.h>
+#include <linux/string.h>
+
+/* 32b enables support for flows with windows up to ~8.6 * 1e9 packets
+ * i.e., twice the maximal snd_cwnd.
+ * MAX_PROB must be consistent with the RNG in dualpi2_roll().
+ */
+#define MAX_PROB ((u32)(~((u32)0)))
+/* alpha/beta values exchanged over netlink are in units of 256ns */
+#define ALPHA_BETA_SHIFT 8
+/* Scaled values of alpha/beta must fit in 32b to avoid overflow in later
+ * computations. Consequently (see dualpi2_scale_alpha_beta()), their
+ * netlink-provided values can use at most 31b, i.e. be at most (2^23)-1
+ * (~4MHz) as those are given in 1/256th. This enables tuning alpha/beta to
+ * control flows whose maximal RTTs can range from usec up to a few secs.
+ */
+#define ALPHA_BETA_MAX ((1U << 31) - 1)
+/* Internal alpha/beta are in units of 64ns.
+ * This makes it possible to use all alpha/beta values in the allowed range
+ * without loss of precision due to rounding when scaling them internally,
+ * e.g., dualpi2_scale_alpha_beta(1) will not round down to 0.
+ */
+#define ALPHA_BETA_GRANULARITY 6
+#define ALPHA_BETA_SCALING (ALPHA_BETA_SHIFT - ALPHA_BETA_GRANULARITY)
+/* We express the weights (wc, wl) in %, i.e., wc + wl = 100 */
+#define MAX_WC 100
+
+struct dualpi2_sched_data {
+ struct Qdisc *l_queue; /* The L4S LL queue */
+ struct Qdisc *sch; /* The classic queue (owner of this struct) */
+
+ struct { /* PI2 parameters */
+ u64 target; /* Target delay in nanoseconds */
+ u32 tupdate;/* timer frequency (in jiffies) */
+ u32 prob; /* Base PI2 probability */
+ u32 alpha; /* Gain factor for the integral rate response */
+ u32 beta; /* Gain factor for the proportional response */
+ struct timer_list timer; /* prob update timer */
+ } pi2;
+
+ struct { /* Step AQM (L4S queue only) parameters */
+ u32 thresh; /* Step threshold */
+ bool in_packets;/* Whether the step is in packets or time */
+ } step;
+
+ struct { /* Classic queue starvation protection */
+ s32 credit; /* Credit (sign indicates which queue) */
+ s32 init; /* Reset value of the credit */
+ u8 wc; /* C queue weight (between 0 and MAX_WC) */
+ u8 wl; /* L queue weight (MAX_WC - wc) */
+ } c_protection;
+
+ /* General dualQ parameters */
+ u8 coupling_factor;/* Coupling factor (k) between both queues */
+ u8 ecn_mask; /* Mask to match L4S packets */
+ bool drop_early; /* Drop at enqueue instead of dequeue if true */
+ bool drop_overload; /* Drop (1) on overload, or overflow (0) */
+
+ /* Statistics */
+ u64 qdelay_c; /* Classic Q delay */
+ u64 qdelay_l; /* L4S Q delay */
+ u32 packets_in_c; /* Number of packets enqueued in C queue */
+ u32 packets_in_l; /* Number of packets enqueued in L queue */
+ u32 maxq; /* maximum queue size */
+ u32 ecn_mark; /* packets marked with ECN */
+ u32 step_marks; /* ECN marks due to the step AQM */
+
+ struct { /* Deferred drop statistics */
+ u32 cnt; /* Packets dropped */
+ u32 len; /* Bytes dropped */
+ } deferred_drops;
+};
+
+struct dualpi2_skb_cb {
+ u64 ts; /* Timestamp at enqueue */
+ u8 apply_step:1,/* Can we apply the step threshold */
+ l4s:1, /* Packet has been classified as L4S */
+ ect:2; /* Packet ECT codepoint */
+};
+
+static inline struct dualpi2_skb_cb *dualpi2_skb_cb(struct sk_buff *skb)
+{
+ qdisc_cb_private_validate(skb, sizeof(struct dualpi2_skb_cb));
+ return (struct dualpi2_skb_cb *)qdisc_skb_cb(skb)->data;
+}
+
+static inline u64 skb_sojourn_time(struct sk_buff *skb, u64 reference)
+{
+ return reference - dualpi2_skb_cb(skb)->ts;
+}
+
+static inline u64 qdelay_in_ns(struct Qdisc *q, u64 now)
+{
+ struct sk_buff *skb = qdisc_peek_head(q);
+
+ return skb ? skb_sojourn_time(skb, now) : 0;
+}
+
+static inline u32 dualpi2_scale_alpha_beta(u32 param)
+{
+ u64 tmp = ((u64)param * MAX_PROB >> ALPHA_BETA_SCALING);
+ do_div(tmp, NSEC_PER_SEC);
+ return tmp;
+}
+
+static inline u32 dualpi2_unscale_alpha_beta(u32 param)
+{
+ u64 tmp = ((u64)param * NSEC_PER_SEC << ALPHA_BETA_SCALING);
+ do_div(tmp, MAX_PROB);
+ return tmp;
+}
+
+static inline bool skb_is_l4s(struct sk_buff *skb)
+{
+ return dualpi2_skb_cb(skb)->l4s != 0;
+}
+
+static inline void dualpi2_mark(struct dualpi2_sched_data *q,
+ struct sk_buff *skb)
+{
+ if (INET_ECN_set_ce(skb))
+ q->ecn_mark++;
+}
+
+static inline void dualpi2_reset_c_protection(struct dualpi2_sched_data *q)
+{
+ q->c_protection.credit = q->c_protection.init;
+}
+
+static inline void dualpi2_calculate_c_protection(struct Qdisc *sch,
+ struct dualpi2_sched_data *q,
+ u32 wc)
+{
+ q->c_protection.wc = wc;
+ q->c_protection.wl = MAX_WC - wc;
+ /* Start with L queue if wl > wc */
+ q->c_protection.init = (s32)psched_mtu(qdisc_dev(sch)) *
+ ((int)q->c_protection.wc - (int)q->c_protection.wl);
+ dualpi2_reset_c_protection(q);
+}
+
+static inline bool dualpi2_roll(u32 prob)
+{
+ return prandom_u32() <= prob;
+}
+
+static inline bool dualpi2_squared_roll(struct dualpi2_sched_data *q)
+{
+ return dualpi2_roll(q->pi2.prob) && dualpi2_roll(q->pi2.prob);
+}
+
+static inline bool dualpi2_is_overloaded(u64 prob)
+{
+ return prob > MAX_PROB;
+}
+
+static bool must_drop(struct Qdisc *sch, struct dualpi2_sched_data *q,
+ struct sk_buff *skb)
+{
+ u64 local_l_prob;
+
+ /* Never drop if we have fewer than 2 mtu-sized packets;
+ * similar to min_th in RED.
+ */
+ if (sch->qstats.backlog < 2 * psched_mtu(qdisc_dev(sch)))
+ return false;
+
+ local_l_prob = (u64)q->pi2.prob * q->coupling_factor;
+
+ if (skb_is_l4s(skb)) {
+ if (dualpi2_is_overloaded(local_l_prob)) {
+ /* On overload, preserve delay by doing a classic drop
+ * in the L queue. Otherwise, let both queues grow until
+ * we reach the limit and cannot enqueue anymore
+ * (sacrifice delay to avoid drops).
+ */
+ if (q->drop_overload && dualpi2_squared_roll(q))
+ goto drop;
+ else
+ goto mark;
+ /* Scalable marking has a (prob * k) probability */
+ } else if (dualpi2_roll(local_l_prob)) {
+ goto mark;
+ }
+ /* Apply classic marking with a (prob * prob) probability.
+ * Force drops for ECN-capable traffic on overload.
+ */
+ } else if (dualpi2_squared_roll(q)) {
+ if (dualpi2_skb_cb(skb)->ect &&
+ !dualpi2_is_overloaded(local_l_prob))
+ goto mark;
+ else
+ goto drop;
+ }
+ return false;
+
+mark:
+ dualpi2_mark(q, skb);
+ return false;
+
+drop:
+ return true;
+}
+
+static void dualpi2_skb_classify(struct dualpi2_sched_data *q,
+ struct sk_buff *skb)
+{
+ struct dualpi2_skb_cb *cb = dualpi2_skb_cb(skb);
+ int wlen = skb_network_offset(skb);
+
+ switch (tc_skb_protocol(skb)) {
+ case htons(ETH_P_IP):
+ wlen += sizeof(struct iphdr);
+ if (!pskb_may_pull(skb, wlen) ||
+ skb_try_make_writable(skb, wlen))
+ goto not_ecn;
+
+ cb->ect = ipv4_get_dsfield(ip_hdr(skb)) & INET_ECN_MASK;
+ break;
+ case htons(ETH_P_IPV6):
+ wlen += sizeof(struct ipv6hdr);
+ if (!pskb_may_pull(skb, wlen) ||
+ skb_try_make_writable(skb, wlen))
+ goto not_ecn;
+
+ cb->ect = ipv6_get_dsfield(ipv6_hdr(skb)) & INET_ECN_MASK;
+ break;
+ default:
+ goto not_ecn;
+ }
+ cb->l4s = (cb->ect & q->ecn_mask) != 0;
+ return;
+
+not_ecn:
+ /* Non-ECN-capable or non-pullable/non-writable packets can only be
+ * dropped, hence they go to the classic queue.
+ */
+ cb->ect = INET_ECN_NOT_ECT;
+ cb->l4s = 0;
+}
+
+static int dualpi2_qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+ struct sk_buff **to_free)
+{
+ struct dualpi2_sched_data *q = qdisc_priv(sch);
+ int err;
+
+ if (unlikely(qdisc_qlen(sch) >= sch->limit)) {
+ qdisc_qstats_overlimit(sch);
+ err = NET_XMIT_DROP;
+ goto drop;
+ }
+
+ dualpi2_skb_classify(q, skb);
+
+ /* drop early if configured */
+ if (q->drop_early && must_drop(sch, q, skb)) {
+ err = NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
+ goto drop;
+ }
+
+ dualpi2_skb_cb(skb)->ts = ktime_get_ns();
+
+ if (qdisc_qlen(sch) > q->maxq)
+ q->maxq = qdisc_qlen(sch);
+
+ if (skb_is_l4s(skb)) {
+ /* Only apply the step if a queue is building up */
+ dualpi2_skb_cb(skb)->apply_step = qdisc_qlen(q->l_queue) > 1;
+ /* Keep the overall qdisc stats consistent */
+ ++sch->q.qlen;
+ qdisc_qstats_backlog_inc(sch, skb);
+ ++q->packets_in_l;
+ return qdisc_enqueue_tail(skb, q->l_queue);
+ }
+ ++q->packets_in_c;
+ return qdisc_enqueue_tail(skb, sch);
+
+drop:
+ qdisc_drop(skb, sch, to_free);
+ return err;
+}
+
+static struct sk_buff *dualpi2_qdisc_dequeue(struct Qdisc *sch)
+{
+ struct dualpi2_sched_data *q = qdisc_priv(sch);
+ struct sk_buff *skb;
+ int qlen_c, credit_change;
+
+pick_packet:
+ /* L queue packets are also accounted for in qdisc_qlen(sch)! */
+ qlen_c = qdisc_qlen(sch) - qdisc_qlen(q->l_queue);
+ skb = NULL;
+ /* We can drop after qdisc_dequeue_head() calls.
+ * Manage statistics by hand to keep them consistent if that happens.
+ */
+ if (qdisc_qlen(q->l_queue) > 0 &&
+ (qlen_c <= 0 || q->c_protection.credit <= 0)) {
+ /* Dequeue and increase the credit by wc if qlen_c != 0 */
+ skb = __qdisc_dequeue_head(&q->l_queue->q);
+ credit_change = qlen_c ?
+ q->c_protection.wc * qdisc_pkt_len(skb) : 0;
+ /* The global backlog will be updated later. */
+ qdisc_qstats_backlog_dec(q->l_queue, skb);
+ /* Propagate the dequeue to the global stats. */
+ --sch->q.qlen;
+ } else if (qlen_c > 0) {
+ /* Dequeue and decrease the credit by wl if qlen_l != 0 */
+ skb = __qdisc_dequeue_head(&sch->q);
+ credit_change = qdisc_qlen(q->l_queue) ?
+ (s32)(-1) * q->c_protection.wl * qdisc_pkt_len(skb) : 0;
+ } else {
+ dualpi2_reset_c_protection(q);
+ goto exit;
+ }
+ qdisc_qstats_backlog_dec(sch, skb);
+
+ /* Drop on dequeue? */
+ if (!q->drop_early && must_drop(sch, q, skb)) {
+ ++q->deferred_drops.cnt;
+ q->deferred_drops.len += qdisc_pkt_len(skb);
+ consume_skb(skb);
+ qdisc_qstats_drop(sch);
+ /* try next packet */
+ goto pick_packet;
+ }
+
+ /* Apply the Step AQM to packets coming out of the L queue. */
+ if (skb_is_l4s(skb)) {
+ u64 qdelay = 0;
+
+ if (q->step.in_packets)
+ qdelay = qdisc_qlen(q->l_queue);
+ else
+ qdelay = skb_sojourn_time(skb, ktime_get_ns());
+ /* Apply the step */
+ if (likely(dualpi2_skb_cb(skb)->apply_step) &&
+ qdelay > q->step.thresh) {
+ dualpi2_mark(q, skb);
+ ++q->step_marks;
+ }
+ qdisc_bstats_update(q->l_queue, skb);
+ }
+
+ q->c_protection.credit += credit_change;
+ qdisc_bstats_update(sch, skb);
+
+exit:
+ /* We cannot call qdisc_tree_reduce_backlog() if our qlen is 0,
+ * or HTB crashes.
+ */
+ if (q->deferred_drops.cnt && qdisc_qlen(sch)) {
+ qdisc_tree_reduce_backlog(sch, q->deferred_drops.cnt,
+ q->deferred_drops.len);
+ q->deferred_drops.cnt = 0;
+ q->deferred_drops.len = 0;
+ }
+ return skb;
+}
+
+static s64 __scale_delta(u64 diff)
+{
+ do_div(diff, (1 << (ALPHA_BETA_GRANULARITY + 1)) - 1);
+ return diff;
+}
+
+static u32 calculate_probability(struct Qdisc *sch)
+{
+ struct dualpi2_sched_data *q = qdisc_priv(sch);
+ u64 qdelay, qdelay_old, now;
+ u32 new_prob;
+ s64 delta;
+
+ qdelay_old = max_t(u64, q->qdelay_c, q->qdelay_l);
+ now = ktime_get_ns();
+ q->qdelay_l = qdelay_in_ns(q->l_queue, now);
+ q->qdelay_c = qdelay_in_ns(sch, now);
+ qdelay = max_t(u64, q->qdelay_c, q->qdelay_l);
+ /* Alpha and beta take at most 32b, i.e., the delay difference would
+ * overflow for queueing delay differences > ~4.2sec.
+ */
+ delta = ((s64)qdelay - q->pi2.target) * q->pi2.alpha;
+ delta += ((s64)qdelay - qdelay_old) * q->pi2.beta;
+ /* Prevent overflow */
+ if (delta > 0) {
+ new_prob = __scale_delta(delta) + q->pi2.prob;
+ if (new_prob < q->pi2.prob)
+ new_prob = MAX_PROB;
+ } else {
+ new_prob = q->pi2.prob - __scale_delta(delta * -1);
+ /* Prevent underflow */
+ if (new_prob > q->pi2.prob)
+ new_prob = 0;
+ }
+ /* If we do not drop on overload, ensure we cap the L4S probability to
+ * 100% to keep window fairness when overflowing.
+ */
+ if (!q->drop_overload)
+ return min_t(u32, new_prob, MAX_PROB / q->coupling_factor);
+ return new_prob;
+}
+
+static void dualpi2_timer(struct timer_list *timer)
+{
+ struct dualpi2_sched_data *q = from_timer(q, timer, pi2.timer);
+ struct Qdisc *sch = q->sch;
+ spinlock_t *root_lock; /* Lock to access the head of both queues. */
+
+ root_lock = qdisc_lock(qdisc_root_sleeping(sch));
+ spin_lock(root_lock);
+
+ q->pi2.prob = calculate_probability(sch);
+ mod_timer(&q->pi2.timer, jiffies + q->pi2.tupdate);
+
+ spin_unlock(root_lock);
+}
+
+static const struct nla_policy dualpi2_policy[TCA_DUALPI2_MAX + 1] = {
+ [TCA_DUALPI2_LIMIT] = {.type = NLA_U32},
+ [TCA_DUALPI2_TARGET] = {.type = NLA_U32},
+ [TCA_DUALPI2_TUPDATE] = {.type = NLA_U32},
+ [TCA_DUALPI2_ALPHA] = {.type = NLA_U32},
+ [TCA_DUALPI2_BETA] = {.type = NLA_U32},
+ [TCA_DUALPI2_STEP_THRESH] = {.type = NLA_U32},
+ [TCA_DUALPI2_STEP_PACKETS] = {.type = NLA_U8},
+ [TCA_DUALPI2_COUPLING] = {.type = NLA_U8},
+ [TCA_DUALPI2_DROP_OVERLOAD] = {.type = NLA_U8},
+ [TCA_DUALPI2_DROP_EARLY] = {.type = NLA_U8},
+ [TCA_DUALPI2_C_PROTECTION] = {.type = NLA_U8},
+ [TCA_DUALPI2_ECN_MASK] = {.type = NLA_U8},
+};
+
+static int dualpi2_change(struct Qdisc *sch, struct nlattr *opt,
+ struct netlink_ext_ack *extack)
+{
+ struct dualpi2_sched_data *q = qdisc_priv(sch);
+ struct nlattr *tb[TCA_DUALPI2_MAX + 1];
+ unsigned int old_qlen, dropped = 0;
+ int err;
+
+ if (!opt)
+ return -EINVAL;
+ err = nla_parse_nested_deprecated(tb, TCA_DUALPI2_MAX, opt,
+ dualpi2_policy, extack);
+ if (err < 0)
+ return err;
+
+ sch_tree_lock(sch);
+
+ if (tb[TCA_DUALPI2_LIMIT]) {
+ u32 limit = nla_get_u32(tb[TCA_DUALPI2_LIMIT]);
+
+ if (!limit) {
+ NL_SET_ERR_MSG_ATTR(extack, tb[TCA_DUALPI2_LIMIT],
+ "limit must be greater than 0 !");
+ sch_tree_unlock(sch); /* do not leak the tree lock on error */
+ return -EINVAL;
+ }
+ sch->limit = limit;
+ }
+
+ if (tb[TCA_DUALPI2_TARGET])
+ q->pi2.target = (u64)nla_get_u32(tb[TCA_DUALPI2_TARGET]) *
+ NSEC_PER_USEC;
+
+ if (tb[TCA_DUALPI2_TUPDATE]) {
+ u64 tupdate =
+ usecs_to_jiffies(nla_get_u32(tb[TCA_DUALPI2_TUPDATE]));
+
+ if (!tupdate) {
+ NL_SET_ERR_MSG_ATTR(extack, tb[TCA_DUALPI2_TUPDATE],
+ "tupdate cannot be 0 jiffies!");
+ sch_tree_unlock(sch); /* do not leak the tree lock on error */
+ return -EINVAL;
+ }
+ q->pi2.tupdate = tupdate;
+ }
+
+ if (tb[TCA_DUALPI2_ALPHA]) {
+ u32 alpha = nla_get_u32(tb[TCA_DUALPI2_ALPHA]);
+
+ if (alpha > ALPHA_BETA_MAX) {
+ NL_SET_ERR_MSG_ATTR(extack, tb[TCA_DUALPI2_ALPHA],
+ "alpha is too large!");
+ sch_tree_unlock(sch); /* do not leak the tree lock on error */
+ return -EINVAL;
+ }
+ q->pi2.alpha = dualpi2_scale_alpha_beta(alpha);
+ }
+
+ if (tb[TCA_DUALPI2_BETA]) {
+ u32 beta = nla_get_u32(tb[TCA_DUALPI2_BETA]);
+
+ if (beta > ALPHA_BETA_MAX) {
+ NL_SET_ERR_MSG_ATTR(extack, tb[TCA_DUALPI2_BETA],
+ "beta is too large!");
+ sch_tree_unlock(sch); /* do not leak the tree lock on error */
+ return -EINVAL;
+ }
+ q->pi2.beta = dualpi2_scale_alpha_beta(beta);
+ }
+
+ if (tb[TCA_DUALPI2_STEP_THRESH])
+ q->step.thresh = nla_get_u32(tb[TCA_DUALPI2_STEP_THRESH]) *
+ NSEC_PER_USEC;
+
+ if (tb[TCA_DUALPI2_COUPLING]) {
+ u8 coupling = nla_get_u8(tb[TCA_DUALPI2_COUPLING]);
+
+ if (!coupling) {
+ NL_SET_ERR_MSG_ATTR(extack, tb[TCA_DUALPI2_COUPLING],
+ "Must use a non-zero coupling!");
+ sch_tree_unlock(sch); /* do not leak the tree lock on error */
+ return -EINVAL;
+ }
+ q->coupling_factor = coupling;
+ }
+
+ if (tb[TCA_DUALPI2_STEP_PACKETS])
+ q->step.in_packets = nla_get_u8(tb[TCA_DUALPI2_STEP_PACKETS]);
+
+ if (tb[TCA_DUALPI2_DROP_OVERLOAD])
+ q->drop_overload = nla_get_u8(tb[TCA_DUALPI2_DROP_OVERLOAD]);
+
+ if (tb[TCA_DUALPI2_DROP_EARLY])
+ q->drop_early = nla_get_u8(tb[TCA_DUALPI2_DROP_EARLY]);
+
+ if (tb[TCA_DUALPI2_C_PROTECTION]) {
+ u8 wc = nla_get_u8(tb[TCA_DUALPI2_C_PROTECTION]);
+
+ if (wc > MAX_WC) {
+ NL_SET_ERR_MSG_ATTR(extack,
+ tb[TCA_DUALPI2_C_PROTECTION],
+ "c_protection must be <= 100!");
+ sch_tree_unlock(sch); /* do not leak the tree lock on error */
+ return -EINVAL;
+ }
+ dualpi2_calculate_c_protection(sch, q, wc);
+ }
+
+ if (tb[TCA_DUALPI2_ECN_MASK])
+ q->ecn_mask = nla_get_u8(tb[TCA_DUALPI2_ECN_MASK]);
+
+ /* Drop excess packets if new limit is lower */
+ old_qlen = qdisc_qlen(sch);
+ while (qdisc_qlen(sch) > sch->limit) {
+ struct sk_buff *skb = __qdisc_dequeue_head(&sch->q);
+
+ dropped += qdisc_pkt_len(skb);
+ qdisc_qstats_backlog_dec(sch, skb);
+ rtnl_qdisc_drop(skb, sch);
+ }
+ qdisc_tree_reduce_backlog(sch, old_qlen - qdisc_qlen(sch), dropped);
+
+ sch_tree_unlock(sch);
+ return 0;
+}
+
+static void dualpi2_reset_default(struct dualpi2_sched_data *q)
+{
+ q->sch->limit = 10000; /* Holds 125ms at 1G */
+
+ q->pi2.target = 15 * NSEC_PER_MSEC;
+ q->pi2.tupdate = usecs_to_jiffies(16 * USEC_PER_MSEC);
+ q->pi2.alpha = dualpi2_scale_alpha_beta(41); /* ~0.16 Hz in 1/256th */
+ q->pi2.beta = dualpi2_scale_alpha_beta(819); /* ~3.2 Hz in 1/256th */
+ /* These values give a 10dB stability margin with max_rtt=100ms */
+
+ q->step.thresh = 1 * NSEC_PER_MSEC; /* 1ms */
+ q->step.in_packets = false; /* Step in time not packets */
+
+ dualpi2_calculate_c_protection(q->sch, q, 10); /* Defaults to wc = 10 */
+
+ q->ecn_mask = INET_ECN_ECT_1; /* l4s-id */
+ q->coupling_factor = 2; /* window fairness for equal RTTs */
+ q->drop_overload = true; /* Preserve latency by dropping on overload */
+ q->drop_early = false; /* PI2 drop on dequeue */
+}
+
+static int dualpi2_init(struct Qdisc *sch, struct nlattr *opt,
+ struct netlink_ext_ack *extack)
+{
+ struct dualpi2_sched_data *q = qdisc_priv(sch);
+
+ q->l_queue = qdisc_create_dflt(sch->dev_queue, &pfifo_qdisc_ops,
+ TC_H_MAKE(sch->handle, 1), extack);
+ if (!q->l_queue)
+ return -ENOMEM;
+
+ q->sch = sch;
+ dualpi2_reset_default(q);
+ timer_setup(&q->pi2.timer, dualpi2_timer, 0);
+
+ if (opt) {
+ int err = dualpi2_change(sch, opt, extack);
+
+ if (err)
+ return err;
+ }
+
+ mod_timer(&q->pi2.timer, jiffies + (HZ >> 1));
+ return 0;
+}
+
+static int dualpi2_dump(struct Qdisc *sch, struct sk_buff *skb)
+{
+ struct nlattr *opts = nla_nest_start_noflag(skb, TCA_OPTIONS);
+ struct dualpi2_sched_data *q = qdisc_priv(sch);
+ u64 step_thresh = q->step.thresh;
+ u64 target_usec = q->pi2.target;
+
+ if (!opts)
+ goto nla_put_failure;
+
+ do_div(target_usec, NSEC_PER_USEC);
+ if (!q->step.in_packets)
+ do_div(step_thresh, NSEC_PER_USEC);
+
+ if (nla_put_u32(skb, TCA_DUALPI2_LIMIT, sch->limit) ||
+ nla_put_u32(skb, TCA_DUALPI2_TARGET, target_usec) ||
+ nla_put_u32(skb, TCA_DUALPI2_TUPDATE,
+ jiffies_to_usecs(q->pi2.tupdate)) ||
+ nla_put_u32(skb, TCA_DUALPI2_ALPHA,
+ dualpi2_unscale_alpha_beta(q->pi2.alpha)) ||
+ nla_put_u32(skb, TCA_DUALPI2_BETA,
+ dualpi2_unscale_alpha_beta(q->pi2.beta)) ||
+ nla_put_u32(skb, TCA_DUALPI2_STEP_THRESH, step_thresh) ||
+ nla_put_u8(skb, TCA_DUALPI2_COUPLING, q->coupling_factor) ||
+ nla_put_u8(skb, TCA_DUALPI2_DROP_OVERLOAD, q->drop_overload) ||
+ nla_put_u8(skb, TCA_DUALPI2_STEP_PACKETS, q->step.in_packets) ||
+ nla_put_u8(skb, TCA_DUALPI2_DROP_EARLY, q->drop_early) ||
+ nla_put_u8(skb, TCA_DUALPI2_C_PROTECTION, q->c_protection.wc) ||
+ nla_put_u8(skb, TCA_DUALPI2_ECN_MASK, q->ecn_mask))
+ goto nla_put_failure;
+
+ return nla_nest_end(skb, opts);
+
+nla_put_failure:
+ nla_nest_cancel(skb, opts);
+ return -1;
+}
+
+static int dualpi2_dump_stats(struct Qdisc *sch, struct gnet_dump *d)
+{
+ struct dualpi2_sched_data *q = qdisc_priv(sch);
+ u64 qdelay_c_usec = q->qdelay_c;
+ u64 qdelay_l_usec = q->qdelay_l;
+ struct tc_dualpi2_xstats st = {
+ .prob = q->pi2.prob,
+ .packets_in_c = q->packets_in_c,
+ .packets_in_l = q->packets_in_l,
+ .maxq = q->maxq,
+ .ecn_mark = q->ecn_mark,
+ .credit = q->c_protection.credit,
+ .step_marks = q->step_marks,
+ };
+
+ do_div(qdelay_c_usec, NSEC_PER_USEC);
+ do_div(qdelay_l_usec, NSEC_PER_USEC);
+ st.delay_c = qdelay_c_usec;
+ st.delay_l = qdelay_l_usec;
+ return gnet_stats_copy_app(d, &st, sizeof(st));
+}
+
+static void dualpi2_reset(struct Qdisc *sch)
+{
+ struct dualpi2_sched_data *q = qdisc_priv(sch);
+
+ qdisc_reset_queue(sch);
+ qdisc_reset_queue(q->l_queue);
+ q->qdelay_c = 0;
+ q->qdelay_l = 0;
+ q->pi2.prob = 0;
+ q->packets_in_c = 0;
+ q->packets_in_l = 0;
+ q->maxq = 0;
+ q->ecn_mark = 0;
+ q->step_marks = 0;
+ dualpi2_reset_c_protection(q);
+}
+
+static void dualpi2_destroy(struct Qdisc *sch)
+{
+ struct dualpi2_sched_data *q = qdisc_priv(sch);
+
+ q->pi2.tupdate = 0;
+ del_timer_sync(&q->pi2.timer);
+ if (q->l_queue)
+ qdisc_put(q->l_queue);
+}
+
+static struct Qdisc_ops dualpi2_qdisc_ops __read_mostly = {
+ .id = "dualpi2",
+ .priv_size = sizeof(struct dualpi2_sched_data),
+ .enqueue = dualpi2_qdisc_enqueue,
+ .dequeue = dualpi2_qdisc_dequeue,
+ .peek = qdisc_peek_dequeued,
+ .init = dualpi2_init,
+ .destroy = dualpi2_destroy,
+ .reset = dualpi2_reset,
+ .change = dualpi2_change,
+ .dump = dualpi2_dump,
+ .dump_stats = dualpi2_dump_stats,
+ .owner = THIS_MODULE,
+};
+
+static int __init dualpi2_module_init(void)
+{
+ return register_qdisc(&dualpi2_qdisc_ops);
+}
+
+static void __exit dualpi2_module_exit(void)
+{
+ unregister_qdisc(&dualpi2_qdisc_ops);
+}
+
+module_init(dualpi2_module_init);
+module_exit(dualpi2_module_exit);
+
+MODULE_DESCRIPTION("Dual Queue with Proportional Integral controller Improved with a Square (dualpi2) scheduler");
+MODULE_AUTHOR("Koen De Schepper");
+MODULE_AUTHOR("Olga Albisser");
+MODULE_AUTHOR("Henrik Steen");
+MODULE_AUTHOR("Olivier Tilmans");
+MODULE_LICENSE("GPL");
--
2.23.0

2019-08-23 12:30:22

by Dave Taht

Subject: Re: [PATCH net-next v5] sched: Add dualpi2 qdisc

This is vastly improved code, thank you!

1) Since we're still duking it out over the meaning of the bits - not
just the SCE thing, but as best as I can
tell (but could be wrong) the NQB idea wants to put something into the
l4s fast queue? Or is NQB supposed to
be a third queue?

In those cases, the ecn_mask should just be mask.

2) Is the intent to make the drop probability 0 by default? (10 in the
pie rfc, not mentioned in the l4s rfc as yet)

3) has this been tested on a hw mq system as yet? (10gigE is typically
64 queues)

2019-08-23 23:21:03

by Tilmans, Olivier (Nokia - BE/Antwerp)

Subject: Re: [PATCH net-next v5] sched: Add dualpi2 qdisc

> 1) Since we're still duking it out over the meaning of the bits - not
> just the SCE thing, but as best as I can
> tell (but could be wrong) the NQB idea wants to put something into the
> l4s fast queue? Or is NQB supposed to
> be a third queue?

We can add support for NQB in the future, by expanding the
dualpi2_skb_classify() function. This is however out of scope at the
moment as NQB is not yet adopted by the TSV WG. I'd guess we may want more
than just the NQB DSCP codepoint in the L queue, which would then warrant
another way to classify traffic, e.g., using tc filter hints.

> In those cases, the ecn_mask should just be mask.

That is actually what it is at the moment: a mask on the two ecn bits.

> 2) Is the intent to make the drop probability 0 by default? (10 in the
> pie rfc, not mentioned in the l4s rfc as yet)

I assume you are referring to §5.1 of the PIE RFC, i.e., the switch to
pure drop once the computed marking probability is >10%?

The default for dualpi2 is also to enter a pure-drop mode on overload.
More precisely, we define overload as reaching a marking probability of
100% in the L queue, meaning an internal PI probability of 50% (as it
gets multiplied by the coupling factor which defaults to 2).
This is equivalent to a PIE probability of 25% (as the classic queue gets a
squared probability).
This drop mode means that packets in both queues will be subject to
random drops with a PI^2 probability. Additionally, all packets not
dropped in the L queue are CE marked.
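
The arithmetic above can be checked with a small sketch. It assumes only what
is stated here: the L queue probability is the internal PI probability times
the coupling factor (capped at 100%), and the classic queue probability is the
internal probability squared. The function names are hypothetical, and floating
point is used for readability; kernel code would use integer arithmetic.

```c
#include <assert.h>

/* Marking probability applied in the L queue: the internal PI
 * probability scaled by the coupling factor, capped at 100%. */
static double sketch_l_prob(double pi_prob, double coupling)
{
	double p = pi_prob * coupling;

	assert(pi_prob >= 0.0 && pi_prob <= 1.0);
	return p > 1.0 ? 1.0 : p;
}

/* Drop/mark probability applied in the classic queue: the squared
 * internal PI probability. */
static double sketch_c_prob(double pi_prob)
{
	return pi_prob * pi_prob;
}

/* Overload, as defined above: the L queue marking probability has
 * reached 100%. */
static int sketch_overloaded(double pi_prob, double coupling)
{
	return sketch_l_prob(pi_prob, coupling) >= 1.0;
}
```

With an internal probability of 0.5 and a coupling factor of 2, the L queue
probability is 1.0 and the classic probability is 0.25, i.e. the 50% and 25%
figures quoted above.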

We used to have a parameter to configure this overload threshold (IIRC
it was still in the pre-v4 patches), but found no real use for lowering
it, hence its removal.

Note that the drop on overload can be disabled, resulting in increased
latencies in both queues, 100% CE marking in the L queue, and eventually
a taildrop behaviour once the packet limit is reached.

> 3) has this been tested on a hw mq system as yet? (10gigE is typically
> 64 queues)

Yes, in a setup where 1/32/64/128VMs were behind an Intel X540-*, which indeed
has 64 internal queues. The VMs use a mix of long/short cubic/DCTCP connections
towards another server. I could not think about another use-case where a 10G
software switch would prove to be a bottleneck, i.e., where a queue would
happen.
The qdisc is, however, not optimized for mq systems. Could it cause performance
degradation if the server was severely resource constrained?

Also, saturating 10G required gro on the hypervisor, which in turn meant that
the step threshold of dualpi2 had to be increased to compensate for the
resulting large bursts. Maybe that is where being mq-aware would
help, i.e., by instantiating one dualpi2 instance per HW queue?
The AQM scheme itself is CPU friendly (lighter than PIE): computing the
probability takes <10 arithmetic ops and 5 comparisons once every 16ms, while
enqueue/dequeue can involve ~10 comparisons and at most 2 rng calls. It
should thus not increase the load too much if it was duplicated.
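
To give a feel for how little work the periodic probability update involves,
here is a sketch of a textbook PI update using the integral gain alpha and
proportional gain beta described in the patch. It is not the code from the
patch: the struct, the function name and the use of floating point are
assumptions made for readability (kernel code would use integer arithmetic).

```c
#include <assert.h>

struct sketch_pi {
	double prob;        /* internal PI probability, in [0, 1] */
	double target;      /* target queue delay, seconds */
	double last_qdelay; /* queue delay at the previous update, seconds */
	double alpha;       /* integral gain, applied to the delay error */
	double beta;        /* proportional gain, applied to the delay change */
};

/* One periodic update: a handful of additions, multiplications and
 * comparisons. */
static double sketch_pi_update(struct sketch_pi *pi, double qdelay)
{
	double delta;

	assert(qdelay >= 0.0);
	delta = pi->alpha * (qdelay - pi->target) +
		pi->beta * (qdelay - pi->last_qdelay);
	pi->prob += delta;
	if (pi->prob < 0.0)
		pi->prob = 0.0;
	else if (pi->prob > 1.0)
		pi->prob = 1.0;
	pi->last_qdelay = qdelay;
	return pi->prob;
}
```

The alpha term keeps pushing the probability up for as long as the queue delay
stays above the target, while the beta term reacts only to the change in delay
since the previous update.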

Best,
Olivier

2019-08-27 22:04:40

by David Miller

Subject: Re: [PATCH net-next v5] sched: Add dualpi2 qdisc

From: "Tilmans, Olivier (Nokia - BE/Antwerp)" <[email protected]>
Date: Thu, 22 Aug 2019 08:10:48 +0000

> +static inline struct dualpi2_skb_cb *dualpi2_skb_cb(struct sk_buff *skb)

Please do not use the inline keyword in foo.c files, let the compiler decide.

> +static struct sk_buff *dualpi2_qdisc_dequeue(struct Qdisc *sch)
> +{
> + struct dualpi2_sched_data *q = qdisc_priv(sch);
> + struct sk_buff *skb;
> + int qlen_c, credit_change;

Reverse christmas tree here, please.

> +static void dualpi2_timer(struct timer_list *timer)
> +{
> + struct dualpi2_sched_data *q = from_timer(q, timer, pi2.timer);
> + struct Qdisc *sch = q->sch;
> + spinlock_t *root_lock; /* Lock to access the head of both queues. */

Likewise, and please remove this comment; it makes the variable declarations
look odd.
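
For readers new to the netdev conventions referenced above: "reverse christmas
tree" means ordering local variable declarations from the longest line to the
shortest. The sketch below shows the requested layout for the dequeue
function. The stand-in types, the sketch_ names and the function body are
placeholders so that it compiles in userspace; they are not the code from the
patch.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in types so the sketch compiles in userspace; the real ones
 * live in the kernel headers. */
struct sk_buff { int len; };
struct dualpi2_sched_data { struct sk_buff *head; int credit; };
struct Qdisc { struct dualpi2_sched_data priv; };

/* No "inline" keyword: the compiler decides. */
static struct dualpi2_sched_data *sketch_priv(struct Qdisc *sch)
{
	return &sch->priv;
}

static struct sk_buff *sketch_dequeue(struct Qdisc *sch)
{
	/* Declarations ordered from the longest line to the shortest. */
	struct dualpi2_sched_data *q = sketch_priv(sch);
	int qlen_c, credit_change;
	struct sk_buff *skb;

	assert(sch != NULL);
	/* Placeholder body: hand out the single queued packet, if any. */
	skb = q->head;
	qlen_c = skb ? 1 : 0;
	credit_change = skb ? skb->len : 0;
	q->credit -= credit_change * qlen_c;
	q->head = NULL;
	return skb;
}
```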

2019-08-28 16:57:25

by Dave Taht

Subject: Re: [PATCH net-next v5] sched: Add dualpi2 qdisc

On Wed, Aug 28, 2019 at 7:00 AM Bob Briscoe <[email protected]> wrote:
>
> Olivier, Dave,
>
> On 23/08/2019 13:59, Tilmans, Olivier (Nokia - BE/Antwerp) wrote:
>
> as best as I can
> tell (but could be wrong) the NQB idea wants to put something into the
> l4s fast queue? Or is NQB supposed to
> be a third queue?
>
> NQB is not supported in this release of the code. But FYI, it's not for a third queue.

At the time of my code review of dualpi I had not gone back to review
the NQB draft fully.

> We can add support for NQB in the future, by expanding the
> dualpi2_skb_classify() function. This is however out of scope at the
> moment as NQB is not yet adopted by the TSV WG. I'd guess we may want more
> than just the NQB DSCP codepoint in the L queue, which would then warrant
> another way to classify traffic, e.g., using tc filter hints.

Yes, you'll find folk are fans of being able to put tc (and ebpf)
filters in front of various qdiscs for classification, logging, and/or
dropping behavior.

A fairly typical stanza is here:
https://github.com/torvalds/linux/blob/master/net/sched/sch_sfq.c#L171
to line 193.

> The IETF adopted the NQB draft at the meeting just passed in July, but the draft has not yet been updated to reflect that: https://tools.ietf.org/html/draft-white-tsvwg-nqb-02

Hmmm... no. I think Olivier's statement was correct.

NQB was put into the "call for adoption into tsvwg" state (
https://mailarchive.ietf.org/arch/msg/tsvwg/fjyYQgU9xQCNalwPO7v9-al6mGk
) in the tsvwg on Aug 21st, which
doesn't mean "adopted by the ietf", either. In response to that call
several folk did put in (rather pithy)
comments on the current state of the NQB idea and internet draft, starting here:

https://mailarchive.ietf.org/arch/msg/tsvwg/hZGjm899t87YZl9JJUOWQq4KBsk

For those here that are not familiar with IETF processes (and there
are many!) there are "internet drafts" that may or may not become
working group items, that if they become accepted by the working group
may or may not evolve to become actual RFCs. Unlike lkml usage where
we use RFC in its original meaning as a mere request for comments,
there are several classes of IETF RFC - standards track, experimental,
and informational - once they are adopted and published by the
ietf.

There are RFCs for how they do RFCs, and BCPs and other TLAs, and if
you really want to know more about how the ietf processes actually
work, please contact me off list. Anyway...

Much of the experimental L4S architecture itself (of which NQB MAY
become part, and dualpi/tcpprague/etc are) is presently an accepted
tsvwg wg item with a list of 11 problems on the bug database here (
https://trac.ietf.org/trac/tsvwg/report/1?sort=ticket&asc=1&page=1 ).
IMHO it's not currently near last call for standardization as a set of
experimental RFCs.

L4S takes advantage of several RFCs that have
indeed been published as experimental, notably, RFC8311, which too few
have read as yet.

While using up ECT1 in the L4S code as an identifier and not as a
congestion indicator is very controversial for me (
https://lwn.net/Articles/783673/ ), AND I'd rather it not be baked
into the linux api for dualpi should this identifier not be chosen by
the wg (thus my suggestion of a mask or lookup table)...

... I also dearly would like both sides of this code - dualpi and tcp
prague - in a simultaneously testable and high quality state. Without
that, many core ideas in dualpi cannot be tested, nor objectively
evaluated against other tcps and qdiscs using rfc3168 behavior along
the path. Multiple experimental ideas in RFC8311 (such as those in
section 4.3) have also not been re-evaluated in any context.

Is the known-to-work reference codebase for "tcp prague" still 3.19 based?

> The draft requests 0x2A (decimal 42) as the DSCP but, until the IETF converges on a specific DSCP for NQB, I believe we should not code in a default classifier anyway.
>
>
>
> Bob
>
> --
> ________________________________________________________________
> Bob Briscoe http://bobbriscoe.net/

--

Dave Täht
CTO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-831-205-9740

2019-08-29 22:20:18

by Bob Briscoe

Subject: Re: [PATCH net-next v5] sched: Add dualpi2 qdisc

Dave,

On 28/08/2019 17:55, Dave Taht wrote:
> On Wed, Aug 28, 2019 at 7:00 AM Bob Briscoe <[email protected]> wrote:
>> Olivier, Dave,
>>
>> On 23/08/2019 13:59, Tilmans, Olivier (Nokia - BE/Antwerp) wrote:
>>
>> as best as I can
>> tell (but could be wrong) the NQB idea wants to put something into the
>> l4s fast queue? Or is NQB supposed to
>> be a third queue?
>>
>> NQB is not supported in this release of the code. But FYI, it's not for a third queue.
> At the time of my code review of dualpi I had not gone back to review
> the NQB draft fully.
>
>> We can add support for NQB in the future, by expanding the
>> dualpi2_skb_classify() function. This is however out of scope at the
>> moment as NQB is not yet adopted by the TSV WG. I'd guess we may want more
>> than just the NQB DSCP codepoint in the L queue, which would then warrant
>> another way to classify traffic, e.g., using tc filter hints.
> Yes, you'll find folk are fans of being able to put tc (and ebpf)
> filters in front of various qdiscs for classification, logging, and/or
> dropping behavior.
>
> A fairly typical stanza is here:
> https://github.com/torvalds/linux/blob/master/net/sched/sch_sfq.c#L171
> to line 193.
Yes, I got a student to add hooks for the Linux classification
architecture (either adding more, or overriding the defaults) a couple
of years ago, along with creating a classful structure. But his
unfinished branch got left dangling once he graduated and is now way out
of date. It's still our intention to take that direction tho.

>
>> The IETF adopted the NQB draft at the meeting just passed in July, but the draft has not yet been updated to reflect that: https://tools.ietf.org/html/draft-white-tsvwg-nqb-02
> Hmmm... no. I think Olivier's statement was correct.
>
> NQB was put into the "call for adoption into tsvwg" state (
> https://mailarchive.ietf.org/arch/msg/tsvwg/fjyYQgU9xQCNalwPO7v9-al6mGk
> ) in the tsvwg aug 21st, which
> doesn't mean "adopted by the ietf", either.
You're right.

I've been away from all this for a while. In the tsvwg meeting there
were perhaps a couple of dozen folks stating support and no-one against,
so I had (wrongly) extrapolated from that - I should have checked the
status of the ML discussion first.

> In response to that call
> several folk did put in (rather pithy),
> comments on the current state of the NQB idea and internet draft, starting here:
>
> https://mailarchive.ietf.org/arch/msg/tsvwg/hZGjm899t87YZl9JJUOWQq4KBsk

> While using up ECT1 in the L4S code as an identifier and not as a
> congestion indicator is very controversial for me (
> https://lwn.net/Articles/783673/ ), AND I'd rather it not be baked
> into the linux api for dualpi should this identifier not be chosen by
> the wg (thus my suggestion of a mask or lookup table)...
That ship has sailed. You can consider it controversial if you want, but
the tsvwg decided to use ECT(1) as an identifier for L4S after long
discussions back in 2016. Years of a large number of people's work was
predicated on that decision. So the dualpi2 code reflects the way the
IETF is approaching this.

>
> ... I also dearly would like both sides of this code - dualpi and tcp
> prague - in a simultaneously testable and high quality state. Without
> that, many core ideas in dualpi cannot be tested, nor objectively
> evaluated against other tcps and qdiscs using rfc3168 behavior along
> the path. Multiple experimental ideas in RFC8311 (such as those in
> section 4.3) have also not been re-evaluated in any context.
We're working on that - top priority now.
>
> Is the known to work reference codebase for "tcp prague" still 3.19 based?
It is, but Olivier recently found the elusive cause of the problem that
made later versions bursty. So we're getting close.

Bob
>> Bob
>>
>> --
>> ________________________________________________________________
>> Bob Briscoe http://bobbriscoe.net/
>
>
> --
>
> Dave Täht
> CTO, TekLibre, LLC
> http://www.teklibre.com
> Tel: 1-831-205-9740

--
________________________________________________________________
Bob Briscoe http://bobbriscoe.net/