I was tickled to see that expiring packets based on 'time in queue'
was already a key design feature of the IPN:
http://www.science20.com/interwebometry/lessons_nasa_can_learn_internet-84861
and that the idea was also suggested independently in the comments on
Saturday's CACM article:
http://queue.acm.org/detail.cfm?id=2071893
Making decisions based on time in queue (TiQ), rather than length of
queue, would seem to be a win, especially for wireless, but also for
anything that does 'soft' traffic shaping with HTB-like qdiscs and
sub-qdiscs. Anything that is not running at GigE-plus speeds would
benefit.
... since basically what's been happening with bufferbloat is a too-early
implementation of the IPN, with detours between here and the moon!

... and so far I haven't seen any major theoretical holes in TiQ, except
for deciding how long is too long before considering a packet 'expired'
(which can be measured in ms or tens of ms), and having reasonably
monotonic time.
I just wish I (or someone) could come up with a way to implement it in Linux
without multiple layer violations. skb->queuestamp has a nice ring to it, but
when I look at this portion of the stack I freely admit quivering in ignorance
and fear.
I did a bit of work on a set of timestamping fifos, but realized that
what needed to be measured was the entire time in queue, from entrance
to exit (through multiple scheduling qdiscs), and that the drop/no-drop
decision needs to be made as late as possible - and preferably, the
queue being pulled from should pull the next packet forward, so as to
inform the receiver that congestion is happening...
And Eric Dumazet also produced a preliminary patch a few weeks back
that tied timestamping to before the head of a queue, but that tried to
use a reserved field in the skb which, it appears, is not guaranteed to
be preserved from point A to point Z.
Thoughts?
--
Dave Täht
SKYPE: davetaht
US Tel: 1-239-829-5608
FR Tel: 0638645374
http://www.bufferbloat.net
Adaptative RED AQM for linux, based on the paper by Sally Floyd,
Ramakrishna Gummadi, and Scott Shenker, August 2001:
http://icir.org/floyd/papers/adaptiveRed.pdf

The goal of Adaptative RED is to make max_p a dynamic value between 1%
and 50%, so that the average queue length converges to a target of
min_th + (max_th - min_th) / 2.
Every 500 ms:
 if (avg > target and max_p <= 0.5)
  increase max_p : max_p += alpha;
 else if (avg < target and max_p >= 0.01)
  decrease max_p : max_p *= beta;

target : [min_th + 0.4*(max_th - min_th),
          min_th + 0.6*(max_th - min_th)].
alpha : min(0.01, max_p / 4)
beta : 0.9
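To make the AIMD step concrete: if max_p = 0.10 and avg stays above
target for one 500 ms period, alpha = min(0.01, 0.10/4) = 0.01, so
max_p becomes 0.11 (additive increase); if avg then falls below
target, max_p decays to 0.11 * 0.9 = 0.099 (multiplicative decrease).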
max_P is a Q0.32 fixed-point number (unsigned, with a 32-bit mantissa).
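For concreteness, Q0.32 encodes a probability p in [0,1) as the 32-bit
integer p * 2^32. A minimal userspace sketch of the encoding
(illustrative only, not part of the patch):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Q0.32: probability p in [0,1) stored as p * 2^32, i.e. 32
 * fractional bits and no integer bits.
 */
static uint32_t prob_to_q032(double p)
{
	return (uint32_t)(p * 4294967296.0 + 0.5);	/* round to nearest */
}

static double q032_to_prob(uint32_t q)
{
	return q / 4294967296.0;
}

int main(void)
{
	printf("1%%  -> %" PRIu32 "\n", prob_to_q032(0.01));	/* 42949673 */
	printf("50%% -> %" PRIu32 "\n", prob_to_q032(0.50));	/* 2147483648 */
	printf("and back: %f\n", q032_to_prob(prob_to_q032(0.01)));
	return 0;
}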
Changes against our RED implementation are:

max_p is no longer a negative power of two (1/(2^Plog)), but a Q0.32
fixed-point number, allowing the full range described in the
Adaptative RED paper.

To deliver a random number, we now use a reciprocal divide (that's
really a multiply), but this operation is done once per marked/dropped
packet in the RED_BETWEEN_TRESH window, so the added cost (compared to
the previous AND operation) is near zero.
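The reciprocal divide replaces a division a / k with a multiply and a
shift, using a precomputed R = ceil(2^32 / k). A userspace sketch of
the scheme (my restatement of lib/reciprocal_div.c, not the patch
itself):

#include <stdint.h>

/* Precompute R once per divisor k; then a / k ~= (a * R) >> 32,
 * off by at most a little rounding (fine when dividing a random
 * number), at the cost of one 64-bit multiply instead of a divide.
 */
static uint32_t reciprocal_value(uint32_t k)
{
	uint64_t val = (1ULL << 32) + (k - 1);
	return (uint32_t)(val / k);
}

static uint32_t reciprocal_divide(uint32_t a, uint32_t r)
{
	return (uint32_t)(((uint64_t)a * r) >> 32);
}

In red_random() the divisor is max_P / qth_delta, so
reciprocal_divide(net_random(), max_P_reciprocal) yields a random
number in the (0..1/max_P)*(qth_max-qth_min) interval described in
the updated red.h comment.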
The dump operation reports the current max_p value in a new
TCA_RED_MAX_P attribute.
Example on a 10Mbit link:
tc qdisc add dev $DEV parent 1:1 handle 10: est 1sec 8sec red \
limit 400000 min 30000 max 90000 avpkt 1000 \
burst 55 ecn adaptative bandwidth 10Mbit
# tc -s -d qdisc show dev eth3
...
qdisc red 10: parent 1:1 limit 400000b min 30000b max 90000b ecn
adaptative ewma 5 max_p=0.113335 Scell_log 15
Sent 50414282 bytes 34504 pkt (dropped 35, overlimits 1392 requeues 0)
rate 9749Kbit 831pps backlog 72056b 16p requeues 0
marked 1357 early 35 pdrop 0 other 0
Signed-off-by: Eric Dumazet <[email protected]>
---
include/linux/pkt_sched.h | 6 +-
include/net/red.h | 101 +++++++++++++++++++++++++++++-------
lib/reciprocal_div.c | 2
net/sched/sch_red.c | 21 +++++++
4 files changed, 111 insertions(+), 19 deletions(-)
diff --git a/include/linux/pkt_sched.h b/include/linux/pkt_sched.h
index fb556dc..e41e0d4 100644
--- a/include/linux/pkt_sched.h
+++ b/include/linux/pkt_sched.h
@@ -181,6 +181,7 @@ enum {
TCA_RED_UNSPEC,
TCA_RED_PARMS,
TCA_RED_STAB,
+ TCA_RED_MAX_P,
__TCA_RED_MAX,
};
@@ -194,8 +195,9 @@ struct tc_red_qopt {
unsigned char Plog; /* log(P_max/(qth_max-qth_min)) */
unsigned char Scell_log; /* cell size for idle damping */
unsigned char flags;
-#define TC_RED_ECN 1
-#define TC_RED_HARDDROP 2
+#define TC_RED_ECN 1
+#define TC_RED_HARDDROP 2
+#define TC_RED_ADAPTATIVE 4
};
struct tc_red_xstats {
diff --git a/include/net/red.h b/include/net/red.h
index b72a3b8..f4e9533 100644
--- a/include/net/red.h
+++ b/include/net/red.h
@@ -5,6 +5,7 @@
#include <net/pkt_sched.h>
#include <net/inet_ecn.h>
#include <net/dsfield.h>
+#include <linux/reciprocal_div.h>
/* Random Early Detection (RED) algorithm.
=======================================
@@ -87,6 +88,29 @@
etc.
*/
+/*
+ * Adaptative RED : An Algorithm for Increasing the Robustness of RED's AQM
+ * (Sally Floyd, Ramakrishna Gummadi, and Scott Shenker) August 2001
+ *
+ * Every 500 ms:
+ * if (avg > target and max_p <= 0.5)
+ * increase max_p : max_p += alpha;
+ * else if (avg < target and max_p >= 0.01)
+ * decrease max_p : max_p *= beta;
+ *
+ * target :[qth_min + 0.4*(qth_max - qth_min),
+ *          qth_min + 0.6*(qth_max - qth_min)].
+ * alpha : min(0.01, max_p / 4)
+ * beta : 0.9
+ * max_P is a Q0.32 fixed point number (with 32 bits mantissa)
+ * max_P between 0.01 and 0.5 (1% - 50%) [ It's no longer a negative power of two ]
+ */
+#define RED_ONE_PERCENT ((u32)DIV_ROUND_CLOSEST(1ULL<<32, 100))
+
+#define MAX_P_MIN (1 * RED_ONE_PERCENT)
+#define MAX_P_MAX (50 * RED_ONE_PERCENT)
+#define MAX_P_ALPHA(val) min(MAX_P_MIN, val / 4)
+
#define RED_STAB_SIZE 256
#define RED_STAB_MASK (RED_STAB_SIZE - 1)
@@ -101,10 +125,14 @@ struct red_stats {
struct red_parms {
/* Parameters */
- u32 qth_min; /* Min avg length threshold: A scaled */
- u32 qth_max; /* Max avg length threshold: A scaled */
+ u32 qth_min; /* Min avg length threshold: Wlog scaled */
+ u32 qth_max; /* Max avg length threshold: Wlog scaled */
u32 Scell_max;
- u32 Rmask; /* Cached random mask, see red_rmask */
+ u32 max_P; /* probability, [0 .. 1.0] 32 scaled */
+ u32 max_P_reciprocal; /* reciprocal_value(max_P / qth_delta) */
+ u32 qth_delta; /* max_th - min_th */
+ u32 target_min; /* min_th + 0.4*(max_th - min_th) */
+ u32 target_max; /* min_th + 0.6*(max_th - min_th) */
u8 Scell_log;
u8 Wlog; /* log(W) */
u8 Plog; /* random number bits */
@@ -115,19 +143,22 @@ struct red_parms {
number generation */
u32 qR; /* Cached random number */
- unsigned long qavg; /* Average queue length: A scaled */
+ unsigned long qavg; /* Average queue length: Wlog scaled */
ktime_t qidlestart; /* Start of current idle period */
};
-static inline u32 red_rmask(u8 Plog)
+static inline u32 red_maxp(u8 Plog)
{
- return Plog < 32 ? ((1 << Plog) - 1) : ~0UL;
+ return Plog < 32 ? (~0U >> Plog) : ~0U;
}
+
static inline void red_set_parms(struct red_parms *p,
u32 qth_min, u32 qth_max, u8 Wlog, u8 Plog,
u8 Scell_log, u8 *stab)
{
+ int delta = qth_max - qth_min;
+
/* Reset average queue length, the value is strictly bound
* to the parameters below, reseting hurts a bit but leaving
* it might result in an unreasonable qavg for a while. --TGR
@@ -139,14 +170,29 @@ static inline void red_set_parms(struct red_parms *p,
p->qth_max = qth_max << Wlog;
p->Wlog = Wlog;
p->Plog = Plog;
- p->Rmask = red_rmask(Plog);
+ if (delta < 0)
+ delta = 1;
+ p->qth_delta = delta;
+ p->max_P = red_maxp(Plog);
+ p->max_P *= delta; /* max_P = (qth_max-qth_min)/2^Plog */
+
+ p->max_P_reciprocal = reciprocal_value(p->max_P / delta);
+
+ /* RED Adaptative target :
+ * [min_th + 0.4*(max_th - min_th),
+ *  min_th + 0.6*(max_th - min_th)].
+ */
+ delta /= 5;
+ p->target_min = qth_min + 2*delta;
+ p->target_max = qth_min + 3*delta;
+
p->Scell_log = Scell_log;
p->Scell_max = (255 << Scell_log);
memcpy(p->Stab, stab, sizeof(p->Stab));
}
-static inline int red_is_idling(struct red_parms *p)
+static inline int red_is_idling(const struct red_parms *p)
{
return p->qidlestart.tv64 != 0;
}
@@ -168,7 +214,7 @@ static inline void red_restart(struct red_parms *p)
p->qcount = -1;
}
-static inline unsigned long red_calc_qavg_from_idle_time(struct red_parms *p)
+static inline unsigned long red_calc_qavg_from_idle_time(const struct red_parms *p)
{
s64 delta = ktime_us_delta(ktime_get(), p->qidlestart);
long us_idle = min_t(s64, delta, p->Scell_max);
@@ -215,7 +261,7 @@ static inline unsigned long red_calc_qavg_from_idle_time(struct red_parms *p)
}
}
-static inline unsigned long red_calc_qavg_no_idle_time(struct red_parms *p,
+static inline unsigned long red_calc_qavg_no_idle_time(const struct red_parms *p,
unsigned int backlog)
{
/*
@@ -230,7 +276,7 @@ static inline unsigned long red_calc_qavg_no_idle_time(struct red_parms *p,
return p->qavg + (backlog - (p->qavg >> p->Wlog));
}
-static inline unsigned long red_calc_qavg(struct red_parms *p,
+static inline unsigned long red_calc_qavg(const struct red_parms *p,
unsigned int backlog)
{
if (!red_is_idling(p))
@@ -239,23 +285,24 @@ static inline unsigned long red_calc_qavg(struct red_parms *p,
return red_calc_qavg_from_idle_time(p);
}
-static inline u32 red_random(struct red_parms *p)
+
+static inline u32 red_random(const struct red_parms *p)
{
- return net_random() & p->Rmask;
+ return reciprocal_divide(net_random(), p->max_P_reciprocal);
}
-static inline int red_mark_probability(struct red_parms *p, unsigned long qavg)
+static inline int red_mark_probability(const struct red_parms *p, unsigned long qavg)
{
/* The formula used below causes questions.
- OK. qR is random number in the interval 0..Rmask
+ OK. qR is random number in the interval
+ (0..1/max_P)*(qth_max-qth_min)
i.e. 0..(2^Plog). If we used floating point
arithmetics, it would be: (2^Plog)*rnd_num,
where rnd_num is less 1.
Taking into account, that qavg have fixed
- point at Wlog, and Plog is related to max_P by
- max_P = (qth_max-qth_min)/2^Plog; two lines
+ point at Wlog, two lines
below have the following floating point equivalent:
max_P*(qavg - qth_min)/(qth_max-qth_min) < rnd/qcount
@@ -315,4 +362,24 @@ static inline int red_action(struct red_parms *p, unsigned long qavg)
return RED_DONT_MARK;
}
+static inline void red_adaptative_algo(struct red_parms *p)
+{
+ unsigned long qavg;
+ u32 max_p_delta;
+
+ qavg = p->qavg;
+ if (red_is_idling(p))
+ qavg = red_calc_qavg_from_idle_time(p);
+
+ /* p->qavg is fixed point number with point at Wlog */
+ qavg >>= p->Wlog;
+
+ if (qavg > p->target_max && p->max_P <= MAX_P_MAX)
+ p->max_P += MAX_P_ALPHA(p->max_P); /* maxp = maxp + alpha */
+ else if (qavg < p->target_min && p->max_P >= MAX_P_MIN)
+ p->max_P = (p->max_P/10)*9; /* maxp = maxp * Beta */
+
+ max_p_delta = DIV_ROUND_CLOSEST(p->max_P, p->qth_delta);
+ p->max_P_reciprocal = reciprocal_value(max_p_delta);
+}
#endif
diff --git a/lib/reciprocal_div.c b/lib/reciprocal_div.c
index 6a3bd48..75510e9 100644
--- a/lib/reciprocal_div.c
+++ b/lib/reciprocal_div.c
@@ -1,5 +1,6 @@
#include <asm/div64.h>
#include <linux/reciprocal_div.h>
+#include <linux/export.h>
u32 reciprocal_value(u32 k)
{
@@ -7,3 +8,4 @@ u32 reciprocal_value(u32 k)
do_div(val, k);
return (u32)val;
}
+EXPORT_SYMBOL(reciprocal_value);
diff --git a/net/sched/sch_red.c b/net/sched/sch_red.c
index d617161..8f5a85b 100644
--- a/net/sched/sch_red.c
+++ b/net/sched/sch_red.c
@@ -39,6 +39,7 @@
struct red_sched_data {
u32 limit; /* HARD maximal queue length */
unsigned char flags;
+ struct timer_list adapt_timer;
struct red_parms parms;
struct red_stats stats;
struct Qdisc *qdisc;
@@ -161,6 +162,8 @@ static void red_reset(struct Qdisc *sch)
static void red_destroy(struct Qdisc *sch)
{
struct red_sched_data *q = qdisc_priv(sch);
+
+ del_timer_sync(&q->adapt_timer);
qdisc_destroy(q->qdisc);
}
@@ -209,6 +212,10 @@ static int red_change(struct Qdisc *sch, struct nlattr *opt)
ctl->Plog, ctl->Scell_log,
nla_data(tb[TCA_RED_STAB]));
+ del_timer(&q->adapt_timer);
+ if (ctl->flags & TC_RED_ADAPTATIVE)
+ mod_timer(&q->adapt_timer, jiffies + HZ/2);
+
if (!q->qdisc->q.qlen)
red_start_of_idle_period(&q->parms);
@@ -216,11 +223,24 @@ static int red_change(struct Qdisc *sch, struct nlattr *opt)
return 0;
}
+static inline void red_adaptative_timer(unsigned long arg)
+{
+ struct Qdisc *sch = (struct Qdisc *)arg;
+ struct red_sched_data *q = qdisc_priv(sch);
+ spinlock_t *root_lock = qdisc_lock(qdisc_root_sleeping(sch));
+
+ spin_lock(root_lock);
+ red_adaptative_algo(&q->parms);
+ mod_timer(&q->adapt_timer, jiffies + HZ/2);
+ spin_unlock(root_lock);
+}
+
static int red_init(struct Qdisc *sch, struct nlattr *opt)
{
struct red_sched_data *q = qdisc_priv(sch);
q->qdisc = &noop_qdisc;
+ setup_timer(&q->adapt_timer, red_adaptative_timer, (unsigned long)sch);
return red_change(sch, opt);
}
@@ -243,6 +263,7 @@ static int red_dump(struct Qdisc *sch, struct sk_buff *skb)
if (opts == NULL)
goto nla_put_failure;
NLA_PUT(skb, TCA_RED_PARMS, sizeof(opt), &opt);
+ NLA_PUT_U32(skb, TCA_RED_MAX_P, q->parms.max_P);
return nla_nest_end(skb, opts);
nla_put_failure:
On Thursday, December 8, 2011 at 09:21 -0800, Stephen Hemminger wrote:
> Is this backward compatible for users that don't specify
> an adaptive parameter?
Yes, completely compatible. I made sure max_p is initialized with the
value provided by tc (Plog-based).
I also plan to support TCA_RED_MAX_P at configuration time, with finer
granularity than 1/(2^Plog), so a new tc user can ask for a precise
probability instead of the current few discrete values. [ For
non-adaptative RED of course, since max_p can change very fast
otherwise :) ]
From: Eric Dumazet <[email protected]>
Date: Thu, 08 Dec 2011 17:06:03 +0100
> Adaptative RED AQM for linux, based on the paper by Sally Floyd,
> Ramakrishna Gummadi, and Scott Shenker, August 2001:
Applied.
On Monday, December 5, 2011 at 10:05 +0100, Dave Taht wrote:
> And Eric Dumazet also produced a preliminary patch a few weeks back
> that tied timestamping to before the head of a queue, but that tried to
> use a reserved field in the skb which, it appears, is not guaranteed to
> be preserved from point A to point Z.
It is guaranteed to be preserved, as it is part of skb->cb[], and
skb->cb[] content is private to each layer.
Here, the Qdisc layer.
Adding a time limit is possible, all we need is a proper design and
implementation :)
Here is my suggestion:

Design a new tfifo/tred qdisc, with the following properties:

Adaptative RED (ECN enabled + head drop), but instead of using a
bytes/packets qlen average, use a time-in-queue average.

A single mandatory parameter specifies the tmin value.
tmax would default to tmin*3 (so that the target avg is 2*tmin).
tlimit would default to tmax*8.

Adaptative RED dynamically adjusts max_p between 1% and 50%
[using a timer, every 500 ms, and AIMD].

tc qdisc add dev eth0 root tfifo tmin 1ms [tmax val] [tlimit val]

I volunteer to work on this new experimental tfifo qdisc :)

This can be used as a replacement for pfifo/bfifo in a complex qdisc
setup. As it has RED included, it can also replace RED.
Please note that once an skb leaves the Qdisc and is delivered to the
device, any time-limit feature must be handled in the driver itself
(if it can queue packets for a long time period).
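A rough, purely illustrative sketch of that dequeue-side check
(hypothetical tfifo_* names; byte/packet limits and the RED/ECN
averaging machinery omitted), storing the enqueue timestamp in
skb->cb[] as discussed:

struct tfifo_sched_data {
	s64	tlimit_us;	/* hard expiry; would default to tmax * 8 */
};

struct tfifo_skb_cb {
	ktime_t	enqueue_time;	/* stamped when the packet enters the qdisc */
};

static inline struct tfifo_skb_cb *tfifo_cb(struct sk_buff *skb)
{
	/* skb->cb[] is private to the Qdisc layer, as noted above */
	return (struct tfifo_skb_cb *)qdisc_skb_cb(skb)->data;
}

static int tfifo_enqueue(struct sk_buff *skb, struct Qdisc *sch)
{
	tfifo_cb(skb)->enqueue_time = ktime_get();	/* monotonic clock */
	return qdisc_enqueue_tail(skb, sch);
}

static struct sk_buff *tfifo_dequeue(struct Qdisc *sch)
{
	struct tfifo_sched_data *q = qdisc_priv(sch);
	struct sk_buff *skb;

	while ((skb = __skb_dequeue(&sch->q)) != NULL) {
		s64 sojourn = ktime_us_delta(ktime_get(),
					     tfifo_cb(skb)->enqueue_time);

		if (sojourn <= q->tlimit_us) {
			qdisc_bstats_update(sch, skb);
			return skb;	/* young enough: deliver it */
		}
		/* Expired: head drop, so fresher data survives. The
		 * RED mark/drop on sojourn time (instead of qlen)
		 * would also hook in here.
		 */
		sch->qstats.drops++;
		kfree_skb(skb);
	}
	return NULL;
}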
On Thu, 08 Dec 2011 17:06:03 +0100
Eric Dumazet <[email protected]> wrote:
> Changes against our RED implementation are:
> [...]
>
> Signed-off-by: Eric Dumazet <[email protected]>
Is this backward compatible for users that don't specify
an adaptive parameter?
On Wed, 07 Dec 2011 11:15:38 +0100, Hagen Paul Pfeifer <[email protected]>
wrote:
> Question two: I submitted pfast_head_drop to drop more outdated data
> instead of new data. At the time I thought TCP _may_ experience
> benefits because more up-to-date SACK data packets are saved. Are
> there any other TCP advantages with a head drop policy?

Small comment: pfast_head_drop was intended for UDP - e.g. OLSR
messages, regular GPS reporting, and the like. TCP was not the focus
at that time.
HGN
On Tue, Dec 6, 2011 at 3:03 AM, Adrian Chadd <[email protected]> wrote:
> Hi,
>
> For what it's worth, I've also been tinkering with time-in-queue for
> the wifi queue management in FreeBSD. I don't have anything public
> yet. What I have done actually seems to work quite well, when doing TX
> queue time-in-queue management. (I'm ignoring RX queue management for
> now.)
>
> Yes, I think a weighted random drop with both time-in-queue and queue
> depth would work well. I haven't sat down to model what it'd look like
> given some traffic profiles.
>
> I'll be sure to post some patches and results when I have them. :)
Well, if there is a way to add BQL sanely to the mac80211 or driver
layers, and then apply some time-in-queue techniques at some edge of
some layer(s) down there in Linux, I'd love to know what and where
is good.

It might be simpler to discuss design ideas etc. in a more generic
and less noisy forum than netdev and linux-wireless, at least for a
while; getting to actual patches seems kind of hard at this point.
(At least to me. For all I know Eric is finished already, and me, I
haven't finished grokking the paper he's leveraging some ideas from.)

The little I understand about Linux's networking stack dwarfs the
little I know about BSD's, so I look forward to hearing about your
results as you get them. That said, if you could provide some
pointers and insight into BSD's traffic-shaping methods over on
bloat-devel, we do try to be all-architecture-embracing in our
attempts to beat the bloat.
> Adrian
--
Dave Täht
SKYPE: davetaht
US Tel: 1-239-829-5608
FR Tel: 0638645374
http://www.bufferbloat.net
Hi,
For what it's worth, I've also been tinkering with time-in-queue for
the wifi queue management in FreeBSD. I don't have anything public
yet. What I have done actually seems to work quite well, when doing TX
queue time-in-queue management. (I'm ignoring RX queue management for
now.)
Yes, I think a weighted random drop with both time-in-queue and queue
depth would work well. I haven't sat down to model what it'd look like
given some traffic profiles.
I'll be sure to post some patches and results when I have them. :)
Adrian
On Wednesday, December 7, 2011 at 11:15 +0100, Hagen Paul Pfeifer wrote:
> On Mon, 05 Dec 2011 11:59:34 +0100, Eric Dumazet wrote:
>
>
> > Adding a time limit is possible, all we need is a proper design and
> > implementation :)
> >
> > Here is my suggestion:
> >
> > Design a new tfifo/tred qdisc, with the following properties:
> >
> > Adaptative RED (ECN enabled + head drop), but instead of using a
> > bytes/packets qlen average, use a time-in-queue average.
>
> Question one: is anything wrong with sfb or choke as the basis, instead of
> RED?
>
RED is the module that brings the EWMA stuff; it seems natural to
start with it. Please note that choke has a RED module too.

Then, later, we can add the time-limit stuff to other qdiscs if
needed; it's a plug anyway. But is there any meaning in computing a
global EWMA after SFB/SFQ packet classification?
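For reference, the EWMA in question is RED's average-queue estimator
(red_calc_qavg_no_idle_time() in the patch above), which in floating
point is:

	avg := (1 - 2^-Wlog) * avg + 2^-Wlog * backlog

The kernel keeps avg fixed-point, pre-shifted left by Wlog, so the
update reduces to a single add, subtract, and shift.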
> Question two: I submitted pfast_head_drop to drop more outdated data
> instead of new data. At the time I thought TCP _may_ experience
> benefits because more up-to-date SACK data packets are saved. Are
> there any other TCP advantages with a head drop policy?
>
Note that head drop is a consequence of the time-limit idea on top of
a FIFO: only at dequeue time do we compute the delta between the
current time and the enqueue time, and we drop/mark the (head) packet
if that time exceeds our current limit.

In general, being able to drop/mark the first packets in the queue
instead of the last ones lets the TCP sender be notified of congestion
much earlier than with a tail drop. (We gain the time needed to
transmit the whole queue of packets before the receiver can report the
congestion back to the sender in its ACK.)
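As a rough worked example, using the figures from the 10Mbit test
above (backlog 72056b):

	72056 bytes * 8 = 576448 bits
	576448 bits / 10 Mbit/s ~= 58 ms

so a mark/drop at the head of that queue reaches the sender roughly
58 ms sooner than one at the tail would.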
On Mon, 05 Dec 2011 11:59:34 +0100, Eric Dumazet wrote:
> Adding a time limit is possible, all we need is a proper design and
> implementation :)
>
> Here is my suggestion:
>
> Design a new tfifo/tred qdisc, with the following properties:
>
> Adaptative RED (ECN enabled + head drop), but instead of using a
> bytes/packets qlen average, use a time-in-queue average.
Question one: is anything wrong with sfb or choke as the basis, instead of
RED?
Question two: I submitted pfast_head_drop to drop more outdated data
instead of new data. At the time I thought TCP _may_ experience
benefits because more up-to-date SACK data packets are saved. Are
there any other TCP advantages with a head drop policy?
Hagen