2024-04-08 17:26:25

by Breno Leitao

[permalink] [raw]
Subject: [PATCH v2 0/4] net : dqs: optimize if stall threshold is not set

Here are four patches aimed at enhancing the Dynamic Queue Limit (DQL)
subsystem within the networking stack.

The first two commits involve code refactoring, while the third patch
introduces the actual change. The fourth patch just improves the cache
locality.

Typically, when DQL is enabled, stall information is always populated
through dql_queue_stall(). However, this information is only necessary
if a stall threshold is set, which is stored in struct dql->stall_thrs.

Although dql_queue_stall() is relatively inexpensive, it is not entirely
free due to memory barriers and similar overheads.

To optimize performance, refrain from calling dql_queue_stall() when no
stall threshold is set, thus avoiding the processing of unnecessary
information.

Changelog:

v1:
* https://lore.kernel.org/all/[email protected]/
v2:
* Moved the stall_thrs to the very first cache line, as a new
patch. Suggested by Eric Dumazet.


Breno Leitao (4):
net: dql: Avoid calling BUG() when WARN() is enough
net: dql: Separate queue function responsibilities
net: dql: Optimize stall information population
net: dqs: make struct dql more cache efficient

include/linux/dynamic_queue_limits.h | 50 +++++++++++++++++-----------
1 file changed, 30 insertions(+), 20 deletions(-)

--
2.43.0



2024-04-08 17:26:38

by Breno Leitao

[permalink] [raw]
Subject: [PATCH net-next v2 1/4] net: dql: Avoid calling BUG() when WARN() is enough

If the dql_queued() function receives an invalid argument, WARN about it
and continue, instead of crashing the kernel.

This was raised by checkpatch, when I am refactoring this code (see
following patch/commit)

WARNING: Do not crash the kernel unless it is absolutely unavoidable--use WARN_ON_ONCE() plus recovery code (if feasible) instead of BUG() or variants

Signed-off-by: Breno Leitao <[email protected]>
---
include/linux/dynamic_queue_limits.h | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/dynamic_queue_limits.h b/include/linux/dynamic_queue_limits.h
index 5693a4be0d9a..ff9c65841ae8 100644
--- a/include/linux/dynamic_queue_limits.h
+++ b/include/linux/dynamic_queue_limits.h
@@ -91,7 +91,8 @@ static inline void dql_queued(struct dql *dql, unsigned int count)
{
unsigned long map, now, now_hi, i;

- BUG_ON(count > DQL_MAX_OBJECT);
+ if (WARN_ON_ONCE(count > DQL_MAX_OBJECT))
+ return;

dql->last_obj_cnt = count;

--
2.43.0


2024-04-08 17:26:53

by Breno Leitao

[permalink] [raw]
Subject: [PATCH net-next v2 2/4] net: dql: Separate queue function responsibilities

The dql_queued() function currently handles both queuing object counts
and populating bitmaps for reporting stalls.

This commit splits the bitmap population into a separate function,
allowing for conditional invocation in scenarios where the feature is
disabled.

This refactor maintains functionality while improving code
organization.

Signed-off-by: Breno Leitao <[email protected]>
---
include/linux/dynamic_queue_limits.h | 44 ++++++++++++++++------------
1 file changed, 25 insertions(+), 19 deletions(-)

diff --git a/include/linux/dynamic_queue_limits.h b/include/linux/dynamic_queue_limits.h
index ff9c65841ae8..9980df0b7247 100644
--- a/include/linux/dynamic_queue_limits.h
+++ b/include/linux/dynamic_queue_limits.h
@@ -83,28 +83,11 @@ struct dql {
#define DQL_MAX_OBJECT (UINT_MAX / 16)
#define DQL_MAX_LIMIT ((UINT_MAX / 2) - DQL_MAX_OBJECT)

-/*
- * Record number of objects queued. Assumes that caller has already checked
- * availability in the queue with dql_avail.
- */
-static inline void dql_queued(struct dql *dql, unsigned int count)
+/* Populate the bitmap to be processed later in dql_check_stall() */
+static inline void dql_queue_stall(struct dql *dql)
{
unsigned long map, now, now_hi, i;

- if (WARN_ON_ONCE(count > DQL_MAX_OBJECT))
- return;
-
- dql->last_obj_cnt = count;
-
- /* We want to force a write first, so that cpu do not attempt
- * to get cache line containing last_obj_cnt, num_queued, adj_limit
- * in Shared state, but directly does a Request For Ownership
- * It is only a hint, we use barrier() only.
- */
- barrier();
-
- dql->num_queued += count;
-
now = jiffies;
now_hi = now / BITS_PER_LONG;

@@ -134,6 +117,29 @@ static inline void dql_queued(struct dql *dql, unsigned int count)
WRITE_ONCE(DQL_HIST_ENT(dql, now_hi), map | BIT_MASK(now));
}

+/*
+ * Record number of objects queued. Assumes that caller has already checked
+ * availability in the queue with dql_avail.
+ */
+static inline void dql_queued(struct dql *dql, unsigned int count)
+{
+ if (WARN_ON_ONCE(count > DQL_MAX_OBJECT))
+ return;
+
+ dql->last_obj_cnt = count;
+
+ /* We want to force a write first, so that cpu do not attempt
+ * to get cache line containing last_obj_cnt, num_queued, adj_limit
+ * in Shared state, but directly does a Request For Ownership
+ * It is only a hint, we use barrier() only.
+ */
+ barrier();
+
+ dql->num_queued += count;
+
+ dql_queue_stall(dql);
+}
+
/* Returns how many objects can be queued, < 0 indicates over limit. */
static inline int dql_avail(const struct dql *dql)
{
--
2.43.0


2024-04-08 17:27:08

by Breno Leitao

[permalink] [raw]
Subject: [PATCH net-next v2 3/4] net: dql: Optimize stall information population

When Dynamic Queue Limit (DQL) is set, it always populate stall
information through dql_queue_stall(). However, this information is
only necessary if a stall threshold is set, stored in struct
dql->stall_thrs.

dql_queue_stall() is cheap, but not free, since it does have memory
barriers and so forth.

Do not call dql_queue_stall() if there is no stall threshold set, and
save some CPU cycles.

Signed-off-by: Breno Leitao <[email protected]>
---
include/linux/dynamic_queue_limits.h | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/dynamic_queue_limits.h b/include/linux/dynamic_queue_limits.h
index 9980df0b7247..869afb800ea1 100644
--- a/include/linux/dynamic_queue_limits.h
+++ b/include/linux/dynamic_queue_limits.h
@@ -137,7 +137,9 @@ static inline void dql_queued(struct dql *dql, unsigned int count)

dql->num_queued += count;

- dql_queue_stall(dql);
+ /* Only populate stall information if the threshold is set */
+ if (READ_ONCE(dql->stall_thrs))
+ dql_queue_stall(dql);
}

/* Returns how many objects can be queued, < 0 indicates over limit. */
--
2.43.0


2024-04-08 17:27:22

by Breno Leitao

[permalink] [raw]
Subject: [PATCH net-next v2 4/4] net: dqs: make struct dql more cache efficient

With the previous change, struct dqs->stall_thrs will be in the hot path
(at queue side), even if DQS is disabled.

The other fields accessed in this function (last_obj_cnt and num_queued)
are in the first cache line, let's move this field (stall_thrs) to the
very first cache line, since there is a hole there.

This does not change the structure size, since it moves an short (2
bytes) to 4-bytes whole in the first cache line.

This is the new structure format now:

struct dql {
unsigned int num_queued; /* Total ever queued */
unsigned int last_obj_cnt; /* Count at last queuing */
..
short unsigned int stall_thrs; /* 12 2 */
/* XXX 2 bytes hole, try to pack */
..
/* --- cacheline 1 boundary (64 bytes) --- */
..
/* Longest stall detected, reported to user */
short unsigned int stall_max; /* 100 2 */
/* XXX 2 bytes hole, try to pack */
};

Suggested-by: Eric Dumazet <[email protected]>
Signed-off-by: Breno Leitao <[email protected]>
---
include/linux/dynamic_queue_limits.h | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/include/linux/dynamic_queue_limits.h b/include/linux/dynamic_queue_limits.h
index 869afb800ea1..281298e77a15 100644
--- a/include/linux/dynamic_queue_limits.h
+++ b/include/linux/dynamic_queue_limits.h
@@ -50,6 +50,9 @@ struct dql {
unsigned int adj_limit; /* limit + num_completed */
unsigned int last_obj_cnt; /* Count at last queuing */

+ /* Stall threshold (in jiffies), defined by user */
+ unsigned short stall_thrs;
+
unsigned long history_head; /* top 58 bits of jiffies */
/* stall entries, a bit per entry */
unsigned long history[DQL_HIST_LEN];
@@ -71,8 +74,6 @@ struct dql {
unsigned int min_limit; /* Minimum limit */
unsigned int slack_hold_time; /* Time to measure slack */

- /* Stall threshold (in jiffies), defined by user */
- unsigned short stall_thrs;
/* Longest stall detected, reported to user */
unsigned short stall_max;
unsigned long last_reap; /* Last reap (in jiffies) */
--
2.43.0


2024-04-10 00:22:03

by Jakub Kicinski

[permalink] [raw]
Subject: Re: [PATCH net-next v2 4/4] net: dqs: make struct dql more cache efficient

On Mon, 8 Apr 2024 10:25:56 -0700 Breno Leitao wrote:
> With the previous change, struct dqs->stall_thrs will be in the hot path
> (at queue side), even if DQS is disabled.
>
> The other fields accessed in this function (last_obj_cnt and num_queued)
> are in the first cache line, let's move this field (stall_thrs) to the
> very first cache line, since there is a hole there.
>
> This does not change the structure size, since it moves an short (2
> bytes) to 4-bytes whole in the first cache line.

Doesn't this move the cache line bouncing problem to the other side?
Eric said "copy" I read that as "have two fields with the same value".
I think it's single digit number of alu instructions we'd be saving
here, not super convinced patch 3 is the right trade off...

2024-04-10 14:02:07

by Breno Leitao

[permalink] [raw]
Subject: Re: [PATCH net-next v2 4/4] net: dqs: make struct dql more cache efficient

On Tue, Apr 09, 2024 at 05:21:49PM -0700, Jakub Kicinski wrote:
> On Mon, 8 Apr 2024 10:25:56 -0700 Breno Leitao wrote:
> > With the previous change, struct dqs->stall_thrs will be in the hot path
> > (at queue side), even if DQS is disabled.
> >
> > The other fields accessed in this function (last_obj_cnt and num_queued)
> > are in the first cache line, let's move this field (stall_thrs) to the
> > very first cache line, since there is a hole there.
> >
> > This does not change the structure size, since it moves an short (2
> > bytes) to 4-bytes whole in the first cache line.
>
> Doesn't this move the cache line bouncing problem to the other side?

I think so. Looking at dql_check_stall(), it only uses fields in the
second cache line, except now 'dql->stall_thrs' that is in the first
cache line.

> Eric said "copy" I read that as "have two fields with the same value".

Sorry, I misunderstood it. I can create two fields, and update them
together at the only place where they will be updated
(bql_set_stall_thrs).

> I think it's single digit number of alu instructions we'd be saving
> here, not super convinced patch 3 is the right trade off...

Right. I was more concerned about the write barriers (smp_wmb()) inside
the loop which happen quite frequently.

But, if you think that the this is not the right approach, I can drop
this whole patchset. Do you think a profiler will us here?

Thanks!

2024-04-11 01:45:59

by Jakub Kicinski

[permalink] [raw]
Subject: Re: [PATCH net-next v2 4/4] net: dqs: make struct dql more cache efficient

On Wed, 10 Apr 2024 06:52:56 -0700 Breno Leitao wrote:
> > Doesn't this move the cache line bouncing problem to the other side?
>
> I think so. Looking at dql_check_stall(), it only uses fields in the
> second cache line, except now 'dql->stall_thrs' that is in the first
> cache line.

We do read num_queued at the beginning of dql_completed().
So maybe we we move the read of the threshold there we will be fine.