2019-11-14 11:44:37

by Jose Abreu

[permalink] [raw]
Subject: [PATCH v2 net-next 0/7] net: stmmac: CPU Performance Improvements

CPU Performance improvements for stmmac. Please check bellow for results
before and after the series.

Patch 1/7, allows RX Interrupt on Completion to be disabled and only use the
RX HW Watchdog.

Patch 2/7, setups the default RX coalesce settings instead of using the
minimum value.

Patch 3/7 and 4/7, removes the uneeded computations for RX Flow Control
activation/de-activation, on some cases.

Patch 5/7, tunes-up the default coalesce settings.

Patch 6/7, re-works the TX coalesce timer activation logic.

Patch 7/7, removes the now uneeded TBU interrupt.

NetPerf UDP Results:
--------------------

Socket Message Elapsed Messages CPU Service
Size Size Time Okay Errors Throughput Util Demand
bytes bytes secs # # 10^6bits/sec % SS us/KB
--- [email protected]: Before
212992 1400 10.00 2100620 0 2351.7 36.69 5.112
212992 10.00 2100539 2351.6 26.18 3.648
--- [email protected]: After
212992 1400 10.00 2108972 0 2361.5 21.73 3.015
212992 10.00 2097038 2348.1 19.21 2.666

--- GMAC5@1G: Before
212992 1400 10.00 786000 0 880.2 34.71 12.923
212992 10.00 786000 880.2 23.42 8.719
--- GMAC5@1G: After
212992 1400 10.00 842648 0 943.7 14.12 4.903
212992 10.00 842648 943.7 12.73 4.418


Perf TCP Results on RX Path:
----------------------------
--- [email protected]: Before
22.51% swapper [stmmac] [k] dwxgmac2_dma_interrupt
10.82% swapper [stmmac] [k] dwxgmac2_host_mtl_irq_status
5.21% swapper [stmmac] [k] dwxgmac2_host_irq_status
4.67% swapper [stmmac] [k] dwxgmac3_safety_feat_irq_status
3.63% swapper [kernel.kallsyms] [k] stack_trace_consume_entry
2.74% iperf3 [kernel.kallsyms] [k] copy_user_enhanced_fast_string
2.52% swapper [kernel.kallsyms] [k] update_stack_state
1.94% ksoftirqd/0 [stmmac] [k] dwxgmac2_dma_interrupt
1.45% iperf3 [kernel.kallsyms] [k] queued_spin_lock_slowpath
1.26% swapper [kernel.kallsyms] [k] create_object
--- [email protected]: After
7.43% swapper [kernel.kallsyms] [k] stack_trace_consume_entry
5.86% swapper [stmmac] [k] dwxgmac2_dma_interrupt
5.68% swapper [kernel.kallsyms] [k] update_stack_state
4.71% iperf3 [kernel.kallsyms] [k] copy_user_enhanced_fast_string
2.88% swapper [kernel.kallsyms] [k] create_object
2.69% swapper [stmmac] [k] dwxgmac2_host_mtl_irq_status
2.61% swapper [stmmac] [k] stmmac_napi_poll_rx
2.52% swapper [kernel.kallsyms] [k] unwind_next_frame.part.4
1.48% swapper [kernel.kallsyms] [k] unwind_get_return_address
1.38% swapper [kernel.kallsyms] [k] arch_stack_walk

--- GMAC5@1G: Before
31.29% swapper [stmmac] [k] dwmac4_dma_interrupt
14.57% swapper [stmmac] [k] dwmac4_irq_mtl_status
10.66% swapper [stmmac] [k] dwmac4_irq_status
1.97% swapper [kernel.kallsyms] [k] stack_trace_consume_entry
1.73% iperf3 [kernel.kallsyms] [k] copy_user_enhanced_fast_string
1.59% swapper [kernel.kallsyms] [k] update_stack_state
1.15% iperf3 [kernel.kallsyms] [k] do_syscall_64
1.01% ksoftirqd/0 [stmmac] [k] dwmac4_dma_interrupt
0.89% swapper [kernel.kallsyms] [k] __default_send_IPI_dest_field
0.75% swapper [stmmac] [k] stmmac_napi_poll_rx
--- GMAC5@1G: After
6.70% swapper [kernel.kallsyms] [k] stack_trace_consume_entry
5.79% swapper [stmmac] [k] dwmac4_dma_interrupt
5.29% swapper [kernel.kallsyms] [k] update_stack_state
3.52% iperf3 [kernel.kallsyms] [k] copy_user_enhanced_fast_string
2.83% swapper [stmmac] [k] dwmac4_irq_mtl_status
2.62% swapper [kernel.kallsyms] [k] create_object
2.46% swapper [stmmac] [k] stmmac_napi_poll_rx
2.32% swapper [kernel.kallsyms] [k] unwind_next_frame.part.4
2.19% swapper [stmmac] [k] dwmac4_irq_status
1.39% swapper [kernel.kallsyms] [k] unwind_get_return_address

---
Cc: Giuseppe Cavallaro <[email protected]>
Cc: Alexandre Torgue <[email protected]>
Cc: Jose Abreu <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Maxime Coquelin <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---

Jose Abreu (7):
net: stmmac: Do not set RX IC bit if RX Coalesce is zero
net: stmmac: Setup a default RX Coalesce value instead of the minimum
net: stmmac: gmac4+: Remove uneeded computation for RFA/RFD
net: stmmac: xgmac: Remove uneeded computation for RFA/RFD
net: stmmac: Tune-up default coalesce settings
net: stmmac: Rework TX Coalesce logic
net: stmmac: xgmac: Do not enable TBU interrupt

drivers/net/ethernet/stmicro/stmmac/common.h | 5 +-
drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c | 14 +---
drivers/net/ethernet/stmicro/stmmac/dwxgmac2.h | 2 +-
drivers/net/ethernet/stmicro/stmmac/dwxgmac2_dma.c | 14 +---
drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 74 +++++++++++++++-------
5 files changed, 59 insertions(+), 50 deletions(-)

--
2.7.4


2019-11-14 11:44:59

by Jose Abreu

[permalink] [raw]
Subject: [PATCH v2 net-next 5/7] net: stmmac: Tune-up default coalesce settings

Tune-up the defalt coalesce settings for optimal values. This gives the
best performance in most of the use-cases.

Signed-off-by: Jose Abreu <[email protected]>

---
Cc: Giuseppe Cavallaro <[email protected]>
Cc: Alexandre Torgue <[email protected]>
Cc: Jose Abreu <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Maxime Coquelin <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
drivers/net/ethernet/stmicro/stmmac/common.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/common.h b/drivers/net/ethernet/stmicro/stmmac/common.h
index 309ea12ea61f..b210e987a1db 100644
--- a/drivers/net/ethernet/stmicro/stmmac/common.h
+++ b/drivers/net/ethernet/stmicro/stmmac/common.h
@@ -253,8 +253,8 @@ struct stmmac_safety_stats {
#define STMMAC_COAL_TX_TIMER 1000
#define STMMAC_MAX_COAL_TX_TICK 100000
#define STMMAC_TX_MAX_FRAMES 256
-#define STMMAC_TX_FRAMES 1
-#define STMMAC_RX_FRAMES 25
+#define STMMAC_TX_FRAMES 25
+#define STMMAC_RX_FRAMES 0

/* Packets types */
enum packets_types {
--
2.7.4

2019-11-14 11:45:19

by Jose Abreu

[permalink] [raw]
Subject: [PATCH v2 net-next 6/7] net: stmmac: Rework TX Coalesce logic

Coalesce logic currently increments the number of packets and sets the
IC bit when the coalesced packets have passed a given limit. This does
not reflect very well what coalesce was meant for as we can have a large
number of packets that are coalesced and then a single one, sent later
on that has the IC bit.

Rework the logic so that it coalesces only upon a limit of packets and
sets the IC bit for large number of packets.

Signed-off-by: Jose Abreu <[email protected]>

---
Cc: Giuseppe Cavallaro <[email protected]>
Cc: Alexandre Torgue <[email protected]>
Cc: Jose Abreu <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Maxime Coquelin <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 61 ++++++++++++++++-------
1 file changed, 42 insertions(+), 19 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index 400fbb727fd5..4ba250a9008f 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -2916,16 +2916,17 @@ static netdev_tx_t stmmac_tso_xmit(struct sk_buff *skb, struct net_device *dev)
struct stmmac_priv *priv = netdev_priv(dev);
int nfrags = skb_shinfo(skb)->nr_frags;
u32 queue = skb_get_queue_mapping(skb);
+ unsigned int first_entry, tx_packets;
+ int tmp_pay_len = 0, first_tx;
struct stmmac_tx_queue *tx_q;
- unsigned int first_entry;
u8 proto_hdr_len, hdr;
- int tmp_pay_len = 0;
+ bool has_vlan, set_ic;
u32 pay_len, mss;
dma_addr_t des;
- bool has_vlan;
int i;

tx_q = &priv->tx_queue[queue];
+ first_tx = tx_q->cur_tx;

/* Compute header lengths */
if (skb_shinfo(skb)->gso_type & SKB_GSO_UDP_L4) {
@@ -3033,16 +3034,27 @@ static netdev_tx_t stmmac_tso_xmit(struct sk_buff *skb, struct net_device *dev)
tx_q->tx_skbuff[tx_q->cur_tx] = skb;

/* Manage tx mitigation */
- tx_q->tx_count_frames += nfrags + 1;
- if (likely(priv->tx_coal_frames > tx_q->tx_count_frames) &&
- !((skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP) &&
- priv->hwts_tx_en)) {
- stmmac_tx_timer_arm(priv, queue);
- } else {
+ tx_packets = (tx_q->cur_tx + 1) - first_tx;
+ tx_q->tx_count_frames += tx_packets;
+
+ if ((skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP) && priv->hwts_tx_en)
+ set_ic = true;
+ else if (!priv->tx_coal_frames)
+ set_ic = false;
+ else if (tx_packets > priv->tx_coal_frames)
+ set_ic = true;
+ else if ((tx_q->tx_count_frames % priv->tx_coal_frames) < tx_packets)
+ set_ic = true;
+ else
+ set_ic = false;
+
+ if (set_ic) {
desc = &tx_q->dma_tx[tx_q->cur_tx];
tx_q->tx_count_frames = 0;
stmmac_set_tx_ic(priv, desc);
priv->xstats.tx_set_ic_bit++;
+ } else {
+ stmmac_tx_timer_arm(priv, queue);
}

/* We've used all descriptors we need for this skb, however,
@@ -3133,6 +3145,7 @@ static netdev_tx_t stmmac_tso_xmit(struct sk_buff *skb, struct net_device *dev)
*/
static netdev_tx_t stmmac_xmit(struct sk_buff *skb, struct net_device *dev)
{
+ unsigned int first_entry, tx_packets, enh_desc;
struct stmmac_priv *priv = netdev_priv(dev);
unsigned int nopaged_len = skb_headlen(skb);
int i, csum_insertion = 0, is_jumbo = 0;
@@ -3141,13 +3154,12 @@ static netdev_tx_t stmmac_xmit(struct sk_buff *skb, struct net_device *dev)
int gso = skb_shinfo(skb)->gso_type;
struct dma_desc *desc, *first;
struct stmmac_tx_queue *tx_q;
- unsigned int first_entry;
- unsigned int enh_desc;
+ bool has_vlan, set_ic;
+ int entry, first_tx;
dma_addr_t des;
- bool has_vlan;
- int entry;

tx_q = &priv->tx_queue[queue];
+ first_tx = tx_q->cur_tx;

if (priv->tx_path_in_lpi_mode)
stmmac_disable_eee_mode(priv);
@@ -3241,12 +3253,21 @@ static netdev_tx_t stmmac_xmit(struct sk_buff *skb, struct net_device *dev)
* This approach takes care about the fragments: desc is the first
* element in case of no SG.
*/
- tx_q->tx_count_frames += nfrags + 1;
- if (likely(priv->tx_coal_frames > tx_q->tx_count_frames) &&
- !((skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP) &&
- priv->hwts_tx_en)) {
- stmmac_tx_timer_arm(priv, queue);
- } else {
+ tx_packets = (entry + 1) - first_tx;
+ tx_q->tx_count_frames += tx_packets;
+
+ if ((skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP) && priv->hwts_tx_en)
+ set_ic = true;
+ else if (!priv->tx_coal_frames)
+ set_ic = false;
+ else if (tx_packets > priv->tx_coal_frames)
+ set_ic = true;
+ else if ((tx_q->tx_count_frames % priv->tx_coal_frames) < tx_packets)
+ set_ic = true;
+ else
+ set_ic = false;
+
+ if (set_ic) {
if (likely(priv->extend_desc))
desc = &tx_q->dma_etx[entry].basic;
else
@@ -3255,6 +3276,8 @@ static netdev_tx_t stmmac_xmit(struct sk_buff *skb, struct net_device *dev)
tx_q->tx_count_frames = 0;
stmmac_set_tx_ic(priv, desc);
priv->xstats.tx_set_ic_bit++;
+ } else {
+ stmmac_tx_timer_arm(priv, queue);
}

/* We've used all descriptors we need for this skb, however,
--
2.7.4

2019-11-14 11:46:05

by Jose Abreu

[permalink] [raw]
Subject: [PATCH v2 net-next 1/7] net: stmmac: Do not set RX IC bit if RX Coalesce is zero

We may only want to use the RX Watchdog so lets check if RX Coalesce
settings are non-zero and only set the RX Interrupt on Completion bit if
its not.

Signed-off-by: Jose Abreu <[email protected]>

---
Cc: Giuseppe Cavallaro <[email protected]>
Cc: Alexandre Torgue <[email protected]>
Cc: Jose Abreu <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Maxime Coquelin <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index 39b4efd521f9..7939ef7e23b7 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -3440,7 +3440,11 @@ static inline void stmmac_rx_refill(struct stmmac_priv *priv, u32 queue)
rx_q->rx_count_frames += priv->rx_coal_frames;
if (rx_q->rx_count_frames > priv->rx_coal_frames)
rx_q->rx_count_frames = 0;
- use_rx_wd = priv->use_riwt && rx_q->rx_count_frames;
+
+ use_rx_wd = !priv->rx_coal_frames;
+ use_rx_wd |= rx_q->rx_count_frames > 0;
+ if (!priv->use_riwt)
+ use_rx_wd = false;

dma_wmb();
stmmac_set_rx_owner(priv, p, use_rx_wd);
--
2.7.4

2019-11-15 20:31:48

by David Miller

[permalink] [raw]
Subject: Re: [PATCH v2 net-next 0/7] net: stmmac: CPU Performance Improvements

From: Jose Abreu <[email protected]>
Date: Thu, 14 Nov 2019 12:42:44 +0100

> CPU Performance improvements for stmmac. Please check bellow for results
> before and after the series.
...

Series applied, thanks.