2019-07-23 08:19:32

by Rasmus Villemoes

[permalink] [raw]
Subject: [PATCH net-next] net: dsa: mv88e6xxx: avoid some redundant vtu load/purge operations

We have an ERPS (Ethernet Ring Protection Switching) setup involving
mv88e6250 switches which we're in the process of switching to a BSP
based on the mainline driver. Breaking any link in the ring works as
expected, with the ring reconfiguring itself quickly and traffic
continuing with almost no noticable drops. However, when plugging back
the cable, we see 5+ second stalls.

This has been tracked down to the userspace application in charge of
the protocol missing a few CCM messages on the good link (the one that
was not unplugged), causing it to broadcast a "signal fail". That
message eventually reaches its link partner, which responds by
blocking the port. Meanwhile, the first node has continued to block
the port with the just plugged-in cable, breaking the network. And the
reason for those missing CCM messages has in turn been tracked down to
the VTU apparently being too busy servicing load/purge operations that
the normal lookups are delayed.

Initial state, the link between C and D is blocked in software.

_____________________
/ \
| |
A ----- B ----- C *---- D

Unplug the cable between C and D.

_____________________
/ \
| |
A ----- B ----- C * * D

Reestablish the link between C and D.
_____________________
/ \
| |
A ----- B ----- C *---- D

Somehow, enough VTU/ATU operations happen inside C that prevents
the application from receving the CCM messages from B in a timely
manner, so a Signal Fail message is sent by C. When B receives
that, it responds by blocking its port.

_____________________
/ \
| |
A ----- B *---* C *---- D

Very shortly after this, the signal fail condition clears on the
BC link (some CCM messages finally make it through), so C
unblocks the port. However, a guard timer inside B prevents it
from removing the blocking before 5 seconds have elapsed.

It is not unlikely that our userspace ERPS implementation could be
smarter and/or is simply buggy. However, this patch fixes the symptoms
we see, and is a small optimization that should not break anything
(knock wood). The idea is simply to avoid doing an VTU load of an
entry identical to the one already present. To do that, we need to
know whether mv88e6xxx_vtu_get() actually found an existing entry, or
has just prepared a struct mv88e6xxx_vtu_entry for us to load. To that
end, let vlan->valid be an output parameter. The other two callers of
mv88e6xxx_vtu_get() are not affected by this patch since they pass
new=false.

Signed-off-by: Rasmus Villemoes <[email protected]>
---
drivers/net/dsa/mv88e6xxx/chip.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index 6b17cd961d06..2e500428670c 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -1423,7 +1423,6 @@ static int mv88e6xxx_vtu_get(struct mv88e6xxx_chip *chip, u16 vid,

/* Initialize a fresh VLAN entry */
memset(entry, 0, sizeof(*entry));
- entry->valid = true;
entry->vid = vid;

/* Exclude all ports */
@@ -1618,6 +1617,9 @@ static int _mv88e6xxx_port_vlan_add(struct mv88e6xxx_chip *chip, int port,
if (err)
return err;

+ if (vlan.valid && vlan.member[port] == member)
+ return 0;
+ vlan.valid = true;
vlan.member[port] = member;

err = mv88e6xxx_vtu_loadpurge(chip, &vlan);
--
2.20.1


2019-07-24 00:05:06

by Rasmus Villemoes

[permalink] [raw]
Subject: Re: [PATCH net-next] net: dsa: mv88e6xxx: avoid some redundant vtu load/purge operations

On 23/07/2019 15.40, Andrew Lunn wrote:
> On Mon, Jul 22, 2019 at 11:37:26PM +0000, Rasmus Villemoes wrote:
>> We have an ERPS (Ethernet Ring Protection Switching) setup involving
>> mv88e6250 switches which we're in the process of switching to a BSP
>> based on the mainline driver. Breaking any link in the ring works as
>> expected, with the ring reconfiguring itself quickly and traffic
>> continuing with almost no noticable drops. However, when plugging back
>> the cable, we see 5+ second stalls.
>
> Hi Rasmus
>
> I would prefer Vivien reviews this patch. But he is away at the
> moment. Are you O.K. to wait a few days?

Sure, no rush.

Rasmus

2019-07-24 02:17:50

by Andrew Lunn

[permalink] [raw]
Subject: Re: [PATCH net-next] net: dsa: mv88e6xxx: avoid some redundant vtu load/purge operations

On Mon, Jul 22, 2019 at 11:37:26PM +0000, Rasmus Villemoes wrote:
> We have an ERPS (Ethernet Ring Protection Switching) setup involving
> mv88e6250 switches which we're in the process of switching to a BSP
> based on the mainline driver. Breaking any link in the ring works as
> expected, with the ring reconfiguring itself quickly and traffic
> continuing with almost no noticable drops. However, when plugging back
> the cable, we see 5+ second stalls.

Hi Rasmus

I would prefer Vivien reviews this patch. But he is away at the
moment. Are you O.K. to wait a few days?

Andrew

2019-07-27 20:23:09

by David Miller

[permalink] [raw]
Subject: Re: [PATCH net-next] net: dsa: mv88e6xxx: avoid some redundant vtu load/purge operations


Reminder that we need a review from Vivien on this.

Thanks.

2019-07-29 19:29:04

by David Miller

[permalink] [raw]
Subject: Re: [PATCH net-next] net: dsa: mv88e6xxx: avoid some redundant vtu load/purge operations


Andrew, I really can't wait forever to get this patch reviewed.

It's been an entire week already :-(

2019-07-30 03:20:27

by Vivien Didelot

[permalink] [raw]
Subject: Re: [PATCH net-next] net: dsa: mv88e6xxx: avoid some redundant vtu load/purge operations

Hi Rasmus,

On Mon, 22 Jul 2019 23:37:26 +0000, Rasmus Villemoes <[email protected]> wrote:
> We have an ERPS (Ethernet Ring Protection Switching) setup involving
> mv88e6250 switches which we're in the process of switching to a BSP
> based on the mainline driver. Breaking any link in the ring works as
> expected, with the ring reconfiguring itself quickly and traffic
> continuing with almost no noticable drops. However, when plugging back
> the cable, we see 5+ second stalls.
>
> This has been tracked down to the userspace application in charge of
> the protocol missing a few CCM messages on the good link (the one that
> was not unplugged), causing it to broadcast a "signal fail". That
> message eventually reaches its link partner, which responds by
> blocking the port. Meanwhile, the first node has continued to block
> the port with the just plugged-in cable, breaking the network. And the
> reason for those missing CCM messages has in turn been tracked down to
> the VTU apparently being too busy servicing load/purge operations that
> the normal lookups are delayed.
>
> Initial state, the link between C and D is blocked in software.
>
> _____________________
> / \
> | |
> A ----- B ----- C *---- D
>
> Unplug the cable between C and D.
>
> _____________________
> / \
> | |
> A ----- B ----- C * * D
>
> Reestablish the link between C and D.
> _____________________
> / \
> | |
> A ----- B ----- C *---- D
>
> Somehow, enough VTU/ATU operations happen inside C that prevents
> the application from receving the CCM messages from B in a timely
> manner, so a Signal Fail message is sent by C. When B receives
> that, it responds by blocking its port.
>
> _____________________
> / \
> | |
> A ----- B *---* C *---- D
>
> Very shortly after this, the signal fail condition clears on the
> BC link (some CCM messages finally make it through), so C
> unblocks the port. However, a guard timer inside B prevents it
> from removing the blocking before 5 seconds have elapsed.
>
> It is not unlikely that our userspace ERPS implementation could be
> smarter and/or is simply buggy. However, this patch fixes the symptoms
> we see, and is a small optimization that should not break anything
> (knock wood). The idea is simply to avoid doing an VTU load of an
> entry identical to the one already present. To do that, we need to
> know whether mv88e6xxx_vtu_get() actually found an existing entry, or
> has just prepared a struct mv88e6xxx_vtu_entry for us to load. To that
> end, let vlan->valid be an output parameter. The other two callers of
> mv88e6xxx_vtu_get() are not affected by this patch since they pass
> new=false.
>
> Signed-off-by: Rasmus Villemoes <[email protected]>
> ---
> drivers/net/dsa/mv88e6xxx/chip.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
> index 6b17cd961d06..2e500428670c 100644
> --- a/drivers/net/dsa/mv88e6xxx/chip.c
> +++ b/drivers/net/dsa/mv88e6xxx/chip.c
> @@ -1423,7 +1423,6 @@ static int mv88e6xxx_vtu_get(struct mv88e6xxx_chip *chip, u16 vid,
>
> /* Initialize a fresh VLAN entry */
> memset(entry, 0, sizeof(*entry));
> - entry->valid = true;
> entry->vid = vid;
>
> /* Exclude all ports */
> @@ -1618,6 +1617,9 @@ static int _mv88e6xxx_port_vlan_add(struct mv88e6xxx_chip *chip, int port,
> if (err)
> return err;
>
> + if (vlan.valid && vlan.member[port] == member)
> + return 0;
> + vlan.valid = true;
> vlan.member[port] = member;
>
> err = mv88e6xxx_vtu_loadpurge(chip, &vlan);

I'd prefer not to use the mv88e6xxx_vtu_entry structure for output
parameters, this can be confusing. As you correctly mentioned, this
initialization is only done for _mv88e6xxx_port_vlan_add, so I'll
prepare a patch which gets rid of the boolean parameter and move that
logic up.


Thanks,
Vivien