2024-02-15 17:26:30

by Praveen Kannoju

[permalink] [raw]
Subject: [PATCH RFC] bonding: rate-limit bonding driver inspect messages

Rate limit bond driver log messages, to prevent a log flood in a run-away
situation, e.g couldn't get rtnl lock. Message flood leads to instability
of system and loss of other crucial messages.

v2: Use exising net_ratelimit() instead of introducing new rate-limit
parameter.

Signed-off-by: Praveen Kumar Kannoju <[email protected]>
---
drivers/net/bonding/bond_main.c | 36 ++++++++++++++++++++----------------
1 file changed, 20 insertions(+), 16 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 4e0600c..e92eba1 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -2610,12 +2610,13 @@ static int bond_miimon_inspect(struct bonding *bond)
commit++;
slave->delay = bond->params.downdelay;
if (slave->delay) {
- slave_info(bond->dev, slave->dev, "link status down for %sinterface, disabling it in %d ms\n",
- (BOND_MODE(bond) ==
- BOND_MODE_ACTIVEBACKUP) ?
- (bond_is_active_slave(slave) ?
- "active " : "backup ") : "",
- bond->params.downdelay * bond->params.miimon);
+ if (net_ratelimit())
+ slave_info(bond->dev, slave->dev, "link status down for %sinterface, disabling it in %d ms\n",
+ (BOND_MODE(bond) ==
+ BOND_MODE_ACTIVEBACKUP) ?
+ (bond_is_active_slave(slave) ?
+ "active " : "backup ") : "",
+ bond->params.downdelay * bond->params.miimon);
}
fallthrough;
case BOND_LINK_FAIL:
@@ -2623,9 +2624,10 @@ static int bond_miimon_inspect(struct bonding *bond)
/* recovered before downdelay expired */
bond_propose_link_state(slave, BOND_LINK_UP);
slave->last_link_up = jiffies;
- slave_info(bond->dev, slave->dev, "link status up again after %d ms\n",
- (bond->params.downdelay - slave->delay) *
- bond->params.miimon);
+ if (net_ratelimit())
+ slave_info(bond->dev, slave->dev, "link status up again after %d ms\n",
+ (bond->params.downdelay - slave->delay) *
+ bond->params.miimon);
commit++;
continue;
}
@@ -2648,18 +2650,20 @@ static int bond_miimon_inspect(struct bonding *bond)
slave->delay = bond->params.updelay;

if (slave->delay) {
- slave_info(bond->dev, slave->dev, "link status up, enabling it in %d ms\n",
- ignore_updelay ? 0 :
- bond->params.updelay *
- bond->params.miimon);
+ if (net_ratelimit())
+ slave_info(bond->dev, slave->dev, "link status up, enabling it in %d ms\n",
+ ignore_updelay ? 0 :
+ bond->params.updelay *
+ bond->params.miimon);
}
fallthrough;
case BOND_LINK_BACK:
if (!link_state) {
bond_propose_link_state(slave, BOND_LINK_DOWN);
- slave_info(bond->dev, slave->dev, "link status down again after %d ms\n",
- (bond->params.updelay - slave->delay) *
- bond->params.miimon);
+ if (net_ratelimit())
+ slave_info(bond->dev, slave->dev, "link status down again after %d ms\n",
+ (bond->params.updelay - slave->delay) *
+ bond->params.miimon);
commit++;
continue;
}
--
1.8.3.1



2024-02-16 09:03:45

by Hangbin Liu

[permalink] [raw]
Subject: Re: [PATCH RFC] bonding: rate-limit bonding driver inspect messages

On Thu, Feb 15, 2024 at 10:55:54PM +0530, Praveen Kumar Kannoju wrote:
> Rate limit bond driver log messages, to prevent a log flood in a run-away
> situation, e.g couldn't get rtnl lock. Message flood leads to instability
> of system and loss of other crucial messages.

Hi Praveen,

The patch looks good to me. But would you please help explain why these
slave_info() are chosen under net_ratelimit?

Thanks
Hangbin
>
> v2: Use exising net_ratelimit() instead of introducing new rate-limit
> parameter.
>
> Signed-off-by: Praveen Kumar Kannoju <[email protected]>
> ---
> drivers/net/bonding/bond_main.c | 36 ++++++++++++++++++++----------------
> 1 file changed, 20 insertions(+), 16 deletions(-)
>
> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
> index 4e0600c..e92eba1 100644
> --- a/drivers/net/bonding/bond_main.c
> +++ b/drivers/net/bonding/bond_main.c
> @@ -2610,12 +2610,13 @@ static int bond_miimon_inspect(struct bonding *bond)
> commit++;
> slave->delay = bond->params.downdelay;
> if (slave->delay) {
> - slave_info(bond->dev, slave->dev, "link status down for %sinterface, disabling it in %d ms\n",
> - (BOND_MODE(bond) ==
> - BOND_MODE_ACTIVEBACKUP) ?
> - (bond_is_active_slave(slave) ?
> - "active " : "backup ") : "",
> - bond->params.downdelay * bond->params.miimon);
> + if (net_ratelimit())
> + slave_info(bond->dev, slave->dev, "link status down for %sinterface, disabling it in %d ms\n",
> + (BOND_MODE(bond) ==
> + BOND_MODE_ACTIVEBACKUP) ?
> + (bond_is_active_slave(slave) ?
> + "active " : "backup ") : "",
> + bond->params.downdelay * bond->params.miimon);
> }
> fallthrough;
> case BOND_LINK_FAIL:
> @@ -2623,9 +2624,10 @@ static int bond_miimon_inspect(struct bonding *bond)
> /* recovered before downdelay expired */
> bond_propose_link_state(slave, BOND_LINK_UP);
> slave->last_link_up = jiffies;
> - slave_info(bond->dev, slave->dev, "link status up again after %d ms\n",
> - (bond->params.downdelay - slave->delay) *
> - bond->params.miimon);
> + if (net_ratelimit())
> + slave_info(bond->dev, slave->dev, "link status up again after %d ms\n",
> + (bond->params.downdelay - slave->delay) *
> + bond->params.miimon);
> commit++;
> continue;
> }
> @@ -2648,18 +2650,20 @@ static int bond_miimon_inspect(struct bonding *bond)
> slave->delay = bond->params.updelay;
>
> if (slave->delay) {
> - slave_info(bond->dev, slave->dev, "link status up, enabling it in %d ms\n",
> - ignore_updelay ? 0 :
> - bond->params.updelay *
> - bond->params.miimon);
> + if (net_ratelimit())
> + slave_info(bond->dev, slave->dev, "link status up, enabling it in %d ms\n",
> + ignore_updelay ? 0 :
> + bond->params.updelay *
> + bond->params.miimon);
> }
> fallthrough;
> case BOND_LINK_BACK:
> if (!link_state) {
> bond_propose_link_state(slave, BOND_LINK_DOWN);
> - slave_info(bond->dev, slave->dev, "link status down again after %d ms\n",
> - (bond->params.updelay - slave->delay) *
> - bond->params.miimon);
> + if (net_ratelimit())
> + slave_info(bond->dev, slave->dev, "link status down again after %d ms\n",
> + (bond->params.updelay - slave->delay) *
> + bond->params.miimon);
> commit++;
> continue;
> }
> --
> 1.8.3.1
>

2024-02-17 12:40:14

by Praveen Kannoju

[permalink] [raw]
Subject: RE: [PATCH RFC] bonding: rate-limit bonding driver inspect messages

> -----Original Message-----
> From: Hangbin Liu <[email protected]>
> Sent: 16 February 2024 02:33 PM
> To: Praveen Kannoju <[email protected]>
> Cc: [email protected]; [email protected]; [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected]; Rajesh Sivaramasubramaniom
> <[email protected]>; Rama Nichanamatlu <[email protected]>; Manjunath Patil
> <[email protected]>
> Subject: Re: [PATCH RFC] bonding: rate-limit bonding driver inspect messages
>
> On Thu, Feb 15, 2024 at 10:55:54PM +0530, Praveen Kumar Kannoju wrote:
> > Rate limit bond driver log messages, to prevent a log flood in a
> > run-away situation, e.g couldn't get rtnl lock. Message flood leads to
> > instability of system and loss of other crucial messages.
>
> Hi Praveen,
>
> The patch looks good to me. But would you please help explain why these
> slave_info() are chosen under net_ratelimit?
>
> Thanks
> Hangbin

Thank you, Hangbin.

The routine bond_mii_monitor() periodically inspects the slave carrier state in order to detect for state changes, on a state change internally records it and does the state change action.

Parked-to-Parked state changes goes through transient state. As an example for Up to Down, BOND_LINK_UP to BOND_LINK_DOWN, is thru BOND_LINK_FAIL. In order to attain next parked state or transient state bond needs rtnl mutex If in a situation it cannot get it, a state change wouldn't happen. In order to achieve a state change as quickly as possible bond_mii_monitor() reschedules itself to come around after 1 msec. And every single come around reinspects the link and sees a state change compared to its internally recorded, which in reality internal state could be not changed earlier as failed to get rtnl lock, and throws again log indicating it sees a state change If attaining rtnl mutex take long say hypothetical 5 secs, then bond logs 5000 state change message. 1 message at every 1 msec.

And in production environments we have seen bond taking long to achieve a state as someone else holding rtnl. Many processes do get rtnl lock. As an example we can see eth drivers. They hold rtnl mutex for the entire duration while performing a fault recovery. There are many such scenarios.

This patch doesn't change -how- bond functions. It only simply limits this kind of log flood.

-
Praveen.
> >
> > v2: Use exising net_ratelimit() instead of introducing new rate-limit
> > parameter.
> >
> > Signed-off-by: Praveen Kumar Kannoju <[email protected]>
> > ---
> > drivers/net/bonding/bond_main.c | 36
> > ++++++++++++++++++++----------------
> > 1 file changed, 20 insertions(+), 16 deletions(-)
> >
> > diff --git a/drivers/net/bonding/bond_main.c
> > b/drivers/net/bonding/bond_main.c index 4e0600c..e92eba1 100644
> > --- a/drivers/net/bonding/bond_main.c
> > +++ b/drivers/net/bonding/bond_main.c
> > @@ -2610,12 +2610,13 @@ static int bond_miimon_inspect(struct bonding *bond)
> > commit++;
> > slave->delay = bond->params.downdelay;
> > if (slave->delay) {
> > - slave_info(bond->dev, slave->dev, "link status down for %sinterface, disabling it in %d ms\n",
> > - (BOND_MODE(bond) ==
> > - BOND_MODE_ACTIVEBACKUP) ?
> > - (bond_is_active_slave(slave) ?
> > - "active " : "backup ") : "",
> > - bond->params.downdelay * bond->params.miimon);
> > + if (net_ratelimit())
> > + slave_info(bond->dev, slave->dev, "link status down for %sinterface, disabling it in %d ms\n",
> > + (BOND_MODE(bond) ==
> > + BOND_MODE_ACTIVEBACKUP) ?
> > + (bond_is_active_slave(slave) ?
> > + "active " : "backup ") : "",
> > + bond->params.downdelay * bond->params.miimon);
> > }
> > fallthrough;
> > case BOND_LINK_FAIL:
> > @@ -2623,9 +2624,10 @@ static int bond_miimon_inspect(struct bonding *bond)
> > /* recovered before downdelay expired */
> > bond_propose_link_state(slave, BOND_LINK_UP);
> > slave->last_link_up = jiffies;
> > - slave_info(bond->dev, slave->dev, "link status up again after %d ms\n",
> > - (bond->params.downdelay - slave->delay) *
> > - bond->params.miimon);
> > + if (net_ratelimit())
> > + slave_info(bond->dev, slave->dev, "link status up again after %d ms\n",
> > + (bond->params.downdelay - slave->delay) *
> > + bond->params.miimon);
> > commit++;
> > continue;
> > }
> > @@ -2648,18 +2650,20 @@ static int bond_miimon_inspect(struct bonding *bond)
> > slave->delay = bond->params.updelay;
> >
> > if (slave->delay) {
> > - slave_info(bond->dev, slave->dev, "link status up, enabling it in %d ms\n",
> > - ignore_updelay ? 0 :
> > - bond->params.updelay *
> > - bond->params.miimon);
> > + if (net_ratelimit())
> > + slave_info(bond->dev, slave->dev, "link status up, enabling it in %d ms\n",
> > + ignore_updelay ? 0 :
> > + bond->params.updelay *
> > + bond->params.miimon);
> > }
> > fallthrough;
> > case BOND_LINK_BACK:
> > if (!link_state) {
> > bond_propose_link_state(slave, BOND_LINK_DOWN);
> > - slave_info(bond->dev, slave->dev, "link status down again after %d ms\n",
> > - (bond->params.updelay - slave->delay) *
> > - bond->params.miimon);
> > + if (net_ratelimit())
> > + slave_info(bond->dev, slave->dev, "link status down again after %d ms\n",
> > + (bond->params.updelay - slave->delay) *
> > + bond->params.miimon);
> > commit++;
> > continue;
> > }
> > --
> > 1.8.3.1
> >

2024-02-18 03:09:41

by Hangbin Liu

[permalink] [raw]
Subject: Re: [PATCH RFC] bonding: rate-limit bonding driver inspect messages

On Sat, Feb 17, 2024 at 12:39:44PM +0000, Praveen Kannoju wrote:
> > -----Original Message-----
> > From: Hangbin Liu <[email protected]>
> > Sent: 16 February 2024 02:33 PM
> > To: Praveen Kannoju <[email protected]>
> > Cc: [email protected]; [email protected]; [email protected]; [email protected]; [email protected];
> > [email protected]; [email protected]; [email protected]; Rajesh Sivaramasubramaniom
> > <[email protected]>; Rama Nichanamatlu <[email protected]>; Manjunath Patil
> > <[email protected]>
> > Subject: Re: [PATCH RFC] bonding: rate-limit bonding driver inspect messages
> >
> > On Thu, Feb 15, 2024 at 10:55:54PM +0530, Praveen Kumar Kannoju wrote:
> > > Rate limit bond driver log messages, to prevent a log flood in a
> > > run-away situation, e.g couldn't get rtnl lock. Message flood leads to
> > > instability of system and loss of other crucial messages.
> >
> > Hi Praveen,
> >
> > The patch looks good to me. But would you please help explain why these
> > slave_info() are chosen under net_ratelimit?
> >
> > Thanks
> > Hangbin
>
> Thank you, Hangbin.
>
> The routine bond_mii_monitor() periodically inspects the slave carrier state in order to detect for state changes, on a state change internally records it and does the state change action.
>
> Parked-to-Parked state changes goes through transient state. As an example for Up to Down, BOND_LINK_UP to BOND_LINK_DOWN, is thru BOND_LINK_FAIL. In order to attain next parked state or transient state bond needs rtnl mutex. If in a situation it cannot get it, a state change wouldn't happen. In order to achieve a state change as quickly as possible bond_mii_monitor() reschedules itself to come around after 1 msec.

I think a large miimon downdelay/updelay setting could reduce this.

> And every single come around reinspects the link and sees a state change compared to its internally recorded, which in reality internal state could be not changed earlier as failed to get rtnl lock, and throws again log indicating it sees a state change. If attaining rtnl mutex take long say hypothetical 5 secs, then bond logs 5000 state change message. 1 message at every 1 msec.

Anyway, setting the rate limit do reduce the message flood. Would you please
summarise the paragraph and add it in commit description when post the formal
patch?

thanks
Hangbin


2024-02-19 11:50:48

by Praveen Kannoju

[permalink] [raw]
Subject: RE: [PATCH RFC] bonding: rate-limit bonding driver inspect messages



> -----Original Message-----
> From: Hangbin Liu <[email protected]>
> Sent: 18 February 2024 08:39 AM
> To: Praveen Kannoju <[email protected]>
> Cc: [email protected]; [email protected]; [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected]; Rajesh Sivaramasubramaniom
> <[email protected]>; Rama Nichanamatlu <[email protected]>; Manjunath Patil
> <[email protected]>
> Subject: Re: [PATCH RFC] bonding: rate-limit bonding driver inspect messages
>
> On Sat, Feb 17, 2024 at 12:39:44PM +0000, Praveen Kannoju wrote:
> > > -----Original Message-----
> > > From: Hangbin Liu <[email protected]>
> > > Sent: 16 February 2024 02:33 PM
> > > To: Praveen Kannoju <[email protected]>
> > > Cc: [email protected]; [email protected]; [email protected];
> > > [email protected]; [email protected]; [email protected];
> > > [email protected]; [email protected]; Rajesh
> > > Sivaramasubramaniom <[email protected]>; Rama
> > > Nichanamatlu <[email protected]>; Manjunath Patil
> > > <[email protected]>
> > > Subject: Re: [PATCH RFC] bonding: rate-limit bonding driver inspect
> > > messages
> > >
> > > On Thu, Feb 15, 2024 at 10:55:54PM +0530, Praveen Kumar Kannoju wrote:
> > > > Rate limit bond driver log messages, to prevent a log flood in a
> > > > run-away situation, e.g couldn't get rtnl lock. Message flood
> > > > leads to instability of system and loss of other crucial messages.
> > >
> > > Hi Praveen,
> > >
> > > The patch looks good to me. But would you please help explain why
> > > these
> > > slave_info() are chosen under net_ratelimit?
> > >
> > > Thanks
> > > Hangbin
> >
> > Thank you, Hangbin.
> >
> > The routine bond_mii_monitor() periodically inspects the slave carrier state in order to detect for state changes, on a state change
> internally records it and does the state change action.
> >
> > Parked-to-Parked state changes goes through transient state. As an example for Up to Down, BOND_LINK_UP to
> BOND_LINK_DOWN, is thru BOND_LINK_FAIL. In order to attain next parked state or transient state bond needs rtnl mutex. If in a
> situation it cannot get it, a state change wouldn't happen. In order to achieve a state change as quickly as possible
> bond_mii_monitor() reschedules itself to come around after 1 msec.
>
> I think a large miimon downdelay/updelay setting could reduce this.
>
> > And every single come around reinspects the link and sees a state change compared to its internally recorded, which in reality
> internal state could be not changed earlier as failed to get rtnl lock, and throws again log indicating it sees a state change. If attaining
> rtnl mutex take long say hypothetical 5 secs, then bond logs 5000 state change message. 1 message at every 1 msec.
>
> Anyway, setting the rate limit do reduce the message flood. Would you please summarise the paragraph and add it in commit
> description when post the formal patch?
>
> thanks
> Hangbin

Thank you very much, Hangbin.

I've added the summary on why we intend to rate-limit the messages in the commit, and re-sent the formal patch.

-
Praveen.