2012-10-01 09:33:11

by Michael Chan

Subject: Re: tg3 driver upgrade (Linux 2.6.32 -> 3.2) breaks IBM Bladecenter SoL

On Fri, 2012-09-28 at 22:45 +0200, Ferenc Wagner wrote:
> Hi,
>
> Upgrading the kernel on our HS20 blades resulted in their SoL (serial
> over LAN) connection being broken. The disconnection happens when eth0
> (the interface involved in SoL) is brought up during the boot sequence.
> If I later "ip link set eth0 down", then the connection is restored, but
> "ip link set eth0 up" breaks it again on 3.2. ethtool -a, -c, -g, -k
> and -u show no difference; ethtool -i on the 2.6.32 kernel reports:
>
> driver: tg3
> version: 3.116
> firmware-version: 5704s-v3.38, ASFIPMIs v2.47
> bus-info: 0000:05:01.0
>
> In the 3.2 kernel the driver version is 3.121.

2.6.32 to 3.2 is a big jump. Can you narrow this down further? It will
be hard for us to find a HS20 with 5704 to test this. Thanks.
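
One way to compare the `ethtool -i` reports from the two kernels is to parse them into key/value pairs and list only the fields that differ. The following is a minimal Python sketch, not something posted in the thread: the helper names `parse_ethtool_info` and `diff_reports` are hypothetical, and the 3.2 report contains only the driver version that the original message states (3.121); the other fields were not reported for that kernel.

```python
def parse_ethtool_info(text):
    """Parse `ethtool -i` style "key: value" lines into a dict."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so values such as the PCI
        # address "0000:05:01.0" stay intact.
        key, sep, value = line.partition(":")
        if sep:  # skip lines without a colon
            info[key.strip()] = value.strip()
    return info


def diff_reports(old, new):
    """Return {key: (old, new)} for keys present in both reports that differ."""
    return {k: (old[k], new[k]) for k in old if k in new and old[k] != new[k]}


# Report quoted in the thread for the 2.6.32 kernel.
REPORT_2_6_32 = """\
driver: tg3
version: 3.116
firmware-version: 5704s-v3.38, ASFIPMIs v2.47
bus-info: 0000:05:01.0
"""

# Only the version is known for 3.2; the rest is deliberately left out.
REPORT_3_2 = "driver: tg3\nversion: 3.121\n"

old = parse_ethtool_info(REPORT_2_6_32)
new = parse_ethtool_info(REPORT_3_2)
print(diff_reports(old, new))  # {'version': ('3.116', '3.121')}
```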


2012-10-02 09:31:33

by Ferenc Wagner

Subject: Re: tg3 driver upgrade (Linux 2.6.32 -> 3.2) breaks IBM Bladecenter SoL

"Michael Chan" <[email protected]> writes:

> On Fri, 2012-09-28 at 22:45 +0200, Ferenc Wagner wrote:
>
>> Upgrading the kernel on our HS20 blades resulted in their SoL (serial
>> over LAN) connection being broken. The disconnection happens when eth0
>> (the interface involved in SoL) is brought up during the boot sequence.
>> If I later "ip link set eth0 down", then the connection is restored, but
>> "ip link set eth0 up" breaks it again on 3.2. ethtool -a, -c, -g, -k
>> and -u show no difference; ethtool -i on the 2.6.32 kernel reports:
>>
>> driver: tg3
>> version: 3.116
>> firmware-version: 5704s-v3.38, ASFIPMIs v2.47
>> bus-info: 0000:05:01.0
>>
>> In the 3.2 kernel the driver version is 3.121.
>
> 2.6.32 to 3.2 is a big jump. Can you narrow this down further? It will
> be hard for us to find a HS20 with 5704 to test this. Thanks.

Certainly, I'm bisecting it now, but I thought I would drop in the
question in case it rings some bells somewhere. Given the nature of the
problem, it isn't much fun to bisect, and the stripped down kernel I'm
testing with breaks the SoL connection for a couple of seconds even in
the "good" cases. I'm already down to 13 steps...
--
Thanks,
Feri.
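
Bisection halves the set of suspect commits at every step, so the number of steps grows only with the base-2 logarithm of the range. As a rough illustration (git's own estimate can differ slightly because kernel history is not linear, and the function names here are hypothetical), 13 remaining steps correspond to a range on the order of 2^13 = 8192 commits:

```python
def bisect_steps(n_candidates):
    """Worst-case number of halvings needed to isolate one of n_candidates."""
    if n_candidates < 1:
        raise ValueError("need at least one candidate commit")
    # (n - 1).bit_length() equals ceil(log2(n)) for positive integers,
    # without any floating-point rounding.
    return (n_candidates - 1).bit_length()


def candidates_for(steps):
    """Largest range that can be fully bisected in the given number of steps."""
    return 2 ** steps


print(bisect_steps(8192))   # 13
print(candidates_for(13))   # 8192
```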

2012-10-02 12:08:11

by Ferenc Wagner

Subject: Re: tg3 driver upgrade (Linux 2.6.32 -> 3.2) breaks IBM Bladecenter SoL

"Michael Chan" <[email protected]> writes:

> On Fri, 2012-09-28 at 22:45 +0200, Ferenc Wagner wrote:
>
>> Upgrading the kernel on our HS20 blades resulted in their SoL (serial
>> over LAN) connection being broken. The disconnection happens when eth0
>> (the interface involved in SoL) is brought up during the boot sequence.
>> If I later "ip link set eth0 down", then the connection is restored, but
>> "ip link set eth0 up" breaks it again on 3.2. ethtool -a, -c, -g, -k
>> and -u show no difference; ethtool -i on the 2.6.32 kernel reports:
>>
>> driver: tg3
>> version: 3.116
>> firmware-version: 5704s-v3.38, ASFIPMIs v2.47
>> bus-info: 0000:05:01.0
>>
>> In the 3.2 kernel the driver version is 3.121.
>
> 2.6.32 to 3.2 is a big jump. Can you narrow this down further? It will
> be hard for us to find a HS20 with 5704 to test this. Thanks.

I'm done with bisecting it: the first bad commit is:

commit dabc5c670d3f86d15ee4f42ab38ec5bd2682487d
Author: Matt Carlson <[email protected]>
Date: Thu May 19 12:12:52 2011 +0000

tg3: Move TSO_CAPABLE assignment

This patch moves the code that asserts the TSO_CAPABLE flag closer to
where the TSO capabilities flags are set. There isn't a good enough
reason for the code to be separated.

Signed-off-by: Matt Carlson <[email protected]>
Reviewed-by: Michael Chan <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

On the other hand, losing the SoL console even temporarily during boot
(as it happens with a minimal kernel before this commit) isn't nice
either. I'll try to look after that, too, just mentioning it here...
--
Regards,
Feri.

2012-10-02 15:03:31

by Michael Chan

Subject: Re: tg3 driver upgrade (Linux 2.6.32 -> 3.2) breaks IBM Bladecenter SoL

On Tue, 2012-10-02 at 14:07 +0200, Ferenc Wagner wrote:
> I'm done with bisecting it: the first bad commit is:
>
> commit dabc5c670d3f86d15ee4f42ab38ec5bd2682487d
> Author: Matt Carlson <[email protected]>
> Date: Thu May 19 12:12:52 2011 +0000
>
> tg3: Move TSO_CAPABLE assignment
>
> This patch moves the code that asserts the TSO_CAPABLE flag closer to
> where the TSO capabilities flags are set. There isn't a good enough
> reason for the code to be separated.
>
> Signed-off-by: Matt Carlson <[email protected]>
> Reviewed-by: Michael Chan <[email protected]>
> Signed-off-by: David S. Miller <[email protected]>

Thanks, I'll look into this.
>
> On the other hand, losing the SoL console even temporarily during boot
> (as it happens with a minimal kernel before this commit) isn't nice
> either. I'll try to look after that, too, just mentioning it here...

This is expected as the driver has to reset the link and you'll lose SoL
for a few seconds until link comes back up. We can look into an
enhancement to not touch the link if it is already in a good state when
the driver comes up.

2012-10-02 16:54:21

by Ferenc Wagner

Subject: Re: tg3 driver upgrade (Linux 2.6.32 -> 3.2) breaks IBM Bladecenter SoL

"Michael Chan" <[email protected]> writes:

> On Tue, 2012-10-02 at 14:07 +0200, Ferenc Wagner wrote:
>
>> I'm done with bisecting it: the first bad commit is:
>>
>> commit dabc5c670d3f86d15ee4f42ab38ec5bd2682487d
>> Author: Matt Carlson <[email protected]>
>> Date: Thu May 19 12:12:52 2011 +0000
>>
>> tg3: Move TSO_CAPABLE assignment
>>
>> This patch moves the code that asserts the TSO_CAPABLE flag closer to
>> where the TSO capabilities flags are set. There isn't a good enough
>> reason for the code to be separated.
>>
>> Signed-off-by: Matt Carlson <[email protected]>
>> Reviewed-by: Michael Chan <[email protected]>
>> Signed-off-by: David S. Miller <[email protected]>
>
> Thanks, I'll look into this.

Going in the opposite direction: I found that Linux 3.6 does not
permanently break the SoL console on upping eth0! I'll try to find the
commit which (sort of) fixed it.

>> On the other hand, losing the SoL console even temporarily during boot
>> (as it happens with a minimal kernel before this commit) isn't nice
>> either. I'll try to look after that, too, just mentioning it here...
>
> This is expected as the driver has to reset the link and you'll lose SoL
> for a few seconds until link comes back up. We can look into an
> enhancement to not touch the link if it is already in a good state when
> the driver comes up.

This looks more complicated here. In our production setup under 2.6.32
(stock Debian squeeze system) the SoL console is not broken during boot
at all. I don't say there are no dropouts at all, but the management
system does not detach the console, as it promptly did in every case
during the bisection. I could not reproduce this (preferred)
behavior with self-built kernels yet (not even with 2.6.18, which also
worked fine when built by Debian, if I remember correctly). I'll
continue investigating this issue.
--
Thanks,
Feri.

2012-10-02 17:07:49

by Michael Chan

Subject: Re: tg3 driver upgrade (Linux 2.6.32 -> 3.2) breaks IBM Bladecenter SoL

On Tue, 2012-10-02 at 18:49 +0200, Ferenc Wagner wrote:
> Going in the opposite direction: I found that Linux 3.6 does not
> permanently break the SoL console on upping eth0! I'll try to find
> the commit which (sort of) fixed it.

These are the likely fixes:

commit cf9ecf4b631f649a964fa611f1a5e8874f2a76db
Author: Matt Carlson <[email protected]>
Date: Mon Nov 28 09:41:03 2011 +0000

tg3: Fix TSO CAP for 5704 devs w / ASF enabled

commit 7196cd6c3d4863000ef88b09f34d6dd75610ec3e
Author: Matt Carlson <[email protected]>
Date: Thu May 19 16:02:44 2011 +0000

tg3: Add braces around 5906 workaround.


2012-10-02 18:49:29

by Ferenc Wagner

Subject: Re: tg3 driver upgrade (Linux 2.6.32 -> 3.2) breaks IBM Bladecenter SoL

"Michael Chan" <[email protected]> writes:

> On Tue, 2012-10-02 at 18:49 +0200, Ferenc Wagner wrote:
>
>> Going in the opposite direction: I found that Linux 3.6 does not
>> permanently break the SoL console on upping eth0! I'll try to find
>> the commit which (sort of) fixed it.
>
> These are the likely fixes:
>
> commit cf9ecf4b631f649a964fa611f1a5e8874f2a76db
> Author: Matt Carlson <[email protected]>
> Date: Mon Nov 28 09:41:03 2011 +0000
>
> tg3: Fix TSO CAP for 5704 devs w / ASF enabled

You are exactly right: cf9ecf4b fixed the permanent SoL breakage
introduced by dabc5c67. Looks like ASF utilizes similar technology to
that of the HS20 BMC. Thanks for the tip, it greatly reduced our CPU
wear. :) It's a pity ethtool -k did not give a hint. Do you think it's
possible to work around in 3.2 by e.g. fiddling some ethtool setting?
--
Thanks,
Feri.

2012-10-02 19:07:04

by Michael Tokarev

Subject: Re: tg3 driver upgrade (Linux 2.6.32 -> 3.2) breaks IBM Bladecenter SoL

On 02.10.2012 22:49, Ferenc Wagner wrote:
> "Michael Chan" <[email protected]> writes:
>> These are the likely fixes:
>>
>> commit cf9ecf4b631f649a964fa611f1a5e8874f2a76db
>> Author: Matt Carlson <[email protected]>
>> Date: Mon Nov 28 09:41:03 2011 +0000
>>
>> tg3: Fix TSO CAP for 5704 devs w / ASF enabled
>
> You are exactly right: cf9ecf4b fixed the permanent SoL breakage
> introduced by dabc5c67. Looks like ASF utilizes similar technology to
> that of the HS20 BMC. Thanks for the tip, it greatly reduced our CPU
> wear. :) It's a pity ethtool -k did not give a hint. Do you think it's
> possible to work around in 3.2 by e.g. fiddling some ethtool setting?

Maybe it's better to push this commit to -stable instead? (the commit
that broke things is part of 3.0 kernel so all current 3.x -stable
kernels are affected)

(Besides, that commit ("This patch fixes the problem by revisiting and
reevaluating the decision after tg3_get_eeprom_hw_cfg() is called.")
merely copies a somewhat "twisted" chunk of code into another place,
which does not look optimal.)

Thanks,

/mjt
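
The ordering problem behind the regression and its fix can be illustrated with a toy model. This is a hedged Python sketch and not driver code: `ToyNic`, `load_hw_cfg`, `decide_tso` and `probe` are hypothetical names, and the rule "no TSO while ASF is enabled" is an assumption inferred from the fix's title and from its description of re-evaluating the decision after tg3_get_eeprom_hw_cfg() is called.

```python
from dataclasses import dataclass


@dataclass
class ToyNic:
    """Toy stand-in for driver state; not the real tg3 structures."""
    asf_in_nvram: bool          # what the hardware configuration says
    asf_enabled: bool = False   # driver's view, valid only after load_hw_cfg()
    tso_capable: bool = False

    def load_hw_cfg(self):
        # Plays the role of tg3_get_eeprom_hw_cfg() in this model.
        self.asf_enabled = self.asf_in_nvram

    def decide_tso(self):
        # Assumed rule: TSO must stay off while ASF is enabled.
        self.tso_capable = not self.asf_enabled


def probe(order):
    """Run the named steps in order on an ASF-enabled NIC; return tso_capable."""
    nic = ToyNic(asf_in_nvram=True)
    for step in order:
        getattr(nic, step)()
    return nic.tso_capable


# Decision made after the config is read: TSO correctly stays off.
print(probe(["load_hw_cfg", "decide_tso"]))                # False
# Decision moved before the config is read: TSO is wrongly enabled.
print(probe(["decide_tso", "load_hw_cfg"]))                # True
# Decision re-evaluated after the config is read: correct again.
print(probe(["decide_tso", "load_hw_cfg", "decide_tso"]))  # False
```

The third ordering mirrors the objection raised here: the decision logic runs twice, which works but duplicates code.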

2012-10-03 00:17:43

by Ben Hutchings

Subject: Re: tg3 driver upgrade (Linux 2.6.32 -> 3.2) breaks IBM Bladecenter SoL

On Tue, 2012-10-02 at 23:06 +0400, Michael Tokarev wrote:
> On 02.10.2012 22:49, Ferenc Wagner wrote:
> > "Michael Chan" <[email protected]> writes:
> >> These are the likely fixes:
> >>
> >> commit cf9ecf4b631f649a964fa611f1a5e8874f2a76db
> >> Author: Matt Carlson <[email protected]>
> >> Date: Mon Nov 28 09:41:03 2011 +0000
> >>
> >> tg3: Fix TSO CAP for 5704 devs w / ASF enabled
> >
> > You are exactly right: cf9ecf4b fixed the permanent SoL breakage
> > introduced by dabc5c67. Looks like ASF utilizes similar technology to
> > that of the HS20 BMC. Thanks for the tip, it greatly reduced our CPU
> > wear. :) It's a pity ethtool -k did not give a hint. Do you think it's
> > possible to work around in 3.2 by e.g. fiddling some ethtool setting?
>
> Maybe it's better to push this commit to -stable instead?

But that will take time, so I imagine a temporary workaround would be
useful to Ferenc.

> (the commit
> that broke things is part of 3.0 kernel so all current 3.x -stable
> kernels are affected)
[...]

The fix went into 3.3, so only 3.0 and 3.2 need it.

David, please can you include the above commit in your next batches for
these stable series?

Ben.

--
Ben Hutchings
For every complex problem
there is a solution that is simple, neat, and wrong.



2012-10-03 00:47:13

by David Miller

Subject: Re: tg3 driver upgrade (Linux 2.6.32 -> 3.2) breaks IBM Bladecenter SoL

From: Ben Hutchings <[email protected]>
Date: Wed, 03 Oct 2012 01:17:12 +0100

> On Tue, 2012-10-02 at 23:06 +0400, Michael Tokarev wrote:
>> On 02.10.2012 22:49, Ferenc Wagner wrote:
>> > "Michael Chan" <[email protected]> writes:
>> >> These are the likely fixes:
>> >>
>> >> commit cf9ecf4b631f649a964fa611f1a5e8874f2a76db
>> >> Author: Matt Carlson <[email protected]>
>> >> Date: Mon Nov 28 09:41:03 2011 +0000
>> >>
>> >> tg3: Fix TSO CAP for 5704 devs w / ASF enabled
...
> The fix went into 3.3, so only 3.0 and 3.2 need it.
>
> David, please can you include the above commit in your next batches for
> these stable series?

Done.