2019-01-06 21:28:40

by Ian Kumlien

Subject: [BUG] v4.20 - bridge not getting DHCP responses? (works in 4.19.13)

Hi,

Switching from 4.19.x -> 4.20 resulted in DHCP not working for my VM:s.

My firewall (which also runs the dhcpd) runs VM:s and it does this by having
physical interfaces attached to bridges - which the VM:s in turn attach to.

Since 4.20 the VM:s can't use DHCP, it's odd since the requests are seen - a
response is sent but it never enters the interface attached to the bridge.

Basically:
VM vnet2: -> br0 -> eno2 -> switch -> eno1 (dhcpd)
dhcpd eno1 -> switch and... gone.

Any clues?

All the nics are handled by ixgbe


2019-01-06 22:24:24

by Ian Kumlien

Subject: Fwd: [BUG] v4.20 - bridge not getting DHCP responses? (works in 4.19.13)

[Sorry for the repost, screwed up the netdev address...]

Hi,

Switching from 4.19.x -> 4.20 resulted in DHCP not working for my VM:s.

My firewall (which also runs the dhcpd) runs VM:s and it does this by having
physical interfaces attached to bridges - which the VM:s in turn attach to.

Since 4.20 the VM:s can't use DHCP, it's odd since the requests are seen - a
response is sent but it never enters the interface attached to the bridge.

Basically:
VM vnet2: -> br0 -> eno2 -> switch -> eno1 (dhcpd)
dhcpd eno1 -> switch and... gone.

Any clues?

All the nics are handled by ixgbe

2019-01-08 22:57:03

by Ian Kumlien

Subject: Re: [BUG] v4.20 - bridge not getting DHCP responses? (works in 4.19.13)

On Sun, Jan 6, 2019 at 11:21 PM Ian Kumlien <[email protected]> wrote:
>
> [Sorry for the repost, screwed up the netdev address...]
>
> Hi,
>
> Switching from 4.19.x -> 4.20 resulted in DHCP not working for my VM:s.
>
> My firewall (which also runs the dhcpd) runs VM:s and it does this by
> having physical
> interfaces attached to bridges - which the VM:s in turn attach to.
>
> Since 4.20 the VM:s can't use DHCP, it's odd since the requests are
> seen - a response is sent but
> it never enters the interface attached to the bridge.
>
> Basically:
> VM vnet2: -> br0 -> eno2 -> switch -> eno1 (dhcpd)
> dhcpd eno1 -> siwtch and... gone.
>
> Any clues?
>
> All the nics are handled by ixgbe

So, doing similar tests at work with other drivers works - could it be
related to the mac address filter that was added?
I don't *really* use VF:s though... (can't really find anything else atm)

Will try to test, but the VM:s on this machine are in use.

2019-01-08 23:06:40

by Stephen Hemminger

Subject: Re: [BUG] v4.20 - bridge not getting DHCP responses? (works in 4.19.13)

On Tue, 8 Jan 2019 23:10:04 +0100
Ian Kumlien <[email protected]> wrote:

> On Sun, Jan 6, 2019 at 11:21 PM Ian Kumlien <[email protected]> wrote:
> >
> > [Sorry for the repost, screwed up the netdev address...]
> >
> > Hi,
> >
> > Switching from 4.19.x -> 4.20 resulted in DHCP not working for my VM:s.
> >
> > My firewall (which also runs the dhcpd) runs VM:s and it does this by
> > having physical
> > interfaces attached to bridges - which the VM:s in turn attach to.
> >
> > Since 4.20 the VM:s can't use DHCP, it's odd since the requests are
> > seen - a response is sent but
> > it never enters the interface attached to the bridge.
> >
> > Basically:
> > VM vnet2: -> br0 -> eno2 -> switch -> eno1 (dhcpd)
> > dhcpd eno1 -> siwtch and... gone.
> >
> > Any clues?
> >
> > All the nics are handled by ixgbe
>
> So, doing similar tests at work with other drivers works - could it be
> related to the mac address filter that was added?
> I don't *really* use VF:s though... (can't really find anything else atm)
>
> Will try to test, but the VM:s on this machine is in use.

The default MAC address of the bridge device is that of the first device
assigned to the bridge. Remember, most VF interfaces will allow only a single
MAC address and no promiscuous mode.

2019-01-08 23:07:15

by Ian Kumlien

Subject: Re: [BUG] v4.20 - bridge not getting DHCP responses? (works in 4.19.13)

On Tue, Jan 8, 2019 at 11:51 PM Stephen Hemminger
<[email protected]> wrote:
> On Tue, 8 Jan 2019 23:10:04 +0100
> Ian Kumlien <[email protected]> wrote:
> > On Sun, Jan 6, 2019 at 11:21 PM Ian Kumlien <[email protected]> wrote:
> > >
> > > [Sorry for the repost, screwed up the netdev address...]
> > >
> > > Hi,
> > >
> > > Switching from 4.19.x -> 4.20 resulted in DHCP not working for my VM:s.
> > >
> > > My firewall (which also runs the dhcpd) runs VM:s and it does this by
> > > having physical
> > > interfaces attached to bridges - which the VM:s in turn attach to.
> > >
> > > Since 4.20 the VM:s can't use DHCP, it's odd since the requests are
> > > seen - a response is sent but
> > > it never enters the interface attached to the bridge.
> > >
> > > Basically:
> > > VM vnet2: -> br0 -> eno2 -> switch -> eno1 (dhcpd)
> > > dhcpd eno1 -> siwtch and... gone.
> > >
> > > Any clues?
> > >
> > > All the nics are handled by ixgbe
> >
> > So, doing similar tests at work with other drivers works - could it be
> > related to the mac address filter that was added?
> > I don't *really* use VF:s though... (can't really find anything else atm)
> >
> > Will try to test, but the VM:s on this machine is in use.
>
> The default MAC address of the bridge device is the first device assigned
> to the bridge. Remember most VF interfaces will only allow single MAC address
> and no promiscious mode.

Yeah, I'm not running any VF:s, and it just seems like the responses are
dropped somewhere.

When looking at "git log v4.19...v4.20 drivers/net/ethernet/intel/ixgbe/",
nothing else really stands out... The machine is also running NAT for my home
network, and all of that works just fine...

I started with tcpdump, proving that the requests made it all the way out but
the replies never made it back; rebooting with 4.19.13 resulted in replies
appearing in the tcpdump.

I don't quite know where to look or what I can do to test. I tried disabling
all offloading (due to the UDP offloading changes), but nothing helped...

Ideas? Patches? ;)
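
For anyone trying to reproduce the checks described above, they could be sketched roughly as below: capture DHCP traffic on an interface, then turn off the common offloads before retesting. The interface names are taken from the report. The `run` helper and the `DRY_RUN` switch are assumptions added here so the commands can be previewed without root, and the three offloads listed are only a guess at what "all offloading" covered.

```shell
#!/bin/sh
# Hypothetical helper: print the command by default, run it only when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Watch DHCP requests and replies (UDP ports 67/68) on one interface,
# showing link-level headers (-e) and skipping name resolution (-n).
capture_dhcp() {
    run tcpdump -n -e -i "$1" 'udp and (port 67 or port 68)'
}

# Turn off segmentation and receive offloads on one interface.
disable_offloads() {
    run ethtool -K "$1" tso off gso off gro off
}

# Preview the commands for the interfaces named in the report.
capture_dhcp eno1
disable_offloads eno2
```

Running the capture on both eno1 and eno2 at the same time should show on which side of the switch the reply goes missing.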

2019-01-08 23:11:32

by Florian Fainelli

Subject: Re: [BUG] v4.20 - bridge not getting DHCP responses? (works in 4.19.13)

On 1/8/19 3:00 PM, Ian Kumlien wrote:
> On Tue, Jan 8, 2019 at 11:51 PM Stephen Hemminger
> <[email protected]> wrote:
>> On Tue, 8 Jan 2019 23:10:04 +0100
>> Ian Kumlien <[email protected]> wrote:
>>> On Sun, Jan 6, 2019 at 11:21 PM Ian Kumlien <[email protected]> wrote:
>>>>
>>>> [Sorry for the repost, screwed up the netdev address...]
>>>>
>>>> Hi,
>>>>
>>>> Switching from 4.19.x -> 4.20 resulted in DHCP not working for my VM:s.
>>>>
>>>> My firewall (which also runs the dhcpd) runs VM:s and it does this by
>>>> having physical
>>>> interfaces attached to bridges - which the VM:s in turn attach to.
>>>>
>>>> Since 4.20 the VM:s can't use DHCP, it's odd since the requests are
>>>> seen - a response is sent but
>>>> it never enters the interface attached to the bridge.
>>>>
>>>> Basically:
>>>> VM vnet2: -> br0 -> eno2 -> switch -> eno1 (dhcpd)
>>>> dhcpd eno1 -> siwtch and... gone.
>>>>
>>>> Any clues?
>>>>
>>>> All the nics are handled by ixgbe
>>>
>>> So, doing similar tests at work with other drivers works - could it be
>>> related to the mac address filter that was added?
>>> I don't *really* use VF:s though... (can't really find anything else atm)
>>>
>>> Will try to test, but the VM:s on this machine is in use.
>>
>> The default MAC address of the bridge device is the first device assigned
>> to the bridge. Remember most VF interfaces will only allow single MAC address
>> and no promiscious mode.
>
> Yeah, I'm not running any VF:s and it just seems like the responses
> are dropped somewhere
>
> when looking at "git log v4.19...v4.20
> drivers/net/ethernet/intel/ixgbe/" nothing else really stands out...
> The machine is also running NAT for my home network and all of that
> works just fine...
>
> I started with tcpdump, prooving that packets reached all the way
> outside but replies never made it, reboorting
> with 4.19.13 resulted in replies appearing in the tcpdump.
>
> I don't quite know where to look - and what can i do to test - i tried
> disabling all offloading (due to the UDP
> offloading changes) but nothing helped...
>
> Ideas? Patches? ;)

Running a bisection would certainly help find the offending commit if
that is something that you can do?
--
Florian

2019-01-10 00:18:08

by Ian Kumlien

Subject: Re: [BUG] v4.20 - bridge not getting DHCP responses? (works in 4.19.13)

On Wed, Jan 9, 2019 at 12:17 AM Ian Kumlien <[email protected]> wrote:
> On Wed, Jan 9, 2019, 00:09 Florian Fainelli <[email protected] wrote:

[--8<---]

>> > when looking at "git log v4.19...v4.20
>> > drivers/net/ethernet/intel/ixgbe/" nothing else really stands out...
>> > The machine is also running NAT for my home network and all of that
>> > works just fine...
>> >
>> > I started with tcpdump, prooving that packets reached all the way
>> > outside but replies never made it, reboorting
>> > with 4.19.13 resulted in replies appearing in the tcpdump.
>> >
>> > I don't quite know where to look - and what can i do to test - i tried
>> > disabling all offloading (due to the UDP
>> > offloading changes) but nothing helped...
>> >
>> > Ideas? Patches? ;)
>>
>> Running a bisection would certainly help find the offending commit if
>> that is something that you can do?
>
> I was hoping for a likely suspect but this was on my "Todo" for Friday night anyway... (And I already started testing with some patches reversed)

So, after lengthy git bisect sessions, starting both from the latest stable I
was using (not the best of ideas) and from 4.19:

The latest stable yielded 72b0094f918294e6cb8cf5c3b4520d928fbb1a57 -
which is incorrect...

However, the proper bisect gave me this:
fb420d5d91c1274d5966917725e71f27ed092a85 is the first bad commit
commit fb420d5d91c1274d5966917725e71f27ed092a85
Author: Eric Dumazet <[email protected]>
Date: Fri Sep 28 10:28:44 2018 -0700

    tcp/fq: move back to CLOCK_MONOTONIC

    In the recent TCP/EDT patch series, I switched TCP and sch_fq
    clocks from MONOTONIC to TAI, in order to meet the choice done
    earlier for sch_etf packet scheduler.

    But sure enough, this broke some setups were the TAI clock
    jumps forward (by almost 50 year...), as reported
    by Leonard Crestez.

    If we want to converge later, we'll probably need to add
    an skb field to differentiate the clock bases, or a socket option.

    In the meantime, an UDP application will need to use CLOCK_MONOTONIC
    base for its SCM_TXTIME timestamps if using fq packet scheduler.

    Fixes: 72b0094f9182 ("tcp: switch tcp_clock_ns() to CLOCK_TAI base")
    Fixes: 142537e41923 ("net_sched: sch_fq: switch to CLOCK_TAI")
    Fixes: fd2bca2aa789 ("tcp: switch internal pacing timer to CLOCK_TAI")
    Signed-off-by: Eric Dumazet <[email protected]>
    Reported-by: Leonard Crestez <[email protected]>
    Tested-by: Leonard Crestez <[email protected]>
    Signed-off-by: David S. Miller <[email protected]>

:040000 040000 06615f5ed4486fd0af77a8fb59775a9f2346aebc 7f883c7753cb3d5d881e0edbef2989f4e6db6a1f M include
:040000 040000 767c5e93fe5cfd609f90834d93978511c284ea01 cc47bd361516622c0b21602e188181fdfc6b2995 M net
----

Which could actually be the culprit - I'm having problems *with* UDP
traffic (DHCP) and I am using fq.

Let's hope so, since this was kinda boring:
ls /lib/modules |grep 4.19.0 |wc -l
27

Testing 4.20.1 and then 4.20.1 with the suspected patch reverted, will
report shortly!
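
A bisection like the one described above could be set up roughly as follows. This is a minimal sketch, not the exact sequence used in this thread: the tags are the ones named in the report, while the `run` helper, the `DRY_RUN` switch and the function names are assumptions added here for illustration. Each step still means building and booting the candidate kernel and checking by hand whether the VM gets a lease.

```shell
#!/bin/sh
# Hypothetical helper: print the command by default, run it only when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Start a bisection between a known-good tag ($1) and a known-bad tag ($2).
start_bisect() {
    run git bisect start
    run git bisect bad "$2"
    run git bisect good "$1"
}

# Record the outcome for the kernel that was just built and booted:
# "good" if the VM got a DHCP lease, "bad" if it did not.
mark_result() {
    run git bisect "$1"
}

# Preview the commands for the range named in the report.
start_bisect v4.19 v4.20
mark_result bad
```

Once git prints "is the first bad commit", `git bisect reset` returns the tree to where it started.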

2019-01-10 00:40:33

by Ian Kumlien

Subject: Re: [BUG] v4.20 - bridge not getting DHCP responses? (works in 4.19.13)

Confirmed, sending a new mail with a summary etc.

On Thu, Jan 10, 2019 at 1:16 AM Ian Kumlien <[email protected]> wrote:
>
> On Wed, Jan 9, 2019 at 12:17 AM Ian Kumlien <[email protected]> wrote:
> > On Wed, Jan 9, 2019, 00:09 Florian Fainelli <[email protected] wrote:
>
> [--8<---]
>
> >> > when looking at "git log v4.19...v4.20
> >> > drivers/net/ethernet/intel/ixgbe/" nothing else really stands out...
> >> > The machine is also running NAT for my home network and all of that
> >> > works just fine...
> >> >
> >> > I started with tcpdump, prooving that packets reached all the way
> >> > outside but replies never made it, reboorting
> >> > with 4.19.13 resulted in replies appearing in the tcpdump.
> >> >
> >> > I don't quite know where to look - and what can i do to test - i tried
> >> > disabling all offloading (due to the UDP
> >> > offloading changes) but nothing helped...
> >> >
> >> > Ideas? Patches? ;)
> >>
> >> Running a bisection would certainly help find the offending commit if
> >> that is something that you can do?
> >
> > I was hoping for a likely suspect but this was on my "Todo" for Friday night anyway... (And I already started testing with some patches reversed)
>
> So after lengthy git bisect sections, both from the latest stable i
> was using (not the best of ideas)
> and from 4.19.
>
> The latest stable yielded 72b0094f918294e6cb8cf5c3b4520d928fbb1a57 -
> which is incorrect...
>
> However, the proper bisect gave me this:
> fb420d5d91c1274d5966917725e71f27ed092a85 is the first bad commit
> commit fb420d5d91c1274d5966917725e71f27ed092a85
> Author: Eric Dumazet <[email protected]>
> Date: Fri Sep 28 10:28:44 2018 -0700
>
> tcp/fq: move back to CLOCK_MONOTONIC
>
> In the recent TCP/EDT patch series, I switched TCP and sch_fq
> clocks from MONOTONIC to TAI, in order to meet the choice done
> earlier for sch_etf packet scheduler.
>
> But sure enough, this broke some setups were the TAI clock
> jumps forward (by almost 50 year...), as reported
> by Leonard Crestez.
>
> If we want to converge later, we'll probably need to add
> an skb field to differentiate the clock bases, or a socket option.
>
> In the meantime, an UDP application will need to use CLOCK_MONOTONIC
> base for its SCM_TXTIME timestamps if using fq packet scheduler.
>
> Fixes: 72b0094f9182 ("tcp: switch tcp_clock_ns() to CLOCK_TAI base")
> Fixes: 142537e41923 ("net_sched: sch_fq: switch to CLOCK_TAI")
> Fixes: fd2bca2aa789 ("tcp: switch internal pacing timer to CLOCK_TAI")
> Signed-off-by: Eric Dumazet <[email protected]>
> Reported-by: Leonard Crestez <[email protected]>
> Tested-by: Leonard Crestez <[email protected]>
> Signed-off-by: David S. Miller <[email protected]>
>
> :040000 040000 06615f5ed4486fd0af77a8fb59775a9f2346aebc
> 7f883c7753cb3d5d881e0edbef2989f4e6db6a1f M include
> :040000 040000 767c5e93fe5cfd609f90834d93978511c284ea01
> cc47bd361516622c0b21602e188181fdfc6b2995 M net
> ----
>
> Which could actually be the culprit - I'm having problems *with* UDP
> traffic (DHCP) and I am using fq
>
> Lets hope it's so, since this was kinda boring:
> ls /lib/modules |grep 4.19.0 |wc -l
> 27
>
> Testing 4.20.1 and then 4.20.1 with the suspected patch reverted, will
> report shortly!
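
Since the suspect commit concerns the fq packet scheduler, one quick check is whether fq really is the qdisc on the interfaces involved. The sketch below is only an illustration: the interface name comes from the report, and the `run` helper, the `DRY_RUN` switch and the function names are assumptions added here.

```shell
#!/bin/sh
# Hypothetical helper: print the command by default, run it only when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Show the qdisc currently attached to one interface.
show_qdisc() {
    run tc qdisc show dev "$1"
}

# Show the qdisc that newly created interfaces get by default.
show_default_qdisc() {
    run sysctl -n net.core.default_qdisc
}

# Preview the commands for the bridged interface from the report.
show_qdisc eno2
show_default_qdisc
```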

2019-01-10 08:09:38

by Paolo Abeni

Subject: Re: [BUG] v4.20 - bridge not getting DHCP responses? (works in 4.19.13)

On Thu, 2019-01-10 at 01:38 +0100, Ian Kumlien wrote:
> Confirmed, sending a new mail with summary etc
>
> On Thu, Jan 10, 2019 at 1:16 AM Ian Kumlien <[email protected]> wrote:
> > On Wed, Jan 9, 2019 at 12:17 AM Ian Kumlien <[email protected]> wrote:
> > > On Wed, Jan 9, 2019, 00:09 Florian Fainelli <[email protected] wrote:
> >
> > [--8<---]
> >
> > > > > when looking at "git log v4.19...v4.20
> > > > > drivers/net/ethernet/intel/ixgbe/" nothing else really stands out...
> > > > > The machine is also running NAT for my home network and all of that
> > > > > works just fine...
> > > > >
> > > > > I started with tcpdump, prooving that packets reached all the way
> > > > > outside but replies never made it, reboorting
> > > > > with 4.19.13 resulted in replies appearing in the tcpdump.
> > > > >
> > > > > I don't quite know where to look - and what can i do to test - i tried
> > > > > disabling all offloading (due to the UDP
> > > > > offloading changes) but nothing helped...
> > > > >
> > > > > Ideas? Patches? ;)
> > > >
> > > > Running a bisection would certainly help find the offending commit if
> > > > that is something that you can do?
> > >
> > > I was hoping for a likely suspect but this was on my "Todo" for Friday night anyway... (And I already started testing with some patches reversed)
> >
> > So after lengthy git bisect sections, both from the latest stable i
> > was using (not the best of ideas)
> > and from 4.19.
> >
> > The latest stable yielded 72b0094f918294e6cb8cf5c3b4520d928fbb1a57 -
> > which is incorrect...
> >
> > However, the proper bisect gave me this:
> > fb420d5d91c1274d5966917725e71f27ed092a85 is the first bad commit
> > commit fb420d5d91c1274d5966917725e71f27ed092a85
> > Author: Eric Dumazet <[email protected]>
> > Date: Fri Sep 28 10:28:44 2018 -0700
> >
> > tcp/fq: move back to CLOCK_MONOTONIC

Thank you for bisecting.

This should be solved by:

https://marc.info/?l=linux-netdev&m=154696956604748&w=2

Can you test with the above applied?

Thanks,

Paolo