Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751786AbdLARg7 (ORCPT ); Fri, 1 Dec 2017 12:36:59 -0500 Received: from mail-wr0-f194.google.com ([209.85.128.194]:40510 "EHLO mail-wr0-f194.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750995AbdLARg5 (ORCPT ); Fri, 1 Dec 2017 12:36:57 -0500 X-Google-Smtp-Source: AGs4zMbD2u0bIoFEMcZnZLy8LgYsFatb2N41UZNKVuiwKrvkt1o18+RNuyhI5ZkQynjCDLIiB2M/Ng== Subject: Re: [EXT] Re: [PATCH net] net: phylink: fix link state on phy-connect To: Russell King - ARM Linux , Grygorii Strashko Cc: Yan Markman , Antoine Tenart , "andrew@lunn.ch" , "davem@davemloft.net" , "gregory.clement@free-electrons.com" , "thomas.petazzoni@free-electrons.com" , "miquel.raynal@free-electrons.com" , Nadav Haklai , "mw@semihalf.com" , Stefan Chulski , "netdev@vger.kernel.org" , "linux-kernel@vger.kernel.org" References: <20171128155317.GA7974@flint.armlinux.org.uk> <20171128155611.GA8358@flint.armlinux.org.uk> <20448667430e434aad5bb8cd1b082611@IL-EXCH01.marvell.com> <20171129195911.GG8356@n2100.armlinux.org.uk> <21ec97be76d54a6c8a80fd5b56d35678@IL-EXCH01.marvell.com> <20171129212032.GI8356@n2100.armlinux.org.uk> <20171130101018.GA10595@n2100.armlinux.org.uk> <20171130132830.GA5529@n2100.armlinux.org.uk> <51496048-7e11-e0aa-7c47-3c04eee70e3a@ti.com> <20171201172440.GK10595@n2100.armlinux.org.uk> From: Florian Fainelli Message-ID: <221420a4-9f56-373e-f5cd-0d2fcb02e5fb@gmail.com> Date: Fri, 1 Dec 2017 09:36:42 -0800 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.2.1 MIME-Version: 1.0 In-Reply-To: <20171201172440.GK10595@n2100.armlinux.org.uk> Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 4052 Lines: 101 On 12/01/2017 09:24 AM, Russell King - ARM Linux wrote: > On Fri, Dec 01, 2017 at 11:07:22AM -0600, Grygorii Strashko wrote: >> Hi Russell, >> >> On 11/30/2017 07:28 AM, Russell King - ARM Linux wrote: >>> On Thu, Nov 30, 2017 at 10:10:18AM +0000, Russell King - ARM Linux wrote: >>>> On Thu, Nov 30, 2017 at 08:51:21AM +0000, Yan Markman wrote: >>>>> The phylink_stop is called before phylink_disconnect_phy >>>>> You could see in mvpp2.c: >>>>> >>>>> mvpp2_stop_dev() { >>>>> phylink_stop(port->phylink); >>>>> } >>>>> >>>>> mvpp2_stop() { >>>>> mvpp2_stop_dev(port); >>>>> phylink_disconnect_phy(port->phylink); >>>>> } >>>>> >>>>> .ndo_stop = mvpp2_stop, >>>> >>>> Sorry, I don't have this in mvpp2.c, so I have no visibility of what >>>> you're working with. >>>> >>>> What you have above looks correct, and I see no reason why the p21 >>>> patch would not have resolved your issue. The p21 patch ensures >>>> that phylink_resolve() gets called and completes before phylink_stop() >>>> returns. In that case, phylink_resolve() will call the mac_link_down() >>>> method if the link is not already down. It will also print the "Link >>>> is Down" message. >>>> >>>> Florian has already tested this patch after encountering a similar >>>> issue, and has reported that it solves the problem for him. I've also >>>> tested it with mvneta, and the original mvpp2x driver on Macchiatobin. >>>> >>>> Maybe there's something different about mvpp2, but as I have no >>>> visibility of that driver and the modifications therein, I can't >>>> comment further other than stating that it works for three different >>>> implementations. >>>> >>>> Maybe you could try and work out what's going on with the p21 patch >>>> in your case? >>> >>> I think I now realise what's probably going on. >>> >>> If you call netif_carrier_off() before phylink_stop(), then phylink will >>> believe that the link is already down, and so it won't bother calling >>> mac_link_down() - it will believe that the link is already down. >>> >>> I'll update the documentation for phylink_stop() to spell out this >>> aspect. >>> >> >> There are pretty high number of net drivers which do call >> netif_carrier_off(dev); >> before >> phy_stop(dev->phydev); >> in .ndo_stop() callback. >> >> As per you comment this seems to be incorrect, so should such calls be >> removed? > > Well, I think the question that needs to be asked is this: > > Is calling netif_carrier_off() before phy_stop() safe? > > Well, reading the phylib code, this is the answer I've come to: > > Between phy_start() and phy_stop(), phylib is free to manage the > carrier state itself through the phylib state machine. > > This means if you call netif_carrier_off() prior to phy_stop(), > there is nothing preventing the phylib state machine from running, > and a co-incident poll of the PHY could notice that the link has > come up, and re-enable the carrier while your ndo_stop() method > is still running. > > So, my conclusion is that this practice is provably racy, though > it's probably not that easy to trigger the race (which is probably > why no one has reported the problem.) > > Given that it's racy, it's not something that I think phylink should > care about, and should "softly" discourage it. So, I'm happy with > what phylink is doing here, and I suggest fixing the drivers for > this race. > > In any case, it should result in less code in the drivers - since > the work you need to do when the link goes down is a subset of the > work you need to do when the network interface is taken down. > While I agree with all of what written before, in practice, calling netif_carrier_off() when using PHYLIB can cause inconsistent carrier states at most, but it would not be messing the state machine itself because PHYLIB does not make uses of netif_carrier_ok() to make any decisions as whether the link has dropped or not, it bases its information solely on phydev->link. This is not true with PHYLINK, which is why the problem was observed here. -- Florian