Date: Wed, 13 Sep 2017 20:38:09 +0200
From: "Luis R. Rodriguez"
To: Linus Torvalds, "Rafael J. Wysocki", Marcel Holtmann
Cc: Greg Kroah-Hartman, "Luis R. Rodriguez", Linux Kernel Mailing List, stable, Gabriel C, Arend Van Spriel
Subject: Re: [PATCH 4.13 20/27] Revert "firmware: add sanity check on shutdown/suspend"
Message-ID: <20170913183809.GN16216@wotan.suse.de>
References: <20170912165308.904472972@linuxfoundation.org> <20170912165310.407586301@linuxfoundation.org> <20170912172008.GF16216@wotan.suse.de> <20170913004758.GA21439@kroah.com>

Rafael, a question for you below.

On Tue, Sep 12, 2017 at 09:11:46PM -0700, Linus Torvalds wrote:
> On Tue, Sep 12, 2017 at 5:47 PM, Greg Kroah-Hartman wrote:
> >
> >> If reverting this commit please consider reverting also commit
> >> 06a45a93e7d34a ("firmware: move umh try locks into the umh code").
> >
> > Ok, I can queue that revert up in my tree and will send it to Linus once
> > 4.14-rc1 is out.
>
> I want to see a _reason_ for that revert. The two have absolutely
> nothing to do with each other. Reverting one is *not* a reason for
> reverting the other.

There is a dependency between the two commits, though the reason is not
obvious. I'll explain below.

> Commit 06a45a93e7d34a seems to be a cleanup.
> The arguments in
> 06a45a93e7d3 ("firmware: move umh try locks into the umh code") seem
> valid, and there's no real reason to worry about that FW_OPT_NOWAIT
> etc for the direct-from-filesystem loading. That's simply not
> sensible.

Indeed! That stupid UMH lock *seems* wrong on the direct filesystem path!
Hence these changes. The devil is in the details though.

However, that UMH lock carried an implicit suspend guard, so the "cleanup"
actually contains a functional change. The commit which was reverted provided
the safeguard in generic form, in case we had already become dependent on the
suspend guard.

The UMH lock on the direct FS path thus added an implicit "arbitrary rule", as
you put it, to the firmware API. Commit 81f95076281fd clarified that rule and
made it explicit, just to be safe.

Carrying *only* 06a45a93e7d34a was a long term goal. Eventually I intended to
remove the code which you removed, but I would have preferred to wait at
least a release. With only 06a45a93e7d34a we lose the suspend guard on the
direct FS lookup path immediately.

If we're happy to live with that right away then great; this may be one of
those inherited *random arbitrary rules* worth removing, and removing that
code was a long term goal. However I just want folks to be very well aware of
the original goal: 06a45a93e7d34a without 81f95076281fd is not a cleanup,
there is a hidden functional change there.

So to be clear: I'm happy with 81f95076281fd being reverted, as that was my
long term goal. However I would not be doing my job if I did not explain that
06a45a93e7d34a *has* a functional change and that 81f95076281fd was an
attempt at a safe baby step forward.

> That whole FW_OPT_NOWAIT thing only makes sense for the actual
> user-mode helper, so that commit actually seems to move the testing
> and the logic to a place that really does make sense.

Indeed!

> So why would we revert a commit that makes SENSE?
Because although the UMH lock should only be used for code which needs the
UMH helpers, the UMH lock had also added a suspend guard which we grew to
rely on. Removing it could mean allowing races on resume, and there was no
clear way of detecting this generically.

The UMH lock was *also* never used on *any other* UMH code, so from the UMH
perspective it also raises the question of whether the other UMH code is
error prone, as it lacks the UMH locking safeguards.

> For example, what might be sensible is to add a warning that tries to
> verify that people do *not* do firmware loads from the ->resume()
> callbacks.

The firmware API *allows* resume() callbacks to use the firmware API in ways
which *should* not block much, if at all. However they must first have called
the firmware API at least once so that the device gets a devres entry with a
string associated; this is the firmware cache. Then upon suspend the firmware
API requests each of these firmware files. If a resume() callback then calls
the firmware API for any of these files it works without issue, as the
firmware is already loaded in the cache: a simple pointer reference.
Processing and consuming the firmware in the driver though *can* take a while
and block longer.

Because of the firmware cache implementation we needed a whitelist, so the
check which I implemented in 81f95076281fd *only* complains if we've passed
the firmware cache check, and we then *know* the incoming call is new.

> But then it should literally check *that*: it could do
> something like
>
>      WARN_ON_ONCE(current == resume_thread, "Firmware loading
> called synchronously during resume");

Sure, that would be *much* smaller code. Rafael, do you have helpers for this
sort of thing, or are you OK with them being added?

> or whatever, exactly because it's obviously *not* ok to block the same
> process that is going to resume all the other devices that might be
> *needed* for the firmware loading.
Your interest in this seems to be the blocking implications of synchronous
calls. That's certainly a new concern I had not heard raised yet and it is
worth addressing. Are you not concerned in any way though about the
filesystem being ready?

> But on the other hand, if somebody then does an independent thread to
> resume their firmware, we could just block to wait for it - it
> wouldn't be the same kind of chicken-and-egg issue with IO at resume
> time.
>
> So instead of "arbitrary rules", there should be things that actually
> make sense.

The arbitrary rule here was hidden underneath code Rafael added years ago.
I'm pretty sure he did not intend to add the check on the direct FS path, as
the direct FS path came later as the code evolved, *but* we kept the lock on
the direct FS path, so we have no option but to review its removal carefully.

> The commit that Luis now argues for _also_ reverting makes a lot of
> sense to me to keep. I'm not seeing why that should be reverted, when
> the _only_ reason seems to be some spiteful "well, if you reverted one
> commit, you should randomly revert another one too".

I hope to have clarified that there is no spite here. I'm simply being very
careful to avoid a regression from a real, hidden functional change.

  Luis
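
For readers following the thread, the whitelist behaviour described above
(consult the firmware cache first, and complain only when a request that is
not in the cache arrives during suspend/resume) can be modelled roughly as
below. This is a minimal userspace sketch under stated assumptions, not the
code from 81f95076281fd: every name in it (fw_cache, fw_cache_add,
fw_cache_has, fw_request_check, the fw_verdict values) is hypothetical and
none of them is a kernel API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define FW_CACHE_MAX 16

/* Hypothetical userspace model; none of these names exist in the kernel. */
struct fw_cache {
    const char *names[FW_CACHE_MAX]; /* firmware names requested earlier */
    size_t count;
    bool in_transition;              /* true while suspending or resuming */
};

enum fw_verdict {
    FW_OK_CACHED,                  /* name is known to the cache */
    FW_OK_NORMAL,                  /* new name, but no suspend/resume in progress */
    FW_WARN_NEW_DURING_TRANSITION  /* new name while suspending or resuming */
};

/* Remember a firmware name the first time a driver asks for it. */
static bool fw_cache_add(struct fw_cache *c, const char *name)
{
    assert(c != NULL && name != NULL);
    if (c->count >= FW_CACHE_MAX)
        return false;
    c->names[c->count++] = name;
    return true;
}

/* Linear lookup of a name in the cache list. */
static bool fw_cache_has(const struct fw_cache *c, const char *name)
{
    for (size_t i = 0; i < c->count; i++)
        if (strcmp(c->names[i], name) == 0)
            return true;
    return false;
}

/*
 * Whitelist check: cached names always pass; only a name that is not in
 * the cache and is requested during a transition gets flagged.
 */
static enum fw_verdict fw_request_check(const struct fw_cache *c,
                                        const char *name)
{
    if (fw_cache_has(c, name))
        return FW_OK_CACHED;
    if (c->in_transition)
        return FW_WARN_NEW_DURING_TRANSITION;
    return FW_OK_NORMAL;
}
```

In this model a driver that called fw_cache_add() for a file before suspend
gets FW_OK_CACHED from fw_request_check() even while in_transition is set,
whereas a file never seen before yields FW_WARN_NEW_DURING_TRANSITION. The
single-condition check on the resuming thread suggested in the quoted text
would be a smaller alternative to keeping such a list.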