Date: Fri, 7 Apr 2017 15:06:00 +0200
From: Andrea Arcangeli
To: Chris Wilson, Martin Kepplinger, Thorsten Leemhuis,
 daniel.vetter@intel.com, Dave Airlie, intel-gfx@lists.freedesktop.org,
 linux-kernel@vger.kernel.org, dri-devel@lists.freedesktop.org
Subject: Re: [PATCH 2/5] i915: flush gem obj freeing workqueues to add
 accuracy to the i915 shrinker
Message-ID: <20170407130600.GA5035@redhat.com>
References: <87pogtplxr.fsf@intel.com>
 <20170406232347.988-1-aarcange@redhat.com>
 <20170406232347.988-3-aarcange@redhat.com>
 <20170407100211.GG10496@nuc-i3427.alporthouse.com>
In-Reply-To: <20170407100211.GG10496@nuc-i3427.alporthouse.com>

On Fri, Apr 07, 2017 at 11:02:11AM +0100, Chris Wilson wrote:
> On Fri, Apr 07, 2017 at 01:23:44AM +0200, Andrea Arcangeli wrote:
> > Waiting a RCU grace period only guarantees the work gets queued, but
> > until after the queued workqueue returns, there's no guarantee the
> > memory was actually freed. So flush the work to provide better
> > guarantees to the reclaim code in addition of waiting a RCU grace
> > period to pass.
>
> We are not allowed to call flush_work() from the shrinker, the workqueue
> doesn't have and can't have the right reclaim flags.

I figured the flush_work() had to be conditional on "unlock" being true
too in the i915 shrinker (not only synchronize_rcu_expedited()), and I
already fixed that bit, but I didn't think it would be a problem to
wait for the workqueue as long as reclaim didn't recurse on the
struct_mutex (it is a problem if unlock is false of course, as we would
be back to square one). I didn't get further hangs, and I assume I've
been running a couple of synchronize_rcu_expedited() and flush_work()
calls (I should add dynamic tracing to be sure).

Also note, I didn't get any lockdep warning when I reproduced the
workqueue hang in 4.11-rc5, so at least as far as lockdep is concerned
there's no problem calling synchronize_rcu_expedited(), and it couldn't
notice we were holding the struct_mutex while waiting for the new
workqueue to run.

Also note recursing on the lock (unlock false case) is something
nothing else does. I'm not sure if it's worth the risk, and if you
shouldn't just call mutex_trylock in the shrinker instead of
mutex_trylock_recursive. It was one thing to recurse on the lock
internally in the same context, but recursing through the whole of
reclaim is more dubious. You could start dropping objects and wiping
vmas and stuff in the middle of some kmalloc/alloc_pages that doesn't
expect it and then crash for other reasons. So this reclaim recursion
model of the shrinker is quite unique and quite challenging to prove
safe if you keep using mutex_trylock_recursive in
i915_gem_shrinker_scan.

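To make the "unlock" true/false distinction concrete, here is a minimal
userspace sketch. It is only a single-threaded model under stated
assumptions, not the i915 code: struct model_mutex, the model_* helpers
and the integer task ids are all made up for illustration, and the
owner check only mimics what a recursive trylock does conceptually.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a mutex that remembers its owner.
 * owner == 0 means unlocked; any other value is the owning task id. */
struct model_mutex {
    int owner;
};

enum model_trylock {
    MODEL_FAILED = 0,    /* lock held by somebody else */
    MODEL_SUCCESS = 1,   /* lock newly taken by this task */
    MODEL_RECURSIVE = 2  /* lock was already held by this task */
};

/* Plain trylock: fails whenever the lock is held, even by ourselves. */
static enum model_trylock model_trylock(struct model_mutex *m, int task)
{
    assert(task != 0);
    if (m->owner != 0)
        return MODEL_FAILED;
    m->owner = task;
    return MODEL_SUCCESS;
}

/* Recursive trylock: additionally reports "already mine". */
static enum model_trylock model_trylock_recursive(struct model_mutex *m,
                                                  int task)
{
    if (m->owner == task)
        return MODEL_RECURSIVE;
    return model_trylock(m, task);
}

static void model_unlock(struct model_mutex *m, int task)
{
    assert(m->owner == task);
    m->owner = 0;
}

/* Shrinker entry: returns true if shrinking may proceed. *unlock says
 * whether this call took the lock itself (and so must drop it later). */
static bool model_shrinker_lock(struct model_mutex *m, int task,
                                bool *unlock)
{
    switch (model_trylock_recursive(m, task)) {
    case MODEL_SUCCESS:
        *unlock = true;
        return true;
    case MODEL_RECURSIVE:
        *unlock = false;
        return true;
    default:
        return false;
    }
}

/* Assumption of this sketch: blocking on deferred frees is only
 * considered when the shrinker took the lock itself. */
static bool model_may_wait_for_frees(bool unlock)
{
    return unlock;
}
```

With a plain trylock the recursive case simply reports failure, so the
shrinker would never run underneath an allocation that already holds
the lock. That is the behaviour I'm suggesting is worth considering.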
Lock recursion in all other places could be dropped without runtime
downsides. The only place where mutex_trylock_recursive makes a design
difference and makes sense to use is i915_gem_shrinker_scan; the rest
are implementation issues, not fundamental shrinker design. It'd be
nice if those other mutex_trylock_recursive calls were all removed, so
that the only one left is in i915_gem_shrinker_scan and nowhere else
(or to drop it from i915_gem_shrinker_scan too).

mutex_trylock_recursive() should also be patched to use
READ_ONCE(__mutex_owner(lock)) because currently it breaks C. In fact,
in the whole kernel i915 and msm drm are the only two users of that
function.

Another thing is what value to return from i915_gem_shrinker_scan when
unlock is false and we can't possibly wait for the memory to be freed,
let alone for an RCU grace period. For various reasons I think it's
safer to return the current "free" even if we could also return "0" in
that case. There are different tradeoffs: returning "free" is less
likely to trigger an early OOM, as the VM thinks it's still making
progress, and in fact it will get more free memory shortly. Returning
SHRINK_STOP would also be an option; it would insist more on the other
slabs, so it would be more reliable at freeing memory in a timely way,
but it would be more at risk of early OOM. I think returning "free" is
the better tradeoff of the two, but I suggest adding a comment as it's
not exactly obvious which is better.

Thanks,
Andrea
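
P.S. The return value tradeoff can be written down as a small
userspace sketch. This is only a model under assumptions: the
model_* names and the policy enum are invented here, and
MODEL_SHRINK_STOP merely stands in for the kernel's SHRINK_STOP.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace stand-in for SHRINK_STOP; the name is hypothetical. */
#define MODEL_SHRINK_STOP (~0UL)

/* The two options discussed for the "unlock is false" case. */
enum model_policy {
    MODEL_REPORT_FREED, /* report the objects queued for freeing */
    MODEL_REPORT_STOP   /* tell the VM to stop scanning this shrinker */
};

/* What the scan callback would report back to the VM.
 * freed:  objects released (or queued for release) by this scan
 * unlock: true if the shrinker took the lock itself */
static unsigned long model_scan_result(unsigned long freed, bool unlock,
                                       enum model_policy policy)
{
    assert(freed != MODEL_SHRINK_STOP);
    if (unlock)
        return freed; /* took the lock itself: report the count */
    /* Recursive case: the memory is not back yet, so pick a policy. */
    return policy == MODEL_REPORT_FREED ? freed : MODEL_SHRINK_STOP;
}
```

MODEL_REPORT_FREED corresponds to what I'd keep: the VM sees progress
and is less likely to declare an early OOM. MODEL_REPORT_STOP pushes
reclaim towards the other slabs instead.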