Date: Sun, 30 Apr 2017 11:43:48 +0200
From: Andrea Arcangeli
To: "J. R. Okajima"
Cc: joonas.lahtinen@linux.intel.com, chris@chris-wilson.co.uk, daniel.vetter@ffwll.ch, jani.nikula@intel.com, linux-kernel@vger.kernel.org
Subject: Re: Q. drm/i915 shrinker, synchronize_rcu_expedited() from handlers
Message-ID: <20170430094348.GA5970@redhat.com>
References: <7743.1493532478@jrobl>
In-Reply-To: <7743.1493532478@jrobl>

On Sun, Apr 30, 2017 at 03:07:58PM +0900, J. R. Okajima wrote:
> Hello,
>
> Since v4.11-rc7 I can see the workqueue stops on my development/test system.
> Git-bisecting tells me the suspicious commit is
> c053b5a 2017-04-11 drm/i915: Don't call synchronize_rcu_expedited under struct_mutex
>
> I am not sure whether this is the real cause or not of my problem, but I
> have a question.
> By the commit, the shrinker handlers ->scan_objects() and
> ->count_objects() both call synchronize_rcu_expedited()
> unconditionally. Is it a legal RCU behaviour?

It's actually not legal, because the workqueue that RCU uses is not
reclaim safe. Lockdep is simply unable to notice it, because RCU doesn't
use flush_workqueue() to wait for completion; it waits for a wakeup from
the workqueue instead (and lockdep can't possibly notice that).

To fix it and allow both RCU and flush_workqueue() (both are needed to
wait for the memory to be freed), i915 and RCU must each start using
their own private workqueue instead of sharing the system-wide one, and
then set WQ_MEM_RECLAIM on their private workqueue (and of course they
should only call it when they're not recursing on the struct mutex).

The alternative is to drop it all and behave as in the mutex recursion
case. However, reclaim cannot possibly throttle on the memory freeing
externally unless RCU is changed to stop using the system workqueue, so
the idea that the throttling can be offloaded to the reclaim code
doesn't move the needle in terms of being able to throttle. Perhaps no
throttling is necessary at all and we can just go inaccurate and it'll
work fine, though.

> I know dev->struct_mutex is unlocked now, but before the commit, these
> two handlers were not calling synchronize_rcu_expedited().

Yes, I already reported this; my original fix was way more efficient
(and also safer, considering the above) than what landed upstream. My
feedback was ignored, though.

https://lists.freedesktop.org/archives/intel-gfx/2017-April/125414.html

Thanks,
Andrea