Date: Thu, 15 Aug 2019 08:58:29 +0200
From: Daniel Vetter
To: Jason Gunthorpe
Cc: Daniel Vetter, LKML, linux-mm@kvack.org, DRI Development,
    Intel Graphics Development, Peter Zijlstra, Ingo Molnar, Andrew Morton,
    Michal Hocko, David Rientjes, Christian König, Jérôme Glisse,
    Masahiro Yamada, Wei Wang, Andy Shevchenko, Thomas Gleixner, Jann Horn,
    Feng Tang, Kees Cook, Randy Dunlap, Daniel Vetter
Subject: Re: [PATCH 2/5] kernel.h: Add non_block_start/end()
Message-ID: <20190815065829.GA7444@phenom.ffwll.local>
References: <20190814202027.18735-1-daniel.vetter@ffwll.ch>
    <20190814202027.18735-3-daniel.vetter@ffwll.ch>
    <20190814235805.GB11200@ziepe.ca>
In-Reply-To: <20190814235805.GB11200@ziepe.ca>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Aug 14, 2019 at 08:58:05PM -0300, Jason Gunthorpe wrote:
> On Wed, Aug 14, 2019 at 10:20:24PM +0200, Daniel Vetter wrote:
> > In some special cases we must not block, but there's not a
> > spinlock, preempt-off, irqs-off or similar critical section already
> > that arms the might_sleep() debug checks. Add a non_block_start/end()
> > pair to annotate these.
> >
> > This will be used in the oom paths of mmu-notifiers, where blocking is
> > not allowed to make sure there's forward progress. Quoting Michal:
> >
> > "The notifier is called from quite a restricted context - oom_reaper -
> > which shouldn't depend on any locks or sleepable conditionals. The code
> > should be swift as well but we mostly do care about it to make a forward
> > progress. Checking for sleepable context is the best thing we could come
> > up with that would describe these demands at least partially."
>
> But this describes fs_reclaim_acquire() - is there some reason we are
> conflating fs_reclaim with non-sleeping?

No idea why you tie this into fs_reclaim. We can definitely sleep in there,
and for e.g. kswapd (which also wraps everything in fs_reclaim) we're even
supposed to, I thought. That's to make sure we can get at the last bit of
memory by flushing all the queues and waiting for everything to be cleaned
out.

> ie is there some fundamental difference between the block stack
> sleeping during reclaim while it waits for a driver to write out a
> page and a GPU driver sleeping during OOM while it waits for its HW
> to fence DMA on a page?
>
> Fundamentally we have invalidate_range_start() vs invalidate_range()
> as the start() version is able to sleep. If drivers can do their work
> without sleeping then they should be using invalidate_range() instead.
>
> Thus, it doesn't seem to make any sense to ask a driver that requires a
> sleeping API not to sleep.
>
> AFAICT what is really going on here is that drivers care about only a
> subset of the VA space, and we want to query the driver if it cares
> about the range proposed to be OOM'd, so we can OOM ranges that
> do not have SPTEs.
>
> ie if you look pretty much all drivers do exactly as
> userptr_mn_invalidate_range_start() does, and bail once they detect
> the VA range is of interest.
>
> So, I'm working on a patch to lift the interval tree into the notifier
> core and then do the VA test OOM needs without bothering the
> driver. Drivers can retain the blocking API they require and OOM can
> work on VA's that don't have SPTEs.

Hm I figured the point of hmm_mirror is to have that interval tree for
everyone (among other things). But yeah, lifting it to mmu_notifier sounds
like a clean solution for this. I really don't have much clue about why we
even have this special mode for the oom case, though. I'm just trying to
increase the odds that drivers hold up their end of the bargain.

> This approach also solves the critical bug in this path:
>   https://lore.kernel.org/linux-mm/20190807191627.GA3008@ziepe.ca/
>
> And solves a bunch of other bugs in the drivers.
>
> > Peter also asked whether we want to catch spinlocks on top, but Michal
> > said those are less of a problem because spinlocks can't have an
> > indirect dependency upon the page allocator and hence close the loop
> > with the oom reaper.
>
> Again, this entirely sounds like fs_reclaim - isn't that exactly what
> it is for?
>
> I have had on my list a second and very related possible bug. I ran
> into commit 35cfa2b0b491 ("mm/mmu_notifier: allocate mmu_notifier in
> advance") which says that mapping->i_mmap_mutex is under fs_reclaim().
>
> We do hold i_mmap_rwsem while calling invalidate_range_start():
>
>  unmap_mapping_pages
>    i_mmap_lock_write(mapping); // ie i_mmap_rwsem
>    unmap_mapping_range_tree
>      unmap_mapping_range_vma
>        zap_page_range_single
>          mmu_notifier_invalidate_range_start
>
> So, if it is still true that i_mmap_rwsem is under fs_reclaim then
> invalidate_range_start is *always* under fs_reclaim anyhow! (this I do
> not know)
>
> Thus we should use lockdep to force this and fix all the drivers.
>
> .. and if we force fs_reclaim always, do we care about blockable
> anymore??

Still not sure why fs_reclaim matters here. One of the later patches steals
the fs_reclaim idea for mmu_notifiers, but that ties together completely
different paths.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
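
For readers following the thread, here is a rough userspace sketch of the
annotation idea discussed above: a counter that non_block_start() and
non_block_end() adjust, and a might_sleep()-style check that complains while
the counter is non-zero. This is not the kernel patch. The counter, the
violation tally and the might_sleep_check() helper are assumptions made for
illustration, and a real implementation would track the state per task rather
than in a global.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical state; a real kernel would keep this per task. */
static int non_block_count;   /* nesting depth of non-blocking sections */
static int sleep_violations;  /* how many times the check has fired */

/* Enter a section in which blocking is not allowed. Sections may nest. */
static void non_block_start(void)
{
	non_block_count++;
}

/* Leave a non-blocking section; unbalanced calls are a caller bug. */
static void non_block_end(void)
{
	assert(non_block_count > 0);
	non_block_count--;
}

/*
 * Stand-in for the might_sleep() debug check: returns true if sleeping
 * is allowed here, otherwise records and reports a violation.
 */
static bool might_sleep_check(const char *where)
{
	if (non_block_count > 0) {
		sleep_violations++;
		fprintf(stderr, "sleeping call inside non-blocking section: %s\n",
			where);
		return false;
	}
	return true;
}
```

A caller would wrap the notifier invocation on the oom path in
non_block_start()/non_block_end(), so that any callback reaching a sleeping
primitive trips the check even though no spinlock or preempt-off section is
held.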