Date: Tue, 24 Sep 2019 13:46:08 -0700
From: Matthew Wilcox
To: Johannes Weiner
Cc: Andrew Morton, Josef Bacik, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] mm: drop mmap_sem before calling balance_dirty_pages() in write fault
Message-ID: <20190924204608.GI1855@bombadil.infradead.org>
References: <20190924171518.26682-1-hannes@cmpxchg.org> <20190924174809.GH1855@bombadil.infradead.org> <20190924194238.GA29030@cmpxchg.org>
In-Reply-To: <20190924194238.GA29030@cmpxchg.org>

On Tue, Sep 24, 2019 at 03:42:38PM -0400, Johannes Weiner wrote:
> > I'm not a fan of moving file_update_time() to _before_ the
> > balance_dirty_pages call.
>
> Can you elaborate why? If the filesystem has a page_mkwrite op, it
> will have already called file_update_time() before this function is
> entered. If anything, this change makes the sequence more consistent.

Oh, that makes sense. I thought it should be updated after all the data
was written, but it probably doesn't make much difference.

> > Also, this is now the third place that needs
> > maybe_unlock_mmap_for_io, see
> > https://lore.kernel.org/linux-mm/20190917120852.x6x3aypwvh573kfa@box/
>
> Good idea, I moved the helper to internal.h and converted to it.
>
> I left the shmem site alone, though. It doesn't require the file
> pinning, so it shouldn't pointlessly bump the file refcount and
> suggest such a dependency - that could cost somebody later quite a bit
> of time trying to understand the code.

The problem for shmem is this:

	spin_unlock(&inode->i_lock);
	schedule();
	spin_lock(&inode->i_lock);
	finish_wait(shmem_falloc_waitq, &shmem_fault_wait);
	spin_unlock(&inode->i_lock);

While scheduled, the VMA can go away and the inode be reclaimed, making
this a use-after-free.
The initial suggestion was an increment on the inode refcount, but since
we already have a pattern which involves pinning the file, I thought that
was a better way to go.

> From: Johannes Weiner
> Date: Wed, 8 May 2019 13:53:38 -0700
> Subject: [PATCH v2] mm: drop mmap_sem before calling balance_dirty_pages()
>  in write fault
>
> One of our services is observing hanging ps/top/etc under heavy write
> IO, and the task states show this is an mmap_sem priority inversion:
>
> A write fault is holding the mmap_sem in read-mode and waiting for
> (heavily cgroup-limited) IO in balance_dirty_pages():
>
> [<0>] balance_dirty_pages+0x724/0x905
> [<0>] balance_dirty_pages_ratelimited+0x254/0x390
> [<0>] fault_dirty_shared_page.isra.96+0x4a/0x90
> [<0>] do_wp_page+0x33e/0x400
> [<0>] __handle_mm_fault+0x6f0/0xfa0
> [<0>] handle_mm_fault+0xe4/0x200
> [<0>] __do_page_fault+0x22b/0x4a0
> [<0>] page_fault+0x45/0x50
> [<0>] 0xffffffffffffffff
>
> Somebody tries to change the address space, contending for the
> mmap_sem in write-mode:
>
> [<0>] call_rwsem_down_write_failed_killable+0x13/0x20
> [<0>] do_mprotect_pkey+0xa8/0x330
> [<0>] SyS_mprotect+0xf/0x20
> [<0>] do_syscall_64+0x5b/0x100
> [<0>] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> [<0>] 0xffffffffffffffff
>
> The waiting writer locks out all subsequent readers to avoid lock
> starvation, and several threads can be seen hanging like this:
>
> [<0>] call_rwsem_down_read_failed+0x14/0x30
> [<0>] proc_pid_cmdline_read+0xa0/0x480
> [<0>] __vfs_read+0x23/0x140
> [<0>] vfs_read+0x87/0x130
> [<0>] SyS_read+0x42/0x90
> [<0>] do_syscall_64+0x5b/0x100
> [<0>] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> [<0>] 0xffffffffffffffff
>
> To fix this, do what we do for cache read faults already: drop the
> mmap_sem before calling into anything IO bound, in this case the
> balance_dirty_pages() function, and return VM_FAULT_RETRY.
>
> Signed-off-by: Johannes Weiner

Reviewed-by: Matthew Wilcox (Oracle)
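
For readers who want the shape of the fix without reading the diff, here
is a rough userspace sketch of the "drop the lock before the IO-bound
call and ask the caller to retry" pattern that the changelog describes.
It is not kernel code and not the patch itself: struct model_mm, the
model_* functions and the FAULT_* values are hypothetical stand-ins for
the mm, fault_dirty_shared_page()/balance_dirty_pages() and
VM_FAULT_RETRY, the lock is modelled as a plain reader count, and the
assumption that the retried fault no longer needs throttling is a
simplification.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins; none of these names exist in the kernel. */
enum fault_result { FAULT_DONE = 0, FAULT_RETRY = 1 };

struct model_mm {
	int readers;            /* read-side holders of the modelled mmap_sem */
	int throttled_locked;   /* throttle calls made with the lock held */
	int throttled_unlocked; /* throttle calls made after dropping it */
};

/* Models the IO-bound call (balance_dirty_pages() in the changelog). */
static void model_throttle(struct model_mm *mm)
{
	if (mm->readers > 0)
		mm->throttled_locked++;   /* the situation the patch avoids */
	else
		mm->throttled_unlocked++;
}

/*
 * Models the write-fault path. Entered with the lock held for read.
 * If throttling is needed and a retry is allowed, drop the lock first,
 * throttle, and return FAULT_RETRY with the lock already released.
 */
static enum fault_result model_write_fault(struct model_mm *mm,
					   bool must_throttle,
					   bool allow_retry)
{
	if (!must_throttle)
		return FAULT_DONE;

	if (allow_retry) {
		assert(mm->readers > 0);
		mm->readers--;            /* drop the lock before blocking */
		model_throttle(mm);
		return FAULT_RETRY;
	}

	model_throttle(mm);               /* fallback: block under the lock */
	return FAULT_DONE;
}

/* Models the caller: retake the lock and retry once. Returns attempts. */
static int model_handle_fault(struct model_mm *mm, bool must_throttle)
{
	bool allow_retry = true;
	int attempts = 0;

	for (;;) {
		mm->readers++;            /* take the lock for read */
		attempts++;
		if (model_write_fault(mm, must_throttle, allow_retry) ==
		    FAULT_RETRY) {
			/* Lock was dropped by the callee; retry only once. */
			allow_retry = false;
			/* Assumption: throttling is done, retry is fast. */
			must_throttle = false;
			continue;
		}
		mm->readers--;            /* normal unlock */
		return attempts;
	}
}
```

With a zero-initialised struct model_mm, model_handle_fault(&mm, true)
takes two attempts and the only throttle call is counted in
throttled_unlocked, which is the point of the change: a writer queued on
the lock is not stuck behind a reader that is sleeping on IO.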