Date: Tue, 3 Oct 2023 15:18:53 +0200
From: Jan Kara
To: Hugh Dickins
Cc: Andrew Morton, Christian Brauner, Carlos Maiolino, Chuck Lever,
        Jan Kara, Matthew Wilcox, Johannes
        Weiner, Axel Rasmussen, linux-fsdevel@vger.kernel.org,
        linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH 3/8] shmem: factor shmem_falloc_wait() out of shmem_fault()
Message-ID: <20231003131853.ramdlfw5s6ne4iqx@quack3>
References: <6fe379a4-6176-9225-9263-fe60d2633c0@google.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <6fe379a4-6176-9225-9263-fe60d2633c0@google.com>

On Fri 29-09-23 20:27:53, Hugh Dickins wrote:
> That Trinity livelock shmem_falloc avoidance block is unlikely, and a
> distraction from the proper business of shmem_fault(): separate it out.
> (This used to help compilers save stack on the fault path too, but both
> gcc and clang nowadays seem to make better choices anyway.)
>
> Signed-off-by: Hugh Dickins

Looks good. Feel free to add:

Reviewed-by: Jan Kara

Looking at the code I'm just wondering whether the livelock with
shmem_undo_range() couldn't be avoided more easily by making
shmem_undo_range() always advance the index past an evicted page, thus
guaranteeing forward progress - something like the (untested) sketch at
the end of this mail. Since forward progress within find_get_entries()
is guaranteed these days, that should be enough.

                                                                Honza

> ---
>  mm/shmem.c | 126 +++++++++++++++++++++++++++++------------------------
>  1 file changed, 69 insertions(+), 57 deletions(-)
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 824eb55671d2..5501a5bc8d8c 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2148,87 +2148,99 @@ int shmem_get_folio(struct inode *inode, pgoff_t index, struct folio **foliop,
>   * entry unconditionally - even if something else had already woken the
>   * target.
>   */
> -static int synchronous_wake_function(wait_queue_entry_t *wait, unsigned mode, int sync, void *key)
> +static int synchronous_wake_function(wait_queue_entry_t *wait,
> +                        unsigned int mode, int sync, void *key)
>  {
>          int ret = default_wake_function(wait, mode, sync, key);
>          list_del_init(&wait->entry);
>          return ret;
>  }
>
> +/*
> + * Trinity finds that probing a hole which tmpfs is punching can
> + * prevent the hole-punch from ever completing: which in turn
> + * locks writers out with its hold on i_rwsem. So refrain from
> + * faulting pages into the hole while it's being punched. Although
> + * shmem_undo_range() does remove the additions, it may be unable to
> + * keep up, as each new page needs its own unmap_mapping_range() call,
> + * and the i_mmap tree grows ever slower to scan if new vmas are added.
> + *
> + * It does not matter if we sometimes reach this check just before the
> + * hole-punch begins, so that one fault then races with the punch:
> + * we just need to make racing faults a rare case.
> + *
> + * The implementation below would be much simpler if we just used a
> + * standard mutex or completion: but we cannot take i_rwsem in fault,
> + * and bloating every shmem inode for this unlikely case would be sad.
> + */
> +static vm_fault_t shmem_falloc_wait(struct vm_fault *vmf, struct inode *inode)
> +{
> +        struct shmem_falloc *shmem_falloc;
> +        struct file *fpin = NULL;
> +        vm_fault_t ret = 0;
> +
> +        spin_lock(&inode->i_lock);
> +        shmem_falloc = inode->i_private;
> +        if (shmem_falloc &&
> +            shmem_falloc->waitq &&
> +            vmf->pgoff >= shmem_falloc->start &&
> +            vmf->pgoff < shmem_falloc->next) {
> +                wait_queue_head_t *shmem_falloc_waitq;
> +                DEFINE_WAIT_FUNC(shmem_fault_wait, synchronous_wake_function);
> +
> +                ret = VM_FAULT_NOPAGE;
> +                fpin = maybe_unlock_mmap_for_io(vmf, NULL);
> +                shmem_falloc_waitq = shmem_falloc->waitq;
> +                prepare_to_wait(shmem_falloc_waitq, &shmem_fault_wait,
> +                                TASK_UNINTERRUPTIBLE);
> +                spin_unlock(&inode->i_lock);
> +                schedule();
> +
> +                /*
> +                 * shmem_falloc_waitq points into the shmem_fallocate()
> +                 * stack of the hole-punching task: shmem_falloc_waitq
> +                 * is usually invalid by the time we reach here, but
> +                 * finish_wait() does not dereference it in that case;
> +                 * though i_lock needed lest racing with wake_up_all().
> +                 */
> +                spin_lock(&inode->i_lock);
> +                finish_wait(shmem_falloc_waitq, &shmem_fault_wait);
> +        }
> +        spin_unlock(&inode->i_lock);
> +        if (fpin) {
> +                fput(fpin);
> +                ret = VM_FAULT_RETRY;
> +        }
> +        return ret;
> +}
> +
>  static vm_fault_t shmem_fault(struct vm_fault *vmf)
>  {
> -        struct vm_area_struct *vma = vmf->vma;
> -        struct inode *inode = file_inode(vma->vm_file);
> +        struct inode *inode = file_inode(vmf->vma->vm_file);
>          gfp_t gfp = mapping_gfp_mask(inode->i_mapping);
>          struct folio *folio = NULL;
> +        vm_fault_t ret = 0;
>          int err;
> -        vm_fault_t ret = VM_FAULT_LOCKED;
>
>          /*
>           * Trinity finds that probing a hole which tmpfs is punching can
> -         * prevent the hole-punch from ever completing: which in turn
> -         * locks writers out with its hold on i_rwsem. So refrain from
> -         * faulting pages into the hole while it's being punched. Although
> -         * shmem_undo_range() does remove the additions, it may be unable to
> -         * keep up, as each new page needs its own unmap_mapping_range() call,
> -         * and the i_mmap tree grows ever slower to scan if new vmas are added.
> -         *
> -         * It does not matter if we sometimes reach this check just before the
> -         * hole-punch begins, so that one fault then races with the punch:
> -         * we just need to make racing faults a rare case.
> -         *
> -         * The implementation below would be much simpler if we just used a
> -         * standard mutex or completion: but we cannot take i_rwsem in fault,
> -         * and bloating every shmem inode for this unlikely case would be sad.
> +         * prevent the hole-punch from ever completing: noted in i_private.
>           */
>          if (unlikely(inode->i_private)) {
> -                struct shmem_falloc *shmem_falloc;
> -
> -                spin_lock(&inode->i_lock);
> -                shmem_falloc = inode->i_private;
> -                if (shmem_falloc &&
> -                    shmem_falloc->waitq &&
> -                    vmf->pgoff >= shmem_falloc->start &&
> -                    vmf->pgoff < shmem_falloc->next) {
> -                        struct file *fpin;
> -                        wait_queue_head_t *shmem_falloc_waitq;
> -                        DEFINE_WAIT_FUNC(shmem_fault_wait, synchronous_wake_function);
> -
> -                        ret = VM_FAULT_NOPAGE;
> -                        fpin = maybe_unlock_mmap_for_io(vmf, NULL);
> -                        if (fpin)
> -                                ret = VM_FAULT_RETRY;
> -
> -                        shmem_falloc_waitq = shmem_falloc->waitq;
> -                        prepare_to_wait(shmem_falloc_waitq, &shmem_fault_wait,
> -                                        TASK_UNINTERRUPTIBLE);
> -                        spin_unlock(&inode->i_lock);
> -                        schedule();
> -
> -                        /*
> -                         * shmem_falloc_waitq points into the shmem_fallocate()
> -                         * stack of the hole-punching task: shmem_falloc_waitq
> -                         * is usually invalid by the time we reach here, but
> -                         * finish_wait() does not dereference it in that case;
> -                         * though i_lock needed lest racing with wake_up_all().
> -                         */
> -                        spin_lock(&inode->i_lock);
> -                        finish_wait(shmem_falloc_waitq, &shmem_fault_wait);
> -                        spin_unlock(&inode->i_lock);
> -
> -                        if (fpin)
> -                                fput(fpin);
> +                ret = shmem_falloc_wait(vmf, inode);
> +                if (ret)
>                          return ret;
> -                }
> -                spin_unlock(&inode->i_lock);
>          }
>
> +        WARN_ON_ONCE(vmf->page != NULL);
>          err = shmem_get_folio_gfp(inode, vmf->pgoff, &folio, SGP_CACHE,
>                                    gfp, vmf, &ret);
>          if (err)
>                  return vmf_error(err);
> -        if (folio)
> +        if (folio) {
>                  vmf->page = folio_file_page(folio, vmf->pgoff);
> +                ret |= VM_FAULT_LOCKED;
> +        }
>          return ret;
>  }
>
> --
> 2.35.3
>

-- 
Jan Kara
SUSE Labs, CR
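
P.S. To make the shmem_undo_range() suggestion above concrete, here is a
rough, completely untested sketch. The loop below is only an approximation
of the eviction pass in shmem_undo_range(), not a copy of the real code;
"mapping", "index" and "fbatch" stand in for locals that function already
has, and the partial-page / unfalloc handling is ignored:

        struct folio *folio;
        pgoff_t next;
        unsigned int i;

        for (i = 0; i < folio_batch_count(&fbatch); i++) {
                folio = fbatch.folios[i];
                if (xa_is_value(folio))
                        continue;       /* swap entries handled separately */
                /*
                 * Note where this folio ends before evicting it, and never
                 * scan below that offset again: even if a racing fault
                 * refills the hole behind us, the index only ever moves
                 * forward, so the hole-punch cannot livelock.
                 */
                next = folio_next_index(folio);
                folio_lock(folio);
                truncate_inode_folio(mapping, folio);
                folio_unlock(folio);
                if (next > index)
                        index = next;
        }

Whether that interacts sanely with the rest of the function I have not
checked, it is just to illustrate the forward-progress idea.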