Date: Wed, 19 May 2021 11:08:20 -0400
From: "Theodore Y. Ts'o"
To: Wang Jianchao
Cc: Andreas Dilger, linux-ext4@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] ext4: get discard out of jbd2 commit kthread
References: <53146e54-af36-0c32-cad8-433460461237@gmail.com> <1d43599f-fed1-b37e-a411-2b0f31583991@gmail.com>
In-Reply-To: <1d43599f-fed1-b37e-a411-2b0f31583991@gmail.com>

On Wed, May 19, 2021 at 09:27:56AM +0800, Wang Jianchao wrote:
>
> We're running ext4 with discard on a nbd device whose backend is storage
> cluster.
> The discard can help to free the unused space to storage pool.
>
> And sometimes application delete a lot of data and discard is flooding.
> Then we see the jbd2 commit kthread is blocked for a long time. Even
> move the discard out of jbd2, we still see the write IO of jbd2 log
> could be blocked. blk-wbt could help to relieve this. Finally the delay
> is shift to allocation path. But this is better than blocking the page
> fault path which holds the read mm->mmap_sem.

I'm assuming that the problem is that when the application deletes a
lot of data, the discard flood is causing performance problems on your
nbd server.  Is that the high level problem that you are trying to
solve?

So if that's the case, I'd suggest a different approach.  First, move
kmem_cache_free(ext4_free_data_cachep, entry) out of
ext4_free_data_in_buddy() to its caller, ext4_process_freed_data().
Then if discard is enabled, after calling ext4_free_data_in_buddy(),
the ext4_free_data struct will be detached from the rbtree rooted in
ext4_group_info.bb_free_root, and then we can attach it to a new
rbtree rooted in ext4_group_info.bb_discard_root.

This allows the block to be reused as soon as the commit is finished
(allowing for potentially more efficient block allocations), but we
can now keep track of which blocks would be useful for discarding and
decouple that from when we release the blocks to be reused.

We can now use the pre-existing fstrim kernel thread infrastructure to
lock a block group, and we can now iterate over the rbtree, and take
into account which blocks have since become allocated --- since if a
block has been allocated, there's no need to send a discard for it.  I
think this will be more efficient, and will allow us to share more of
the code for fstrim and the discard-at-runtime model used by "mount -o
discard".
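
To make the "skip blocks that have since been allocated" step concrete, here is a rough userspace sketch. It is not ext4 code: struct pending_discard, discard_fn, discard_still_free(), struct discard_log and log_discard() are made-up names for illustration, a plain bool array stands in for the block group's allocation bitmap, and a single extent stands in for one node of the proposed bb_discard_root tree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one entry in the proposed bb_discard_root
 * tree: a run of blocks freed by a committed transaction. */
struct pending_discard {
    unsigned int start;   /* first block within the group */
    unsigned int count;   /* number of blocks in the run */
};

/* Hypothetical callback standing in for "send a discard for this range". */
typedef void (*discard_fn)(unsigned int start, unsigned int count, void *ctx);

/*
 * Walk one pending extent against the group's allocation state
 * (in_use[i] is true when block i has been allocated again) and issue
 * discards only for the sub-ranges that are still free.
 * Returns the number of blocks covered by the issued discards.
 */
static unsigned int discard_still_free(const struct pending_discard *pd,
                                       const bool *in_use, size_t nblocks,
                                       discard_fn issue, void *ctx)
{
    unsigned int end = pd->start + pd->count;
    unsigned int i = pd->start;
    unsigned int total = 0;

    assert(end <= nblocks);

    while (i < end) {
        unsigned int run_start;

        /* Skip blocks that were reallocated after being freed. */
        while (i < end && in_use[i])
            i++;
        run_start = i;
        /* Extend the run across blocks that are still free. */
        while (i < end && !in_use[i])
            i++;
        if (i > run_start) {
            issue(run_start, i - run_start, ctx);
            total += i - run_start;
        }
    }
    return total;
}

/* Small recorder used only to observe which ranges would be discarded. */
struct discard_log {
    unsigned int starts[8];
    unsigned int counts[8];
    size_t n;
};

static void log_discard(unsigned int start, unsigned int count, void *ctx)
{
    struct discard_log *rec = ctx;

    if (rec->n < 8) {
        rec->starts[rec->n] = start;
        rec->counts[rec->n] = count;
        rec->n++;
    }
}
```

For example, if blocks 2-9 were freed and blocks 5 and 6 have since been reallocated, calling discard_still_free() with log_discard as the callback records two ranges, 2-4 and 7-9, instead of one discard covering all eight blocks.
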
We can also fine-tune how quickly we issue discards; if the user has
executed "rm -rf", it might actually be better to wait until the
deletes have completed, even if it takes several commit intervals,
since it might allow us to combine discards if the blocks 100-199 and
400-500 are released in one commit, and blocks 200-399 are released
two or three commits later.

Something else I'd urge you to consider is whether it's possible to
enhance the nbd protocol to add some kind of back-channel notification
when the shared storage is getting low on space.  In that case, when
the nbd client code receives a request from the nbd server indicating,
"please issue discards if possible", it could trigger an upcall to
userspace, which could then issue the fstrim ioctl; in the case where
"mount -o discard" is enabled, this would accelerate when discards
take place.  We could then make the fstrim thread normally work at a
much slower pace, but when there is a signal from the shared storage
that space is needed, clients could accelerate when they issue
discards to free up shared space.

Cheers,

					- Ted

P.S.  One other potential thought; if we have established a new
bb_discard_root rbtree, it *might* actually be beneficial to consider
using that information in the block allocator.  One of the best ways
to tell an SSD that a block is no longer needed is to simply overwrite
that block.  If we do that, we don't need to send a discard to that
block any more.  Of course, we still want to keep blocks contiguous
since even though seeks are free for SSDs, we want to keep large reads
contiguous as much as possible, and we want to keep the extent tree as
compact as possible.  But if we have just released a 12k file, and we
are writing a new 12k file, and don't really care *where* in the block
group we are writing that file, reusing blocks that had just been
freed might actually be a good strategy.
That's not something you need to implement in this patch series, but it might be an interesting optimization.
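
As an illustration of the discard-combining case mentioned earlier (blocks 100-199 and 400-500 released in one commit, 200-399 a few commits later), here is a rough userspace sketch. It is only a model: a small sorted array stands in for an rbtree, and struct extent, struct extent_list, MAX_EXTENTS and extent_list_add() are hypothetical names, not anything in ext4. It also assumes block numbers stay below UINT_MAX so that "last + 1" cannot wrap.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_EXTENTS 64

/* Inclusive block range [first, last] awaiting discard (hypothetical). */
struct extent {
    unsigned int first;
    unsigned int last;
};

/* Sorted list of disjoint, non-adjacent extents; stands in for an rbtree. */
struct extent_list {
    struct extent e[MAX_EXTENTS];
    size_t n;
};

/*
 * Add [first, last], coalescing it with every extent it overlaps or
 * touches.  Returns 0 on success, -1 if the list is full.
 * Assumes last < UINT_MAX so that "last + 1" does not wrap.
 */
static int extent_list_add(struct extent_list *l,
                           unsigned int first, unsigned int last)
{
    size_t i = 0, j;

    assert(first <= last);

    /* Skip extents that end before the new one and are not adjacent. */
    while (i < l->n && l->e[i].last + 1 < first)
        i++;

    /* Absorb every extent that overlaps or touches [first, last]. */
    j = i;
    while (j < l->n && l->e[j].first <= last + 1) {
        if (l->e[j].first < first)
            first = l->e[j].first;
        if (l->e[j].last > last)
            last = l->e[j].last;
        j++;
    }

    /* Nothing absorbed means the list grows by one entry. */
    if (j == i && l->n == MAX_EXTENTS)
        return -1;

    /* Replace e[i..j) with the single merged extent. */
    memmove(&l->e[i + 1], &l->e[j], (l->n - j) * sizeof(l->e[0]));
    l->e[i].first = first;
    l->e[i].last = last;
    l->n = l->n - (j - i) + 1;
    return 0;
}
```

Starting from an empty list, adding 100-199 and 400-500 leaves two pending extents; adding 200-399 afterwards collapses them into a single extent covering 100-500, which could then go out as one discard rather than three.
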