From: harshad shirwadkar
Date: Sun, 21 Feb 2021 19:59:22 -0800
Subject: Re: [PATCH v2 4/5] ext4: improve cr 0 / cr 1 group scanning
To: Andreas Dilger
Cc: Благодаренко Артём, Ext4 Developers List, "Theodore Y. Ts'o",
    Alex Zhuravlev, Shuichi Ihara

Thank you for all the feedback, Andreas and Artem. Some comments below:

> > Walk along the ngroups linked elements in worst case for every
> > mb_free_blocks and mb_mark_used which are quite frequently executed
> > actions.
> > If double-linked list is used for avg_fragments this function will
> > make this change without iterating through the list:
> > 1. Check with previous element. If smaller, then commute
> > 2. Check with next element. If greater, then commute.

So given that groups are organized by avg_fragment_size in a tree, the
worst case for every mb_free_blocks() and mb_mark_used() is actually
log(ngroups). But I get your idea. The problem with doing that with an
rb_tree, though, is that rb_next() and rb_prev() are not constant-time
functions, since they need to traverse part of the tree to find the
next / previous element. So I think this optimization may not result in
a performance improvement.

> I was wondering about the cost of the list/tree maintenance as well,
> especially since there was a post from "kernel test robot" that this
> patch introduced a performance regression.

Yeah, I'm pretty sure that the kernel test robot is complaining mainly
because of list/tree maintenance. I think if this optimization is
turned off, we probably should not even maintain the tree / lists. That
has one downside, though: we would have to disallow setting this option
during remount, which I guess is okay?

> The tree insertion/removal overhead I think Artem's proposal above
> would improve, since it may be that a group will not move in the tree
> much?

Like I mentioned above, given that we have an average fragment size
tree, checking neighboring groups is not a constant-time operation. So
I don't think that will change performance much.
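To spell out the "not constant time" point: whenever a group's free
space changes, the group has to be erased from the avg_fragment_size
tree and re-inserted at its new position, and both steps are
O(log(ngroups)); there is no O(1) "swap with a neighbour" because
rb_prev()/rb_next() may themselves walk up and down the tree. A minimal
sketch of that update (using a stand-in struct rather than the real
ext4_group_info fields, and with locking omitted):

#include <linux/rbtree.h>

/* Stand-in for ext4_group_info with only what this sketch needs. */
struct frag_group {
	struct rb_node	node;		/* linkage in the avg-fragment-size tree */
	unsigned int	avg_frag_size;	/* key: average fragment size of the group */
};

/* Reposition @grp after its average fragment size changed to @new_size. */
static void frag_tree_update(struct rb_root *root, struct frag_group *grp,
			     unsigned int new_size)
{
	struct rb_node **link = &root->rb_node, *parent = NULL;

	rb_erase(&grp->node, root);		/* O(log ngroups) */
	grp->avg_frag_size = new_size;

	while (*link) {				/* O(log ngroups) re-insert */
		struct frag_group *cur;

		parent = *link;
		cur = rb_entry(parent, struct frag_group, node);
		if (new_size < cur->avg_frag_size)
			link = &(*link)->rb_left;
		else
			link = &(*link)->rb_right;
	}
	rb_link_node(&grp->node, parent, link);
	rb_insert_color(&grp->node, root);
}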
> It would also make sense for totally full groups to be kept out of the
> rb tree entirely, since they do not provide any value in that case (the
> full groups will never be selected for allocations), and they just add
> to the tree depth and potentially cause an imbalance if there are many
> of them. That also has the benefit of the rbtree efficiency *improving*
> as the filesystem gets more full, which is right when it is most needed.

Ack

> It might also make sense to keep totally empty groups out of the rbtree,
> since they should always be found in cr0 already if the allocation is
> large enough to fill the whole group? Having a smaller rbtree makes
> every insertion/removal that much more efficient.

Ack

> Those groups will naturally be re-added into the rbtree when they have
> blocks freed or allocated, so not much added complexity.
>
> Does it make sense to disable "mb_optimize_scan" if filesystems are
> smaller than a certain threshold? Clearly, if there are only 1-2
> groups, maintaining a list and rbtree has no real value, and with
> only a handful of groups (< 16?) linear searching is probably as fast
> or faster than maintaining the two data structures. That is similar
> to e.g. bubble sort vs. quicksort, where it is more efficient to sort
> a list of ~5-8 entries with a dumb/fast algorithm instead of a complex
> algorithm that is more efficient at larger scales. That would also
> (likely) quiet the kernel test robot, if we think that its testing is
> not representative of real-world usage.

Ack, these are good optimizations. I'll add these in V3.

Besides the optimizations mentioned here, I also think we should add an
"mb_optimize_linear_limit" (or similarly named) sysfs tunable that
controls how many groups mballoc should search linearly before falling
back to the tree / lists for allocation. That would help us with disk
seek time. We discussed on our last call that we probably should
consult the block device's request queue to check whether the
underlying device is rotational. However, we also discussed that for
more complex devices (such as DM devices set up on top of a mix of HDDs
and SSDs), whether the device is rotational is not a binary answer, and
we would need a richer interface (such as a logical-range to
"is_rotational" table) to make an intelligent choice in the file
system. It is also not clear whether such a table should be passed to
the file system at mkfs time, at mount time, or at run time.

Given the number of unknowns in the above discussion, I propose that we
start simple and evolve later. My proposal is that we add an
"mb_optimize_linear_limit" tunable that accepts an integer value. In
the kernel, for non-rotational devices, that value would default to 0
(meaning no linear scan), and for rotational devices it would default
to some reasonable value (I'm not sure yet what that should be -- 4?).
This default could be overridden through the sysfs interface, and we
can later evolve the interface to accept more complex input such as a
logical-range to rotational-status mapping. Does that sound reasonable?
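Concretely, the default selection I have in mind would be along these
lines (the helper name, the constant, and the default of 4 are
placeholders, and the sysfs plumbing is omitted):

#include <linux/fs.h>
#include <linux/blkdev.h>

/* Placeholder default for rotational devices -- exact value TBD. */
#define MB_DEFAULT_LINEAR_LIMIT	4

/*
 * Sketch: choose how many groups to scan linearly before falling back
 * to the tree / list based scan.  0 disables the linear scan.
 */
static unsigned int ext4_mb_default_linear_limit(struct super_block *sb)
{
	struct request_queue *q = bdev_get_queue(sb->s_bdev);

	/* Non-rotational device: seek time is irrelevant, skip the linear scan. */
	if (blk_queue_nonrot(q))
		return 0;

	return MB_DEFAULT_LINEAR_LIMIT;
}

The sysfs file would then simply overwrite whatever value was picked at
mount time.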
Thanks,
Harshad

> > On Feb 11, 2021, at 3:30 AM, Andreas Dilger wrote:
> >> This function would be more efficient to do the list move under a single
> >> write lock if the order doesn't change.  The order loop would just
> >> save the largest free order, then grab the write lock, do the list_del(),
> >> set bb_largest_free_order, and list_add_tail():
> >>
> >> mb_set_largest_free_order(struct super_block *sb, struct ext4_group_info *grp)
> >> {
> >>         struct ext4_sb_info *sbi = EXT4_SB(sb);
> >>         int i, new_order = -1;
> >>
> >>         for (i = MB_NUM_ORDERS(sb) - 1; i >= 0; i--) {
> >>                 if (grp->bb_counters[i] > 0) {
> >>                         new_order = i;
> >>                         break;
> >>                 }
> >>         }
> >>         if (test_opt2(sb, MB_OPTIMIZE_SCAN) && grp->bb_largest_free_order >= 0) {
> >>                 write_lock(&sbi->s_mb_largest_free_orders_locks[
> >>                                         grp->bb_largest_free_order]);
> >>                 list_del_init(&grp->bb_largest_free_order_node);
> >>
> >>                 if (new_order != grp->bb_largest_free_order) {
> >>                         write_unlock(&sbi->s_mb_largest_free_orders_locks[
> >>                                         grp->bb_largest_free_order]);
> >>                         grp->bb_largest_free_order = new_order;
> >>                         write_lock(&sbi->s_mb_largest_free_orders_locks[
> >>                                         grp->bb_largest_free_order]);
> >>                 }
> >>                 list_add_tail(&grp->bb_largest_free_order_node,
> >>                         &sbi->s_mb_largest_free_orders[grp->bb_largest_free_order]);
> >>                 write_unlock(&sbi->s_mb_largest_free_orders_locks[
> >>                                         grp->bb_largest_free_order]);
> >>         }
> >> }
>
> In looking at my previous comment, I wonder if we could further reduce
> the list locking here by not moving an entry to the end of the *same*
> list if it is not currently at the head?  Since it was (presumably)
> just moved to the end of the list by a recent allocation, it is very
> likely that some other group will be chosen from the list head, so
> moving within the list to maintain strict LRU is probably just extra
> locking overhead that can be avoided...
>
> Also, it isn't clear if *freeing* blocks from a group should move it
> to the end of the same list, or just leave it as-is?  If there are
> more frees from the list it is likely to be added to a new list soon,
> and if there are no more frees, then it could stay in the same order.
>
> Cheers, Andreas
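PS: for the list-move question in the quoted text above, the head check
could be as small as the following (reusing the field names from the
quoted function; just an untested sketch, and the check is deliberately
done opportunistically, without taking the lock first):

#include <linux/list.h>

/*
 * Sketch of the "only requeue when at the head" idea: decide whether
 * mb_set_largest_free_order() needs to touch the list at all.
 */
static bool mb_should_requeue(struct ext4_sb_info *sbi,
			      struct ext4_group_info *grp, int new_order)
{
	/* Moving to a different order list: the move is required. */
	if (new_order != grp->bb_largest_free_order)
		return true;

	/*
	 * Same list: only refresh the LRU position if the group is the
	 * current head, i.e. the next one an allocation would pick.
	 * Racing with a concurrent move is harmless -- worst case we
	 * do a move we could have skipped, or skip one extra move.
	 */
	return list_is_first(&grp->bb_largest_free_order_node,
			     &sbi->s_mb_largest_free_orders[new_order]);
}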