From: Changwei Ge
To: Gang He, jlbec@evilplan.org, mfasheh@versity.com
Cc: ocfs2-devel@oss.oracle.com, linux-kernel@vger.kernel.org
Subject: Re: [Ocfs2-devel] [PATCH v2 2/2] ocfs2: add trimfs lock to avoid duplicated trims in cluster
Date: Wed, 10 Jan 2018 10:51:37 +0000
Message-ID: <63ADC13FD55D6546B7DECE290D39E373F290E906@H3CMLB12-EX.srv.huawei-3com.com>

Hi Gang,

On 2018/1/10 18:14, Gang He wrote:
> Hi Changwei,
>
>
>>>>
>> On 2018/1/10 17:05, Gang He wrote:
>>> Hi Changwei,
>>>
>>>
>>>>>>
>>>> Hi Gang,
>>>>
>>>> On 2017/12/14 13:16, Gang He wrote:
>>>>> As you know, ocfs2 supports trimming the underlying disk via the
>>>>> fstrim command. But there is a problem: ocfs2 is a shared-disk
>>>>> cluster file system, so if the user configures a scheduled fstrim
>>>>> job on each file system node, multiple nodes will trim the shared
>>>>> disk simultaneously. This wastes CPU and IO and might also
>>>>> negatively affect the lifetime of poor-quality SSD devices.
>>>>> So we introduce a trimfs dlm lock to coordinate the nodes in this
>>>>> case: only one fstrim command does the trimming on a shared disk
>>>>> in the cluster, while the fstrim commands from the other nodes
>>>>> wait for the first one to finish and then return success directly,
>>>>> to avoid running the same trim on the shared disk again.
>>>>>
>>>>> Compared with the first version, I changed the fstrim command's
>>>>> return value and behavior for the case where another fstrim
>>>>> command is already running on the shared disk.
>>>>>
>>>>> Signed-off-by: Gang He
>>>>> ---
>>>>>  fs/ocfs2/alloc.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
>>>>>  1 file changed, 44 insertions(+)
>>>>>
>>>>> diff --git a/fs/ocfs2/alloc.c b/fs/ocfs2/alloc.c
>>>>> index ab5105f..5c9c3e2 100644
>>>>> --- a/fs/ocfs2/alloc.c
>>>>> +++ b/fs/ocfs2/alloc.c
>>>>> @@ -7382,6 +7382,7 @@ int ocfs2_trim_fs(struct super_block *sb, struct fstrim_range *range)
>>>>>         struct buffer_head *gd_bh = NULL;
>>>>>         struct ocfs2_dinode *main_bm;
>>>>>         struct ocfs2_group_desc *gd = NULL;
>>>>> +       struct ocfs2_trim_fs_info info, *pinfo = NULL;
>>>>
>>>> I think *pinfo* is not necessary.
>>> This pointer is necessary, since it can be NULL or non-NULL depending on
>>> the code logic.
>>
>> This point is OK for me.
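For the record, the way I read the NULL/non-NULL use of *pinfo*: it stays
NULL on every path where no trim result should be published, and only points
at *info* after this node has actually run a trim, so the unlock helper can
tell the two cases apart. ocfs2_trim_fs_unlock() is not defined in the hunks
quoted here, so the tiny user-space sketch below is only my guess at that
convention (trim_fs_unlock() and lockres_lvb are made-up names, not ocfs2
symbols):

#include <stddef.h>
#include <string.h>

struct trim_fs_info { int tf_valid; /* ...tf_start, tf_len, ... */ };

static struct trim_fs_info lockres_lvb;     /* stands in for the lock's LVB */

static void trim_fs_unlock(const struct trim_fs_info *info)
{
        if (info)            /* pinfo == &info: a trim ran, publish its result */
                memcpy(&lockres_lvb, info, sizeof(*info));
        /* pinfo == NULL: we skipped the trim, leave the previous LVB intact */
}

int main(void)
{
        struct trim_fs_info mine = { .tf_valid = 1 };

        trim_fs_unlock(&mine);   /* this node trimmed: publish */
        trim_fs_unlock(NULL);    /* this node skipped: LVB left untouched */
        return 0;
}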
>>
>>>
>>>>>
>>>>>         start = range->start >> osb->s_clustersize_bits;
>>>>>         len = range->len >> osb->s_clustersize_bits;
>>>>> @@ -7419,6 +7420,42 @@ int ocfs2_trim_fs(struct super_block *sb, struct fstrim_range *range)
>>>>>
>>>>>         trace_ocfs2_trim_fs(start, len, minlen);
>>>>>
>>>>> +       ocfs2_trim_fs_lock_res_init(osb);
>>>>> +       ret = ocfs2_trim_fs_lock(osb, NULL, 1);
>>>>
>>>> I don't get why try to lock here and, if it fails, acquire the same lock
>>>> again later but wait until granted.
>>> Please think about the use case; the patch is only meant to handle this
>>> case.
>>> When the administrator configures a scheduled fstrim task on each node,
>>> each node will trigger an fstrim on the shared disks concurrently.
>>> In this case, we should avoid a duplicated fstrim on a shared disk, since
>>> this would waste CPU/IO resources and can sometimes affect SSD lifetime.
>>
>> I'm not worried that trimfs will affect the SSD's lifetime that much, since
>> the physical-to-logical address mapping table resides in RAM while the SSD
>> is working, and that table won't be at a big scale. This point does not
>> affect the patch; just a tip here.
> This depends on the SSD firmware implementation, but for secure trim it
> really can affect SSD lifetime.
>
>>> Firstly, we use trylock to get the fstrim dlm lock, to identify whether
>>> any other node is doing an fstrim on the disk.
>>> If not, this node is the first one and should do the fstrim operation on
>>> the disk.
>>> If yes, this node is not the first one; it should wait until the first
>>> node has finished its fstrim operation, then return the result from the
>>> DLM lock's value.
>>>
>>>> Can it just acquire the _trimfs_ lock as a blocking one directly here?
>>> We cannot take a blocking lock directly, since we need to identify whether
>>> any other node was already doing an fstrim operation when this node
>>> started its fstrim.
>>
>> Thanks for your elaboration.
>>
>> Well, how about a third node trying to trim the fs too?
>> It needs the LVB from the second node.
>> But it seems that the second node can't provide a valid LVB.
>> So the third node will perform trimfs once more.
> No, the second node does not change the DLM lock's value, but the DLM lock's
> value is still valid.
> The third node also refers to this DLM lock's value and then follows the
> same logic as the second node.

Um, as far as I know SUSE is using fs/dlm, which I am not familiar with.
But for ocfs2/dlm, the LVB passing path should be like below:

  NODE 1 lvb (EX granted at time1) -> NODE 2 lvb (EX granted at time2)
    -> NODE 3 lvb (EX granted at time3), with time1 < time2 < time3.

So I think NODE 3 can't obtain the LVB from NODE 1, only from NODE 2.
Moreover, if node 1 is the master of the trimfs lock resource, node 1's LVB
will be updated to be the same as node 2's.

Thanks,
Changwei

>
>>
>> IOW, three nodes are trying to trim the fs concurrently. Is your patch able
>> to handle such a scenario?
> Yes, the patch can handle this case.
>
>>
>> Even the second lock request with QUEUE set, just following
>> ocfs2_trim_fs_lock_res_uninit(), will not get rid of concurrent trimfs.
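To check that we mean the same thing about the handoff, here is a sequential
user-space model of how I understand the intended behavior. The "LVB" is just
a shared struct, the DLM locking itself is left out, and all names except the
tf_* fields are made up, so this is only a sketch of the semantics, not the
ocfs2 code:

#include <stdio.h>
#include <stdint.h>

struct trim_fs_info {          /* models ocfs2_trim_fs_info / the LVB payload */
        int      tf_valid;     /* LVB contains a result */
        int      tf_success;   /* last trim succeeded */
        unsigned tf_nodenum;   /* node that performed it */
        uint64_t tf_start, tf_len, tf_minlen, tf_trimlen;
};

static struct trim_fs_info lvb;   /* stands in for the trimfs lock's LVB */

/* One fstrim invocation on one node. Returns bytes "trimmed". */
static uint64_t node_fstrim(unsigned node, uint64_t start, uint64_t len,
                            uint64_t minlen)
{
        /* A later node: if the LVB already records the same successful trim,
         * skip it and do not rewrite the LVB. */
        if (lvb.tf_valid && lvb.tf_success &&
            lvb.tf_start == start && lvb.tf_len == len &&
            lvb.tf_minlen == minlen) {
                printf("node %u: same trim already done by node %u, skip\n",
                       node, lvb.tf_nodenum);
                return lvb.tf_trimlen;
        }

        /* First node (or a retry after a failure): trim and publish result. */
        uint64_t trimmed = len;           /* pretend the whole range was trimmed */
        lvb = (struct trim_fs_info){ .tf_valid = 1, .tf_success = 1,
                                     .tf_nodenum = node, .tf_start = start,
                                     .tf_len = len, .tf_minlen = minlen,
                                     .tf_trimlen = trimmed };
        printf("node %u: performed trim of %llu bytes\n",
               node, (unsigned long long)trimmed);
        return trimmed;
}

int main(void)
{
        /* Three nodes run the same scheduled fstrim one after another. */
        node_fstrim(1, 0, 1 << 20, 4096);   /* trims, publishes LVB */
        node_fstrim(2, 0, 1 << 20, 4096);   /* skips, LVB unchanged */
        node_fstrim(3, 0, 1 << 20, 4096);   /* still sees node 1's result */
        return 0;
}

In this model the second and third nodes never rewrite the LVB, which is the
property the three-node question above hinges on.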
>>
>>>
>>>>
>>>>> +       if (ret < 0) {
>>>>> +               if (ret != -EAGAIN) {
>>>>> +                       mlog_errno(ret);
>>>>> +                       ocfs2_trim_fs_lock_res_uninit(osb);
>>>>> +                       goto out_unlock;
>>>>> +               }
>>>>> +
>>>>> +               mlog(ML_NOTICE, "Wait for trim on device (%s) to "
>>>>> +                    "finish, which is running from another node.\n",
>>>>> +                    osb->dev_str);
>>>>> +               ret = ocfs2_trim_fs_lock(osb, &info, 0);
>>>>> +               if (ret < 0) {
>>>>> +                       mlog_errno(ret);
>>>>> +                       ocfs2_trim_fs_lock_res_uninit(osb);
>>>>
>>>> In ocfs2_trim_fs_lock_res_uninit(), you drop the lock. But it was never
>>>> granted. Do we still need to drop the lock resource?
>>> Yes, we need to init/uninit the fstrim dlm lock resource each time.
>>> Otherwise, trylock does not work; this is a little different from other
>>> dlm lock usage in ocfs2.
>>
>> This point is OK for now, too.
>>>
>>>>
>>>>> +                       goto out_unlock;
>>>>> +               }
>>>>> +
>>>>> +               if (info.tf_valid && info.tf_success &&
>>>>> +                   info.tf_start == start && info.tf_len == len &&
>>>>> +                   info.tf_minlen == minlen) {
>>>>> +                       /* Avoid sending duplicated trim to a shared device */
>>>>> +                       mlog(ML_NOTICE, "The same trim on device (%s) was "
>>>>> +                            "just done from node (%u), return.\n",
>>>>> +                            osb->dev_str, info.tf_nodenum);
>>>>> +                       range->len = info.tf_trimlen;
>>>>> +                       goto out_trimunlock;
>>>>> +               }
>>>>> +       }
>>>>> +
>>>>> +       info.tf_nodenum = osb->node_num;
>>>>> +       info.tf_start = start;
>>>>> +       info.tf_len = len;
>>>>> +       info.tf_minlen = minlen;
>>>>
>>>> If we failed while doing trimfs, I think we should not cache the above
>>>> info in the LVB.
>>> It is necessary. If the second node is waiting on the first node and the
>>> first node fails to do the fstrim,
>>> the first node should still update the dlm lock's value; then the second
>>> node can get the latest dlm lock value (rather than the value from last
>>> time),
>>> and the second node will do the fstrim again, since the first node has
>>> failed.
>>
>> Yes, it makes sense.
>>>
>>>> BTW, it seems that this patch is on top of the 'try lock' patches which
>>>> you previously sent out.
>>>> Are they related?
>>> The try lock patch is related to non-blocking aio support for ocfs2.
>>>
>>> Thanks
>>> Gang
>>>>
>>>> Thanks,
>>>> Changwei
>>>>
>>>>> +
>>>>>         /* Determine first and last group to examine based on start and len */
>>>>>         first_group = ocfs2_which_cluster_group(main_bm_inode, start);
>>>>>         if (first_group == osb->first_cluster_group_blkno)
>>>>> @@ -7463,6 +7500,13 @@ int ocfs2_trim_fs(struct super_block *sb, struct fstrim_range *range)
>>>>>                 group += ocfs2_clusters_to_blocks(sb, osb->bitmap_cpg);
>>>>>         }
>>>>>         range->len = trimmed * sb->s_blocksize;
>>>>> +
>>>>> +       info.tf_trimlen = range->len;
>>>>> +       info.tf_success = (ret ? 0 : 1);
>>>>> +       pinfo = &info;
>>>>> +out_trimunlock:
>>>>> +       ocfs2_trim_fs_unlock(osb, pinfo);
>>>>> +       ocfs2_trim_fs_lock_res_uninit(osb);
>>>>>   out_unlock:
>>>>>         ocfs2_inode_unlock(main_bm_inode, 0);
>>>>>         brelse(main_bm_bh);
>>>>>
>>>
>
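One more note on the "cache the info in the LVB even on failure" point above.
My understanding of the intent, again as a small user-space sketch rather
than the actual kernel code (publish_trim_result() is a made-up helper, not a
function in the patch):

#include <stdio.h>

struct trim_info {                 /* minimal stand-in for ocfs2_trim_fs_info */
        int tf_valid, tf_success;
        unsigned int tf_nodenum;
        unsigned long long tf_trimlen;
};

/* Publish the outcome into the LVB payload even when the trim failed:
 * tf_success is cleared, so a waiter comparing the LVB against its own
 * (start, len, minlen) will not treat the range as "already done" and will
 * run the trim itself. Mirrors info.tf_success = (ret ? 0 : 1) in the hunk.
 */
static void publish_trim_result(struct trim_info *info, unsigned int node,
                                unsigned long long trimlen, int ret)
{
        info->tf_valid   = 1;
        info->tf_nodenum = node;
        info->tf_trimlen = trimlen;
        info->tf_success = ret ? 0 : 1;
}

int main(void)
{
        struct trim_info lvb = { 0 };

        publish_trim_result(&lvb, 1, 0, -5 /* e.g. an error from the trim loop */);
        printf("valid=%d success=%d -> a waiting node will redo the trim\n",
               lvb.tf_valid, lvb.tf_success);
        return 0;
}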