From: Changwei Ge
To: Gang He, "jlbec@evilplan.org", "mfasheh@versity.com"
CC: "ocfs2-devel@oss.oracle.com", "linux-kernel@vger.kernel.org"
Subject: Re: [Ocfs2-devel] [PATCH v2 2/2] ocfs2: add trimfs lock to avoid duplicated trims in cluster
Date: Thu, 11 Jan 2018 06:59:54 +0000
Message-ID: <63ADC13FD55D6546B7DECE290D39E373F290F0AE@H3CMLB12-EX.srv.huawei-3com.com>

On 2018/1/11 12:31, Gang He wrote:
> Hi Changwei,
>
>
>>>>
>> On 2018/1/11 11:33, Gang He wrote:
>>> Hi Changwei,
>>>
>>>
>>>>>>
>>>> On 2018/1/11 10:07, Gang He wrote:
>>>>> Hi Changwei,
>>>>>
>>>>>
>>>>>>>>
>>>>>> On 2018/1/10 18:14, Gang He wrote:
>>>>>>> Hi Changwei,
>>>>>>>
>>>>>>>
>>>>>>>>>>
>>>>>>>> On 2018/1/10 17:05, Gang He wrote:
>>>>>>>>> Hi Changwei,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>> Hi Gang,
>>>>>>>>>>
>>>>>>>>>> On 2017/12/14 13:16, Gang He wrote:
>>>>>>>>>>> As you know, ocfs2 supports trimming the underlying disk via the
>>>>>>>>>>> fstrim command. But there is a problem: ocfs2 is a shared-disk
>>>>>>>>>>> cluster file system, so if the user configures a scheduled fstrim
>>>>>>>>>>> job on each file system node, multiple nodes will trim the shared
>>>>>>>>>>> disk simultaneously. This is very wasteful in CPU and IO
>>>>>>>>>>> consumption, and it might also negatively affect the lifetime of
>>>>>>>>>>> poor-quality SSD devices.
>>>>>>>>>>> So we introduce a trimfs DLM lock to let the nodes coordinate in
>>>>>>>>>>> this case, which makes only one fstrim command do the trimming on
>>>>>>>>>>> a shared disk in the cluster; the fstrim commands from the other
>>>>>>>>>>> nodes wait for the first fstrim to finish and return success
>>>>>>>>>>> directly, to avoid running the same trim on the shared disk again.
>>>>>>>>>>> Compared with the first version, I changed the fstrim command's
>>>>>>>>>>> return value and behavior for the case where an fstrim command is
>>>>>>>>>>> already running on a shared disk.
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Gang He
>>>>>>>>>>> ---
>>>>>>>>>>>  fs/ocfs2/alloc.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
>>>>>>>>>>>  1 file changed, 44 insertions(+)
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/fs/ocfs2/alloc.c b/fs/ocfs2/alloc.c
>>>>>>>>>>> index ab5105f..5c9c3e2 100644
>>>>>>>>>>> --- a/fs/ocfs2/alloc.c
>>>>>>>>>>> +++ b/fs/ocfs2/alloc.c
>>>>>>>>>>> @@ -7382,6 +7382,7 @@ int ocfs2_trim_fs(struct super_block *sb, struct fstrim_range *range)
>>>>>>>>>>>          struct buffer_head *gd_bh = NULL;
>>>>>>>>>>>          struct ocfs2_dinode *main_bm;
>>>>>>>>>>>          struct ocfs2_group_desc *gd = NULL;
>>>>>>>>>>> +        struct ocfs2_trim_fs_info info, *pinfo = NULL;
>>>>>>>>>>
>>>>>>>>>> I think *pinfo* is not necessary.
>>>>>>>>> This pointer is necessary, since it can be NULL or non-NULL
>>>>>>>>> depending on the code logic.
>>>>>>>>
>>>>>>>> This point is OK for me.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>          start = range->start >> osb->s_clustersize_bits;
>>>>>>>>>>>          len = range->len >> osb->s_clustersize_bits;
>>>>>>>>>>> @@ -7419,6 +7420,42 @@ int ocfs2_trim_fs(struct super_block *sb, struct fstrim_range *range)
>>>>>>>>>>>
>>>>>>>>>>>          trace_ocfs2_trim_fs(start, len, minlen);
>>>>>>>>>>>
>>>>>>>>>>> +        ocfs2_trim_fs_lock_res_init(osb);
>>>>>>>>>>> +        ret = ocfs2_trim_fs_lock(osb, NULL, 1);
>>>>>>>>>>
>>>>>>>>>> I don't get why you try to lock here and, if that fails, acquire
>>>>>>>>>> the same lock again later, waiting until it is granted.
>>>>>>>>> Please think about the use case; the patch is only meant to handle
>>>>>>>>> this case.
>>>>>>>>> When the administrator configures a scheduled fstrim task on each
>>>>>>>>> node, every node will trigger an fstrim on the shared disks
>>>>>>>>> concurrently.
>>>>>>>>> In this case, we should avoid duplicated fstrim runs on a shared
>>>>>>>>> disk, since they waste CPU/IO resources and can sometimes affect
>>>>>>>>> SSD lifetime.
>>>>>>>>
>>>>>>>> I'm not worried that trimfs will affect the SSD's lifetime very
>>>>>>>> much, since the physical-to-logical address mapping table resides
>>>>>>>> in RAM while the SSD is working, and that table won't be very
>>>>>>>> large. This point doesn't affect the patch; just a tip here.
>>>>>>> This depends on the SSD firmware implementation, but for secure
>>>>>>> trim it can really affect SSD lifetime.
>>>>>>>
>>>>>>>>> First, we use a trylock on the fstrim DLM lock to identify whether
>>>>>>>>> any other node is doing an fstrim on the disk.
>>>>>>>>> If not, this node is the first one and should do the fstrim
>>>>>>>>> operation on the disk.
>>>>>>>>> If yes, this node is not the first one; it should wait until the
>>>>>>>>> first node has finished the fstrim operation and then return the
>>>>>>>>> result from the DLM lock's value.
>>>>>>>>>
>>>>>>>>>> Can it just acquire the _trimfs_ lock as a blocking one directly here?
>>>>>>>>> We cannot do a blocking lock directly, since we need to identify
>>>>>>>>> whether any other node is already doing an fstrim operation when
>>>>>>>>> this node starts its fstrim.
>>>>>>>>
>>>>>>>> Thanks for your elaboration.
>>>>>>>>
>>>>>>>> Well, how about a third node trying to trim the fs too?
>>>>>>>> It needs the LVB from the second node.
>>>>>>>> But it seems that the second node can't provide a valid LVB.
>>>>>>>> So the third node will perform trimfs once more.
>>>>>>> No, the second node does not change the DLM lock's value, but the
>>>>>>> DLM lock's value is still valid.
>>>>>>> The third node also refers to this DLM lock's value and then follows
>>>>>>> the same logic as the second node.
>>>>>>
>>>>>> Hi Gang,
>>>>>> I don't see any place where ocfs2_lock_res::ocfs2_lock_res_ops::set_lvb
>>>>>> is set while the flag LOCK_TYPE_USES_LVB is added.
>>>>>>
>>>>>> Are you sure the code path below works well?
>>>>> Yes, I have done full testing on two and three nodes.
>>>>>
>>>>>> ocfs2_process_blocked_lock
>>>>>>   ocfs2_unblock_lock
>>>>>>     Reference to ::set_lvb since LOCK_TYPE_USES_LVB is set.
>>>>>>
>>>>> The set_lvb callback function is not necessary if we update the DLM
>>>>> lock value ourselves before unlocking.
>>>>
>>>> I think this may relate to the *LOCK_TYPE_REQUIRES_REFRESH* flag.
>>> Alright, I think the *LOCK_TYPE_REQUIRES_REFRESH* flag is harmless
>>> here, although it is probably unnecessary.
>>
>> OK, I agree with adding *LOCK_TYPE_REQUIRES_REFRESH* for now.
>> Can you give an explanation for my other concern, about three nodes
>> concurrently trimming the fs?
>>
>> For your convenience, I paste it here:
>>
>> The LVB passing path should be like below:
>>
>> NODE 1 lvb (EX granted at time1) -> NODE 2 lvb (EX granted at time2) ->
>> NODE 3 lvb (EX granted at time3), where time1 < time2 < time3.
>>
>> So I think NODE 3 can't obtain the LVB from NODE 1 but from NODE 2.
> Yes. If node1 finishes the fstrim successfully, node1 updates the DLM
> lock value; this DLM lock value is then valid, and the lock is
> unlocked/dropped.
> node2 returns the result from this DLM lock value; node2 does not change
> the DLM lock value (but the value is still valid) and unlocks/drops.
> node3 returns the result from this DLM lock value (which node1 updated);
> node3 does not change the DLM lock value (but the value is still valid)
> and unlocks/drops.
> After the last nodeN has returned and unlocked/dropped this lock
> resource, the DLM lock value becomes invalid.
>
> In the next round, the first node gets the lock and writes the fstrim
> result to the DLM lock value directly.
> The same logic applies as before.
>
> What's more, the DLM lock value validity is still handled correctly if
> some node suddenly crashes during the above case.

I got your point. You don't update the LVB when node 2 or node 3 gets the
EX lock granted, so the original LVB from node 1 will be transferred back
to node 1, right?

>
> Thanks
> Gang
>
>> Moreover, if node 1 is the master of the trimfs lock resource, node 1's
>> LVB will be updated to be the same as node 2's.
>>
>>>
>>>> Actually, I don't see why this flag is necessary for _orphan scan_.
>>>> Why can't _orphan scan_ also set the LVB during
>>>> ocfs2_process_blocked_lock->ocfs2_unblock_lock?
>>>>
>>>> And it seems that _orphan scan_ also doesn't need to persist anything
>>>> from the LVB to disk.
>>> More comments: please look at the dlmglue.c file carefully.
>>> set_lvb is a callback function, which is invoked automatically before
>>> downgrade.
>>> You can use this mechanism, but you do not have to do it like that;
>>> you just need to make sure to update the DLM lock value before
>>> unlock/downgrade.
>>>
>>> Thanks
>>> Gang
>>>
>>>>
>>>> Thanks,
>>>> Changwei
>>>>
>>>>> By the way, the code is transparent to the underlying DLM stack (o2cb
>>>>> or pcmk).
>>>>
>>>> True.
>>>>>
>>>>> Thanks
>>>>> Gang
>>>>>
>>>>>> Thanks,
>>>>>> Changwei
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> IOW, three nodes are trying to trim the fs concurrently. Is your
>>>>>>>> patch able to handle such a scenario?
>>>>>>> Yes, the patch can handle this case.
>>>>>>>
>>>>>>>>
>>>>>>>> Even a second lock request with QUEUE set that immediately follows
>>>>>>>> ocfs2_trim_fs_lock_res_uninit() will not get rid of concurrent
>>>>>>>> trimfs.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> +        if (ret < 0) {
>>>>>>>>>>> +                if (ret != -EAGAIN) {
>>>>>>>>>>> +                        mlog_errno(ret);
>>>>>>>>>>> +                        ocfs2_trim_fs_lock_res_uninit(osb);
>>>>>>>>>>> +                        goto out_unlock;
>>>>>>>>>>> +                }
>>>>>>>>>>> +
>>>>>>>>>>> +                mlog(ML_NOTICE, "Wait for trim on device (%s) to "
>>>>>>>>>>> +                     "finish, which is running from another node.\n",
>>>>>>>>>>> +                     osb->dev_str);
>>>>>>>>>>> +                ret = ocfs2_trim_fs_lock(osb, &info, 0);
>>>>>>>>>>> +                if (ret < 0) {
>>>>>>>>>>> +                        mlog_errno(ret);
>>>>>>>>>>> +                        ocfs2_trim_fs_lock_res_uninit(osb);
>>>>>>>>>>
>>>>>>>>>> In ocfs2_trim_fs_lock_res_uninit(), you drop the lock, but it was
>>>>>>>>>> never granted. Do we still need to drop the lock resource?
>>>>>>>>> Yes, we need to init/uninit the fstrim DLM lock resource each time.
>>>>>>>>> Otherwise, trylock does not work; this is a little different from
>>>>>>>>> other DLM lock usage in ocfs2.
>>>>>>>>
>>>>>>>> This point is OK for now, too.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> +                        goto out_unlock;
>>>>>>>>>>> +                }
>>>>>>>>>>> +
>>>>>>>>>>> +                if (info.tf_valid && info.tf_success &&
>>>>>>>>>>> +                    info.tf_start == start && info.tf_len == len &&
>>>>>>>>>>> +                    info.tf_minlen == minlen) {
>>>>>>>>>>> +                        /* Avoid sending duplicated trim to a shared device */
>>>>>>>>>>> +                        mlog(ML_NOTICE, "The same trim on device (%s) was "
>>>>>>>>>>> +                             "just done from node (%u), return.\n",
>>>>>>>>>>> +                             osb->dev_str, info.tf_nodenum);
>>>>>>>>>>> +                        range->len = info.tf_trimlen;
>>>>>>>>>>> +                        goto out_trimunlock;
>>>>>>>>>>> +                }
>>>>>>>>>>> +        }
>>>>>>>>>>> +
>>>>>>>>>>> +        info.tf_nodenum = osb->node_num;
>>>>>>>>>>> +        info.tf_start = start;
>>>>>>>>>>> +        info.tf_len = len;
>>>>>>>>>>> +        info.tf_minlen = minlen;
>>>>>>>>>>
>>>>>>>>>> If we failed while doing trimfs, I think we should not cache the
>>>>>>>>>> above info in the LVB.
>>>>>>>>> It is necessary. If the second node is waiting for the first node
>>>>>>>>> and the first node fails to do the fstrim,
>>>>>>>>> the first node should still update the DLM lock's value; then the
>>>>>>>>> second node can get the latest DLM lock value (rather than the
>>>>>>>>> previous one),
>>>>>>>>> and the second node will do the fstrim again, since the first node
>>>>>>>>> has failed.
>>>>>>>>
>>>>>>>> Yes, it makes sense.
>>>>>>>>>
>>>>>>>>>> BTW, it seems that this patch is on top of the 'try lock' patches
>>>>>>>>>> which you previously sent out.
>>>>>>>>>> Are they related?
>>>>>>>>> The try-lock patch is related to non-blocking AIO support for ocfs2.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Gang
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Changwei
>>>>>>>>>>
>>>>>>>>>>> +
>>>>>>>>>>>          /* Determine first and last group to examine based on start and len */
>>>>>>>>>>>          first_group = ocfs2_which_cluster_group(main_bm_inode, start);
>>>>>>>>>>>          if (first_group == osb->first_cluster_group_blkno)
>>>>>>>>>>> @@ -7463,6 +7500,13 @@ int ocfs2_trim_fs(struct super_block *sb, struct fstrim_range *range)
>>>>>>>>>>>                  group += ocfs2_clusters_to_blocks(sb, osb->bitmap_cpg);
>>>>>>>>>>>          }
>>>>>>>>>>>          range->len = trimmed * sb->s_blocksize;
>>>>>>>>>>> +
>>>>>>>>>>> +        info.tf_trimlen = range->len;
>>>>>>>>>>> +        info.tf_success = (ret ? 0 : 1);
>>>>>>>>>>> +        pinfo = &info;
>>>>>>>>>>> +out_trimunlock:
>>>>>>>>>>> +        ocfs2_trim_fs_unlock(osb, pinfo);
>>>>>>>>>>> +        ocfs2_trim_fs_lock_res_uninit(osb);
>>>>>>>>>>>  out_unlock:
>>>>>>>>>>>          ocfs2_inode_unlock(main_bm_inode, 0);
>>>>>>>>>>>          brelse(main_bm_bh);
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>
>>>
>
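To make the coordination flow discussed above easier to follow, here is a
minimal single-process C sketch of the protocol, under stated assumptions:
the "DLM" is modelled by a static struct standing in for the lock value
block (LVB), and try_trim_lock()/blocking_trim_lock()/trim_unlock()/do_trim()
are invented stand-ins for ocfs2_trim_fs_lock()/ocfs2_trim_fs_unlock() and
the group-by-group trimming loop. Only the field names mirror the tf_*
members of ocfs2_trim_fs_info from the quoted patch; this is an illustration
of the idea, not the kernel implementation.

/* Model of the trimfs coordination: NOT kernel code. */
#include <errno.h>
#include <stdint.h>
#include <stdio.h>

struct trim_info {
        int      valid;    /* LVB content is meaningful */
        int      success;  /* last trim on this range succeeded */
        uint32_t nodenum;  /* node that performed the last trim */
        uint64_t start, len, minlen, trimlen;
};

static struct trim_info lvb;   /* stands in for the DLM lock value block */
static int lock_busy;          /* stands in for another node holding EX */

/* Trylock: -EAGAIN when another node holds the lock, 0 when granted. */
static int try_trim_lock(void)
{
        return lock_busy ? -EAGAIN : 0;
}

/* Blocking acquire: in this model the holder has already finished. */
static void blocking_trim_lock(struct trim_info *out)
{
        lock_busy = 0;
        *out = lvb;            /* the LVB travels with the EX grant */
}

static void trim_unlock(const struct trim_info *in)
{
        lvb = *in;             /* publish the (possibly unchanged) LVB */
}

/* Stand-in for the actual trimming loop; pretend everything is trimmed. */
static uint64_t do_trim(uint64_t start, uint64_t len, uint64_t minlen)
{
        (void)start; (void)minlen;
        return len;
}

static uint64_t trim_fs(uint32_t node, uint64_t start, uint64_t len,
                        uint64_t minlen)
{
        struct trim_info info = { 0 };

        if (try_trim_lock() == -EAGAIN) {
                /* Another node is trimming: wait, then reuse its result. */
                blocking_trim_lock(&info);
                if (info.valid && info.success && info.start == start &&
                    info.len == len && info.minlen == minlen) {
                        trim_unlock(&info);   /* LVB left as the first node wrote it */
                        return info.trimlen;  /* skip the duplicate trim */
                }
        }

        /* First node, or previous attempt failed / covered another range. */
        info.nodenum = node;
        info.start = start;
        info.len = len;
        info.minlen = minlen;
        info.trimlen = do_trim(start, len, minlen);
        info.success = 1;
        info.valid = 1;
        trim_unlock(&info);                   /* publish the result via the LVB */
        return info.trimlen;
}

int main(void)
{
        printf("node 1 trimmed %llu bytes\n",
               (unsigned long long)trim_fs(1, 0, 1 << 20, 4096));
        lock_busy = 1;   /* pretend node 1 is mid-trim when node 2 starts */
        printf("node 2 reused  %llu bytes\n",
               (unsigned long long)trim_fs(2, 0, 1 << 20, 4096));
        return 0;
}

The trylock at the top is what lets a node distinguish "I am the first
trimmer" from "someone else is already trimming, so wait and reuse their
result via the LVB", which is the point made above about why a blocking
lock alone is not enough.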