Message-ID: <5A448B7E.6050605@huawei.com>
Date: Thu, 28 Dec 2017 14:13:18 +0800
From: alex chen
To: Gang He
Subject: Re: [Ocfs2-devel] [PATCH] ocfs2: try a blocking lock before return AOP_TRUNCATED_PAGE
References: <1514366960-10588-1-git-send-email-ghe@suse.com>
 <5A4372AA.1080007@huawei.com>
 <5A43E86D020000F9000A0683@prv-mh.provo.novell.com>
 <5A4450AB.2000106@huawei.com>
 <5A44CC0E020000F9000A0759@prv-mh.provo.novell.com>
In-Reply-To: <5A44CC0E020000F9000A0759@prv-mh.provo.novell.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On 2017/12/28 10:48, Gang He wrote:
> Hi Alex,
>
>> Hi Gang,
>>
>> On 2017/12/27 18:37, Gang He wrote:
>>> Hi Jun,
>>>
>>>> Hi Gang,
>>>>
>>>> Do you mean that too many retries in the loop cost lots of CPU time and
>>>> block the page-fault interrupt? We should not add any delay in
>>>> ocfs2_fault(), right? And I am still a little confused about why your
>>>> method can solve this problem.
>>> You can see the related code in filemap_fault(): if ocfs2 fails to read a
>>> page because it cannot get the inode lock in non-blocking mode, the VFS
>>> layer invokes the ocfs2 read-page callback again and again, which leads
>>> to a softlockup (see the back trace below).
>>> So, we should take a blocking lock so that the DLM lock is granted to this
>>> node, which also avoids the CPU loop.
>> Can we use cond_resched() to let the thread release the CPU
>> temporarily to solve this softlockup?
> Yes, we can use cond_resched() to avoid this softlockup.
> In fact, if the kernel is configured with CONFIG_PREEMPT=y, the softlockup
> does not happen, since the kernel can preempt the looping thread.
> But this still wastes CPU: usage can reach about 80%-100% when multiple
> nodes read/write/mmap-access the same file concurrently, and the
> read/write/mmap-access speed is lower (a 50% decrease).
> Why?
> Because each node needs to get the DLM lock. Before one node gets it, another
> node has to down-convert its lock, which means flushing in-memory data to
> disk before the down-conversion.
> This disk I/O is very slow compared with CPU cycles, so the node that wants
> the DLM lock retries many times before the other node finishes
> down-converting. These retries achieve nothing; they just waste CPU cycles.
> So, if we add a blocking lock/unlock here, we avoid these unnecessary
> retries, especially with slow disks and more ocfs2 nodes (>=3).
> I ran the ocfs2 test case (multi_mmap in multiple_run.sh). With this patch
> applied, the CPU usage on each node was about 40%-50%, and the test case
> execution time was reduced by half.
> The full command is:
> multiple_run.sh -i eth0 -k ~/linux-4.4.21-69.tar.gz -o ~/ocfs2mullog -C hacluster -s pcmk -n nd1,nd2,nd3 -d /dev/sda1 -b 4096 -c 32768 -t multi_mmap /mnt/shared
> The shared storage is an iSCSI disk.
>
OK, I think it would be better if you added your test method and results to the change log.

Thanks,
Alex
> Thanks
> Gang
>
>>
>>> Second, based on my testing, the patch also improves efficiency when the
>>> same file is modified frequently from multiple nodes, since lock
>>> acquisition is fairer.
>>> In fact, this code was modified by patch 1cce4df04f37 ("ocfs2: do not
>>> lock/unlock() inode DLM lock"); before that patch the code was the same.
>>> This patch can be considered a revert of that patch, except that it adds
>>> clearer comments.
>> In patch 1cce4df04f37 ("ocfs2: do not lock/unlock() inode DLM lock"),
>> Goldwyn says that blocking lock and unlock will only make performance
>> worse where contention over the locks is high, which is the opposite of
>> what you describe above.
>> IMO, blocking lock and unlock here is indeed unnecessary.
>>
>> Thanks,
>> Alex
>>>
>>> Thanks
>>> Gang
>>>
>>>
>>>>
>>>> thanks,
>>>> Jun
>>>>
>>>> On 2017/12/27 17:29, Gang He wrote:
>>>>> If we can't get inode lock immediately in the function
>>>>> ocfs2_inode_lock_with_page() when reading a page, we should not
>>>>> return directly here, since this will lead to a softlockup problem.
>>>>> The method is to get a blocking lock and immediately unlock before
>>>>> returning, this can avoid CPU resource waste due to lots of retries,
>>>>> and benefits fairness in getting lock among multiple nodes, increase
>>>>> efficiency in case modifying the same file frequently from multiple
>>>>> nodes.
>>>>> The softlockup problem looks like,
>>>>> Kernel panic - not syncing: softlockup: hung tasks
>>>>> CPU: 0 PID: 885 Comm: multi_mmap Tainted: G L 4.12.14-6.1-default #1
>>>>> Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
>>>>> Call Trace:
>>>>>
>>>>> dump_stack+0x5c/0x82
>>>>> panic+0xd5/0x21e
>>>>> watchdog_timer_fn+0x208/0x210
>>>>> ? watchdog_park_threads+0x70/0x70
>>>>> __hrtimer_run_queues+0xcc/0x200
>>>>> hrtimer_interrupt+0xa6/0x1f0
>>>>> smp_apic_timer_interrupt+0x34/0x50
>>>>> apic_timer_interrupt+0x96/0xa0
>>>>>
>>>>> RIP: 0010:unlock_page+0x17/0x30
>>>>> RSP: 0000:ffffaf154080bc88 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
>>>>> RAX: dead000000000100 RBX: fffff21e009f5300 RCX: 0000000000000004
>>>>> RDX: dead0000000000ff RSI: 0000000000000202 RDI: fffff21e009f5300
>>>>> RBP: 0000000000000000 R08: 0000000000000000 R09: ffffaf154080bb00
>>>>> R10: ffffaf154080bc30 R11: 0000000000000040 R12: ffff993749a39518
>>>>> R13: 0000000000000000 R14: fffff21e009f5300 R15: fffff21e009f5300
>>>>> ocfs2_inode_lock_with_page+0x25/0x30 [ocfs2]
>>>>> ocfs2_readpage+0x41/0x2d0 [ocfs2]
>>>>> ? pagecache_get_page+0x30/0x200
>>>>> filemap_fault+0x12b/0x5c0
>>>>> ? recalc_sigpending+0x17/0x50
>>>>> ? __set_task_blocked+0x28/0x70
>>>>> ? __set_current_blocked+0x3d/0x60
>>>>> ocfs2_fault+0x29/0xb0 [ocfs2]
>>>>> __do_fault+0x1a/0xa0
>>>>> __handle_mm_fault+0xbe8/0x1090
>>>>> handle_mm_fault+0xaa/0x1f0
>>>>> __do_page_fault+0x235/0x4b0
>>>>> trace_do_page_fault+0x3c/0x110
>>>>> async_page_fault+0x28/0x30
>>>>> RIP: 0033:0x7fa75ded638e
>>>>> RSP: 002b:00007ffd6657db18 EFLAGS: 00010287
>>>>> RAX: 000055c7662fb700 RBX: 0000000000000001 RCX: 000055c7662fb700
>>>>> RDX: 0000000000001770 RSI: 00007fa75e909000 RDI: 000055c7662fb700
>>>>> RBP: 0000000000000003 R08: 000000000000000e R09: 0000000000000000
>>>>> R10: 0000000000000483 R11: 00007fa75ded61b0 R12: 00007fa75e90a770
>>>>> R13: 000000000000000e R14: 0000000000001770 R15: 0000000000000000
>>>>>
>>>>> Fixes: 1cce4df04f37 ("ocfs2: do not lock/unlock() inode DLM lock")
>>>>> Signed-off-by: Gang He
>>>>> ---
>>>>>  fs/ocfs2/dlmglue.c | 9 +++++++++
>>>>>  1 file changed, 9 insertions(+)
>>>>>
>>>>> diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c
>>>>> index 4689940..5193218 100644
>>>>> --- a/fs/ocfs2/dlmglue.c
>>>>> +++ b/fs/ocfs2/dlmglue.c
>>>>> @@ -2486,6 +2486,15 @@ int ocfs2_inode_lock_with_page(struct inode *inode,
>>>>>  	ret = ocfs2_inode_lock_full(inode, ret_bh, ex, OCFS2_LOCK_NONBLOCK);
>>>>>  	if (ret == -EAGAIN) {
>>>>>  		unlock_page(page);
>>>>> +		/*
>>>>> +		 * If we can't get inode lock immediately, we should not return
>>>>> +		 * directly here, since this will lead to a softlockup problem.
>>>>> +		 * The method is to get a blocking lock and immediately unlock
>>>>> +		 * before returning, this can avoid CPU resource waste due to
>>>>> +		 * lots of retries, and benefits fairness in getting lock.
>>>>> +		 */
>>>>> +		if (ocfs2_inode_lock(inode, ret_bh, ex) == 0)
>>>>> +			ocfs2_inode_unlock(inode, ex);
>>>>>  		ret = AOP_TRUNCATED_PAGE;
>>>>>  	}
>>>>>
>>>>>
>>>
>>> _______________________________________________
>>> Ocfs2-devel mailing list
>>> Ocfs2-devel@oss.oracle.com
>>> https://oss.oracle.com/mailman/listinfo/ocfs2-devel
>>>
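
For readers who want to see the retry behaviour discussed in this thread outside the kernel, here is a minimal user-space sketch in C. It is an analogy only, not ocfs2 code: a pthread mutex stands in for the inode DLM lock, a thread that sleeps 100 ms while holding it stands in for a remote node flushing data before down-conversion, and RETRY_PAGE, lock_with_page(), remote_holder() and count_retries() are hypothetical names. It has a single contender, so it says nothing about the fairness or high-contention concerns raised above.

```c
/* User-space analogy only. Nothing here is an ocfs2 or kernel API. */
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

#define RETRY_PAGE 1 /* hypothetical stand-in for AOP_TRUNCATED_PAGE */

static pthread_mutex_t inode_lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_int holder_ready;

static void sleep_ms(long ms)
{
    struct timespec ts = { ms / 1000, (ms % 1000) * 1000000L };
    nanosleep(&ts, NULL);
}

/* "Remote node": holds the lock while it flushes data (simulated by a sleep). */
static void *remote_holder(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&inode_lock);
    atomic_store(&holder_ready, 1);
    sleep_ms(100);
    pthread_mutex_unlock(&inode_lock);
    return NULL;
}

/*
 * Same shape as the patched path: try without blocking; on contention,
 * optionally wait for the lock and drop it again, then ask the caller to retry.
 */
static int lock_with_page(bool block_on_contention)
{
    if (pthread_mutex_trylock(&inode_lock) == 0)
        return 0; /* lock acquired; the caller must unlock it */
    if (block_on_contention) {
        pthread_mutex_lock(&inode_lock); /* sleep until the holder is done */
        pthread_mutex_unlock(&inode_lock);
    }
    return RETRY_PAGE;
}

/* Plays the role of the caller's retry loop; returns the number of retries. */
static long count_retries(bool block_on_contention)
{
    pthread_t tid;
    long retries = 0;
    int rc;

    atomic_store(&holder_ready, 0);
    rc = pthread_create(&tid, NULL, remote_holder, NULL);
    assert(rc == 0);
    while (!atomic_load(&holder_ready))
        sleep_ms(1); /* wait until the holder really owns the lock */

    while (lock_with_page(block_on_contention) == RETRY_PAGE)
        retries++;

    pthread_mutex_unlock(&inode_lock);
    pthread_join(tid, NULL);
    return retries;
}

#ifdef DEMO_MAIN
int main(void)
{
    printf("retries, non-blocking only: %ld\n", count_retries(false));
    printf("retries, block then retry:  %ld\n", count_retries(true));
    return 0;
}
#endif
```

Built with something like `cc -pthread -DDEMO_MAIN sketch.c` (the file name is arbitrary), the program prints both retry counts. With the blocking step the loop should retry at most once; without it the count depends on how fast the machine spins while the holder sleeps.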