Subject: Re: [Ocfs2-devel] [PATCH v2] ocfs2: try a blocking lock before return AOP_TRUNCATED_PAGE
To: Gang He, mfasheh@versity.com, jlbec@evilplan.org
Cc: linux-kernel@vger.kernel.org, ocfs2-devel@oss.oracle.com
References: <1514447305-30814-1-git-send-email-ghe@suse.com>
From: Joseph Qi
Date: Thu, 28 Dec 2017 17:52:21 +0800
In-Reply-To: <1514447305-30814-1-git-send-email-ghe@suse.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On 17/12/28 15:48, Gang He wrote:
> If we can't get inode lock immediately in the function
> ocfs2_inode_lock_with_page() when reading a page, we should not
> return directly here, since this will lead to a softlockup problem
> when the kernel is configured with CONFIG_PREEMPT is not set.
> The method is to get a blocking lock and immediately unlock before
> returning, this can avoid CPU resource waste due to lots of retries,
> and benefits fairness in getting lock among multiple nodes, increase
> efficiency in case modifying the same file frequently from multiple
> nodes.
> The softlockup crash (when set /proc/sys/kernel/softlockup_panic to 1)
> looks like,
> Kernel panic - not syncing: softlockup: hung tasks
> CPU: 0 PID: 885 Comm: multi_mmap Tainted: G L 4.12.14-6.1-default #1
> Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> Call Trace:
>
> dump_stack+0x5c/0x82
> panic+0xd5/0x21e
> watchdog_timer_fn+0x208/0x210
> ? watchdog_park_threads+0x70/0x70
> __hrtimer_run_queues+0xcc/0x200
> hrtimer_interrupt+0xa6/0x1f0
> smp_apic_timer_interrupt+0x34/0x50
> apic_timer_interrupt+0x96/0xa0
>
> RIP: 0010:unlock_page+0x17/0x30
> RSP: 0000:ffffaf154080bc88 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
> RAX: dead000000000100 RBX: fffff21e009f5300 RCX: 0000000000000004
> RDX: dead0000000000ff RSI: 0000000000000202 RDI: fffff21e009f5300
> RBP: 0000000000000000 R08: 0000000000000000 R09: ffffaf154080bb00
> R10: ffffaf154080bc30 R11: 0000000000000040 R12: ffff993749a39518
> R13: 0000000000000000 R14: fffff21e009f5300 R15: fffff21e009f5300
> ocfs2_inode_lock_with_page+0x25/0x30 [ocfs2]
> ocfs2_readpage+0x41/0x2d0 [ocfs2]
> ? pagecache_get_page+0x30/0x200
> filemap_fault+0x12b/0x5c0
> ? recalc_sigpending+0x17/0x50
> ? __set_task_blocked+0x28/0x70
> ? __set_current_blocked+0x3d/0x60
> ocfs2_fault+0x29/0xb0 [ocfs2]
> __do_fault+0x1a/0xa0
> __handle_mm_fault+0xbe8/0x1090
> handle_mm_fault+0xaa/0x1f0
> __do_page_fault+0x235/0x4b0
> trace_do_page_fault+0x3c/0x110
> async_page_fault+0x28/0x30
> RIP: 0033:0x7fa75ded638e
> RSP: 002b:00007ffd6657db18 EFLAGS: 00010287
> RAX: 000055c7662fb700 RBX: 0000000000000001 RCX: 000055c7662fb700
> RDX: 0000000000001770 RSI: 00007fa75e909000 RDI: 000055c7662fb700
> RBP: 0000000000000003 R08: 000000000000000e R09: 0000000000000000
> R10: 0000000000000483 R11: 00007fa75ded61b0 R12: 00007fa75e90a770
> R13: 000000000000000e R14: 0000000000001770 R15: 0000000000000000
>
> About performance improvement, we can see the testing time is reduced,
> and CPU utilization decreases, the detailed data is as follows.
> I ran multi_mmap test case in ocfs2-test package in a three nodes cluster.
> Before apply this patch,
>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
>  2754 ocfs2te+  20   0  170248   6980   4856 D 80.73 0.341   0:18.71 multi_mmap
>  1505 root      rt   0  222236 123060  97224 S 2.658 6.015   0:01.44 corosync
>     5 root      20   0       0      0      0 S 1.329 0.000   0:00.19 kworker/u8:0
>    95 root      20   0       0      0      0 S 1.329 0.000   0:00.25 kworker/u8:1
>  2728 root      20   0       0      0      0 S 0.997 0.000   0:00.24 jbd2/sda1-33
>  2721 root      20   0       0      0      0 S 0.664 0.000   0:00.07 ocfs2dc-3C8CFD4
>  2750 ocfs2te+  20   0  142976   4652   3532 S 0.664 0.227   0:00.28 mpirun
>
> ocfs2test@tb-node2:~>multiple_run.sh -i ens3 -k ~/linux-4.4.21-69.tar.gz -o
> ~/ocfs2mullog -C hacluster -s pcmk -n tb-node2,tb-node1,tb-node3 -d
> /dev/sda1 -b 4096 -c 32768 -t multi_mmap /mnt/shared
> Tests with "-b 4096 -C 32768"
> Thu Dec 28 14:44:52 CST 2017
> multi_mmap..................................................Passed.
> Runtime 783 seconds.
>
> After apply this patch,
>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
>  2508 ocfs2te+  20   0  170248   6804   4680 R 54.00 0.333   0:55.37 multi_mmap
>   155 root      20   0       0      0      0 S 2.667 0.000   0:01.20 kworker/u8:3
>    95 root      20   0       0      0      0 S 2.000 0.000   0:01.58 kworker/u8:1
>  2504 ocfs2te+  20   0  142976   4604   3480 R 1.667 0.225   0:01.65 mpirun
>     5 root      20   0       0      0      0 S 1.000 0.000   0:01.36 kworker/u8:0
>  2482 root      20   0       0      0      0 S 1.000 0.000   0:00.86 jbd2/sda1-33
>   299 root       0 -20       0      0      0 S 0.333 0.000   0:00.13 kworker/2:1H
>   335 root       0 -20       0      0      0 S 0.333 0.000   0:00.17 kworker/1:1H
>   535 root      20   0   12140   7268   1456 S 0.333 0.355   0:00.34 haveged
>  1282 root      rt   0  222284 123108  97224 S 0.333 6.017   0:01.33 corosync
>
> ocfs2test@tb-node2:~>multiple_run.sh -i ens3 -k ~/linux-4.4.21-69.tar.gz -o
> ~/ocfs2mullog -C hacluster -s pcmk -n tb-node2,tb-node1,tb-node3 -d
> /dev/sda1 -b 4096 -c 32768 -t multi_mmap /mnt/shared
> Tests with "-b 4096 -C 32768"
> Thu Dec 28 15:04:12 CST 2017
> multi_mmap..................................................Passed.
> Runtime 487 seconds.
>
> Fixes: 1cce4df04f37 ("ocfs2: do not lock/unlock() inode DLM lock")
> Signed-off-by: Gang He

Reviewed-by: Joseph Qi

> ---
>  fs/ocfs2/dlmglue.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
>
> diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c
> index 4689940..5193218 100644
> --- a/fs/ocfs2/dlmglue.c
> +++ b/fs/ocfs2/dlmglue.c
> @@ -2486,6 +2486,15 @@ int ocfs2_inode_lock_with_page(struct inode *inode,
>  	ret = ocfs2_inode_lock_full(inode, ret_bh, ex, OCFS2_LOCK_NONBLOCK);
>  	if (ret == -EAGAIN) {
>  		unlock_page(page);
> +		/*
> +		 * If we can't get inode lock immediately, we should not return
> +		 * directly here, since this will lead to a softlockup problem.
> +		 * The method is to get a blocking lock and immediately unlock
> +		 * before returning, this can avoid CPU resource waste due to
> +		 * lots of retries, and benefits fairness in getting lock.
> +		 */
> +		if (ocfs2_inode_lock(inode, ret_bh, ex) == 0)
> +			ocfs2_inode_unlock(inode, ex);
>  		ret = AOP_TRUNCATED_PAGE;
>  	}
>
>
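
For readers who want to see the effect of the added wait outside the kernel,
here is a minimal user-space sketch in C. It is an analogy only, under these
assumptions: a pthread mutex stands in for the inode cluster lock, and
SKETCH_RETRY and every sketch_* name are hypothetical, not ocfs2 or VFS API.
The reader thread plays the page-fault path that keeps retrying; with
wait_for_holder set it blocks once for the current holder before returning
the retry code, which is the idea of the hunk above.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <time.h>

/* Hypothetical stand-in for AOP_TRUNCATED_PAGE; negative so it cannot
 * collide with the positive errno values pthread functions return. */
#define SKETCH_RETRY (-1)

static pthread_mutex_t sketch_lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_int sketch_busy_seen; /* set once a trylock has failed */

struct sketch_reader {
	int wait_for_holder; /* 1: patched behaviour, 0: old behaviour */
	long retries;        /* how many times the caller had to retry */
};

/* Assumed analogue of ocfs2_inode_lock_with_page(): 0 means lock held. */
static int sketch_lock_with_page(int wait_for_holder)
{
	int ret = pthread_mutex_trylock(&sketch_lock);

	if (ret == EBUSY) {
		atomic_store(&sketch_busy_seen, 1);
		/* The kernel code calls unlock_page(page) at this point. */
		if (wait_for_holder && pthread_mutex_lock(&sketch_lock) == 0)
			pthread_mutex_unlock(&sketch_lock);
		ret = SKETCH_RETRY;
	}
	return ret;
}

/* Plays the role of the page-fault path that retries the read. */
static void *sketch_reader_fn(void *arg)
{
	struct sketch_reader *r = arg;
	int ret;

	while ((ret = sketch_lock_with_page(r->wait_for_holder)) == SKETCH_RETRY)
		r->retries++;
	if (ret == 0)
		pthread_mutex_unlock(&sketch_lock);
	return NULL;
}

/* Hold the lock for about 20 ms while a reader contends; return its retries. */
static long sketch_count_retries(int wait_for_holder)
{
	struct sketch_reader r = { wait_for_holder, 0 };
	struct timespec hold = { 0, 20 * 1000 * 1000 };
	pthread_t tid;
	int err;

	atomic_store(&sketch_busy_seen, 0);
	err = pthread_mutex_lock(&sketch_lock);
	assert(err == 0);
	(void)err;
	if (pthread_create(&tid, NULL, sketch_reader_fn, &r) != 0) {
		pthread_mutex_unlock(&sketch_lock);
		return -1;
	}
	while (!atomic_load(&sketch_busy_seen))
		sched_yield(); /* wait until the reader has hit EBUSY once */
	nanosleep(&hold, NULL);
	pthread_mutex_unlock(&sketch_lock);
	pthread_join(tid, NULL);
	return r.retries;
}
```

Called from a main() and built with -pthread, sketch_count_retries(1) returns
1, because the reader sleeps until the holder is done and then succeeds on the
next attempt. sketch_count_retries(0) returns however many times the reader
managed to spin during the hold period. That spinning is the user-space
counterpart of the retry storm described in the commit message, although it
cannot reproduce the softlockup itself since user threads are always
preemptible.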