2017-12-28 07:48:39

by Gang He

[permalink] [raw]
Subject: [PATCH v2] ocfs2: try a blocking lock before return AOP_TRUNCATED_PAGE

If we cannot take the inode lock immediately in the function
ocfs2_inode_lock_with_page() when reading a page, we should not
return directly, since doing so leads to a softlockup when the
kernel is built with CONFIG_PREEMPT unset.
Instead, take a blocking lock and unlock it immediately before
returning. This avoids wasting CPU on lots of retries, improves
fairness in acquiring the lock among multiple nodes, and increases
efficiency when the same file is modified frequently from multiple
nodes.
The softlockup crash (with /proc/sys/kernel/softlockup_panic set to 1)
looks like:
Kernel panic - not syncing: softlockup: hung tasks
CPU: 0 PID: 885 Comm: multi_mmap Tainted: G L 4.12.14-6.1-default #1
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
Call Trace:
<IRQ>
dump_stack+0x5c/0x82
panic+0xd5/0x21e
watchdog_timer_fn+0x208/0x210
? watchdog_park_threads+0x70/0x70
__hrtimer_run_queues+0xcc/0x200
hrtimer_interrupt+0xa6/0x1f0
smp_apic_timer_interrupt+0x34/0x50
apic_timer_interrupt+0x96/0xa0
</IRQ>
RIP: 0010:unlock_page+0x17/0x30
RSP: 0000:ffffaf154080bc88 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
RAX: dead000000000100 RBX: fffff21e009f5300 RCX: 0000000000000004
RDX: dead0000000000ff RSI: 0000000000000202 RDI: fffff21e009f5300
RBP: 0000000000000000 R08: 0000000000000000 R09: ffffaf154080bb00
R10: ffffaf154080bc30 R11: 0000000000000040 R12: ffff993749a39518
R13: 0000000000000000 R14: fffff21e009f5300 R15: fffff21e009f5300
ocfs2_inode_lock_with_page+0x25/0x30 [ocfs2]
ocfs2_readpage+0x41/0x2d0 [ocfs2]
? pagecache_get_page+0x30/0x200
filemap_fault+0x12b/0x5c0
? recalc_sigpending+0x17/0x50
? __set_task_blocked+0x28/0x70
? __set_current_blocked+0x3d/0x60
ocfs2_fault+0x29/0xb0 [ocfs2]
__do_fault+0x1a/0xa0
__handle_mm_fault+0xbe8/0x1090
handle_mm_fault+0xaa/0x1f0
__do_page_fault+0x235/0x4b0
trace_do_page_fault+0x3c/0x110
async_page_fault+0x28/0x30
RIP: 0033:0x7fa75ded638e
RSP: 002b:00007ffd6657db18 EFLAGS: 00010287
RAX: 000055c7662fb700 RBX: 0000000000000001 RCX: 000055c7662fb700
RDX: 0000000000001770 RSI: 00007fa75e909000 RDI: 000055c7662fb700
RBP: 0000000000000003 R08: 000000000000000e R09: 0000000000000000
R10: 0000000000000483 R11: 00007fa75ded61b0 R12: 00007fa75e90a770
R13: 000000000000000e R14: 0000000000001770 R15: 0000000000000000

As for the performance improvement, the test time is reduced and CPU
utilization decreases; the detailed data is as follows.
I ran the multi_mmap test case from the ocfs2-test package in a three-node cluster.
Before applying this patch:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2754 ocfs2te+ 20 0 170248 6980 4856 D 80.73 0.341 0:18.71 multi_mmap
1505 root rt 0 222236 123060 97224 S 2.658 6.015 0:01.44 corosync
5 root 20 0 0 0 0 S 1.329 0.000 0:00.19 kworker/u8:0
95 root 20 0 0 0 0 S 1.329 0.000 0:00.25 kworker/u8:1
2728 root 20 0 0 0 0 S 0.997 0.000 0:00.24 jbd2/sda1-33
2721 root 20 0 0 0 0 S 0.664 0.000 0:00.07 ocfs2dc-3C8CFD4
2750 ocfs2te+ 20 0 142976 4652 3532 S 0.664 0.227 0:00.28 mpirun

ocfs2test@tb-node2:~>multiple_run.sh -i ens3 -k ~/linux-4.4.21-69.tar.gz -o
~/ocfs2mullog -C hacluster -s pcmk -n tb-node2,tb-node1,tb-node3 -d
/dev/sda1 -b 4096 -c 32768 -t multi_mmap /mnt/shared
Tests with "-b 4096 -C 32768"
Thu Dec 28 14:44:52 CST 2017
multi_mmap..................................................Passed.
Runtime 783 seconds.

After applying this patch:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2508 ocfs2te+ 20 0 170248 6804 4680 R 54.00 0.333 0:55.37 multi_mmap
155 root 20 0 0 0 0 S 2.667 0.000 0:01.20 kworker/u8:3
95 root 20 0 0 0 0 S 2.000 0.000 0:01.58 kworker/u8:1
2504 ocfs2te+ 20 0 142976 4604 3480 R 1.667 0.225 0:01.65 mpirun
5 root 20 0 0 0 0 S 1.000 0.000 0:01.36 kworker/u8:0
2482 root 20 0 0 0 0 S 1.000 0.000 0:00.86 jbd2/sda1-33
299 root 0 -20 0 0 0 S 0.333 0.000 0:00.13 kworker/2:1H
335 root 0 -20 0 0 0 S 0.333 0.000 0:00.17 kworker/1:1H
535 root 20 0 12140 7268 1456 S 0.333 0.355 0:00.34 haveged
1282 root rt 0 222284 123108 97224 S 0.333 6.017 0:01.33 corosync

ocfs2test@tb-node2:~>multiple_run.sh -i ens3 -k ~/linux-4.4.21-69.tar.gz -o
~/ocfs2mullog -C hacluster -s pcmk -n tb-node2,tb-node1,tb-node3 -d
/dev/sda1 -b 4096 -c 32768 -t multi_mmap /mnt/shared
Tests with "-b 4096 -C 32768"
Thu Dec 28 15:04:12 CST 2017
multi_mmap..................................................Passed.
Runtime 487 seconds.

Fixes: 1cce4df04f37 ("ocfs2: do not lock/unlock() inode DLM lock")
Signed-off-by: Gang He <[email protected]>
---
fs/ocfs2/dlmglue.c | 9 +++++++++
1 file changed, 9 insertions(+)

diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c
index 4689940..5193218 100644
--- a/fs/ocfs2/dlmglue.c
+++ b/fs/ocfs2/dlmglue.c
@@ -2486,6 +2486,15 @@ int ocfs2_inode_lock_with_page(struct inode *inode,
ret = ocfs2_inode_lock_full(inode, ret_bh, ex, OCFS2_LOCK_NONBLOCK);
if (ret == -EAGAIN) {
unlock_page(page);
+ /*
+ * If we can't get inode lock immediately, we should not return
+ * directly here, since this will lead to a softlockup problem.
+ * The method is to get a blocking lock and immediately unlock
+ * before returning, this can avoid CPU resource waste due to
+ * lots of retries, and benefits fairness in getting lock.
+ */
+ if (ocfs2_inode_lock(inode, ret_bh, ex) == 0)
+ ocfs2_inode_unlock(inode, ex);
ret = AOP_TRUNCATED_PAGE;
}

--
1.8.5.6


2017-12-28 08:18:02

by alex chen

[permalink] [raw]
Subject: Re: [Ocfs2-devel] [PATCH v2] ocfs2: try a blocking lock before return AOP_TRUNCATED_PAGE

Hi Gang,

It looks good to me.

Thanks,
Alex


On 2017/12/28 15:48, Gang He wrote:
> [snip]

Reviewed-by: Alex Chen <[email protected]>


2017-12-28 09:52:41

by Joseph Qi

[permalink] [raw]
Subject: Re: [Ocfs2-devel] [PATCH v2] ocfs2: try a blocking lock before return AOP_TRUNCATED_PAGE



On 17/12/28 15:48, Gang He wrote:
> [snip]
Reviewed-by: Joseph Qi <[email protected]>


2017-12-28 10:27:16

by piaojun

[permalink] [raw]
Subject: Re: [Ocfs2-devel] [PATCH v2] ocfs2: try a blocking lock before return AOP_TRUNCATED_PAGE

LGTM

On 2017/12/28 15:48, Gang He wrote:
> [snip]
Reviewed-by: Jun Piao <[email protected]>

2018-01-05 06:31:18

by Gang He

[permalink] [raw]
Subject: Re: [Ocfs2-devel] [PATCH v2] ocfs2: try a blocking lock before return AOP_TRUNCATED_PAGE

Hi Andrew,

Happy new year.
Could you help pick up this patch? It fixes an old patch, 1cce4df04f37.
Without this patch, some multiple-node test cases trigger softlockup problems
and also make the HA communication daemon (e.g. corosync) time out, so the
node has to be fenced.

Thanks
Gang


> [snip]

2018-01-05 20:50:38

by Andrew Morton

[permalink] [raw]
Subject: Re: [Ocfs2-devel] [PATCH v2] ocfs2: try a blocking lock before return AOP_TRUNCATED_PAGE

On Thu, 04 Jan 2018 23:31:12 -0700 "Gang He" <[email protected]> wrote:

> Happy new year.
> Could you help pick up this patch? It fixes an old patch, 1cce4df04f37.
> Without this patch, some multiple-node test cases trigger softlockup problems
> and also make the HA communication daemon (e.g. corosync) time out, so the
> node has to be fenced.

I have the below queued for 4.16-rc1.

Is the problem serious enough to push this into 4.15? Should the fix
be backported into -stable kernels?

From: Gang He <[email protected]>
Subject: ocfs2: try a blocking lock before return AOP_TRUNCATED_PAGE

If we cannot take the inode lock immediately in the function
ocfs2_inode_lock_with_page() when reading a page, we should not return
directly, since doing so leads to a softlockup when the kernel is built
with CONFIG_PREEMPT unset. Instead, take a blocking lock and unlock it
immediately before returning. This avoids wasting CPU on lots of retries,
improves fairness in acquiring the lock among multiple nodes, and
increases efficiency when the same file is modified frequently from
multiple nodes.

The softlockup crash (when set /proc/sys/kernel/softlockup_panic to 1)
looks like:

Kernel panic - not syncing: softlockup: hung tasks
CPU: 0 PID: 885 Comm: multi_mmap Tainted: G L 4.12.14-6.1-default #1
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
Call Trace:
<IRQ>
dump_stack+0x5c/0x82
panic+0xd5/0x21e
watchdog_timer_fn+0x208/0x210
? watchdog_park_threads+0x70/0x70
__hrtimer_run_queues+0xcc/0x200
hrtimer_interrupt+0xa6/0x1f0
smp_apic_timer_interrupt+0x34/0x50
apic_timer_interrupt+0x96/0xa0
</IRQ>
RIP: 0010:unlock_page+0x17/0x30
RSP: 0000:ffffaf154080bc88 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
RAX: dead000000000100 RBX: fffff21e009f5300 RCX: 0000000000000004
RDX: dead0000000000ff RSI: 0000000000000202 RDI: fffff21e009f5300
RBP: 0000000000000000 R08: 0000000000000000 R09: ffffaf154080bb00
R10: ffffaf154080bc30 R11: 0000000000000040 R12: ffff993749a39518
R13: 0000000000000000 R14: fffff21e009f5300 R15: fffff21e009f5300
ocfs2_inode_lock_with_page+0x25/0x30 [ocfs2]
ocfs2_readpage+0x41/0x2d0 [ocfs2]
? pagecache_get_page+0x30/0x200
filemap_fault+0x12b/0x5c0
? recalc_sigpending+0x17/0x50
? __set_task_blocked+0x28/0x70
? __set_current_blocked+0x3d/0x60
ocfs2_fault+0x29/0xb0 [ocfs2]
__do_fault+0x1a/0xa0
__handle_mm_fault+0xbe8/0x1090
handle_mm_fault+0xaa/0x1f0
__do_page_fault+0x235/0x4b0
trace_do_page_fault+0x3c/0x110
async_page_fault+0x28/0x30
RIP: 0033:0x7fa75ded638e
RSP: 002b:00007ffd6657db18 EFLAGS: 00010287
RAX: 000055c7662fb700 RBX: 0000000000000001 RCX: 000055c7662fb700
RDX: 0000000000001770 RSI: 00007fa75e909000 RDI: 000055c7662fb700
RBP: 0000000000000003 R08: 000000000000000e R09: 0000000000000000
R10: 0000000000000483 R11: 00007fa75ded61b0 R12: 00007fa75e90a770
R13: 000000000000000e R14: 0000000000001770 R15: 0000000000000000

Regarding the performance improvement, the test time is reduced and CPU
utilization decreases; the detailed data is as follows. I ran the
multi_mmap test case from the ocfs2-test package on a three-node cluster.

Before applying this patch:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2754 ocfs2te+ 20 0 170248 6980 4856 D 80.73 0.341 0:18.71 multi_mmap
1505 root rt 0 222236 123060 97224 S 2.658 6.015 0:01.44 corosync
5 root 20 0 0 0 0 S 1.329 0.000 0:00.19 kworker/u8:0
95 root 20 0 0 0 0 S 1.329 0.000 0:00.25 kworker/u8:1
2728 root 20 0 0 0 0 S 0.997 0.000 0:00.24 jbd2/sda1-33
2721 root 20 0 0 0 0 S 0.664 0.000 0:00.07 ocfs2dc-3C8CFD4
2750 ocfs2te+ 20 0 142976 4652 3532 S 0.664 0.227 0:00.28 mpirun

ocfs2test@tb-node2:~>multiple_run.sh -i ens3 -k ~/linux-4.4.21-69.tar.gz -o
~/ocfs2mullog -C hacluster -s pcmk -n tb-node2,tb-node1,tb-node3 -d
/dev/sda1 -b 4096 -c 32768 -t multi_mmap /mnt/shared
Tests with "-b 4096 -C 32768"
Thu Dec 28 14:44:52 CST 2017
multi_mmap..................................................Passed.
Runtime 783 seconds.

After applying this patch:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2508 ocfs2te+ 20 0 170248 6804 4680 R 54.00 0.333 0:55.37 multi_mmap
155 root 20 0 0 0 0 S 2.667 0.000 0:01.20 kworker/u8:3
95 root 20 0 0 0 0 S 2.000 0.000 0:01.58 kworker/u8:1
2504 ocfs2te+ 20 0 142976 4604 3480 R 1.667 0.225 0:01.65 mpirun
5 root 20 0 0 0 0 S 1.000 0.000 0:01.36 kworker/u8:0
2482 root 20 0 0 0 0 S 1.000 0.000 0:00.86 jbd2/sda1-33
299 root 0 -20 0 0 0 S 0.333 0.000 0:00.13 kworker/2:1H
335 root 0 -20 0 0 0 S 0.333 0.000 0:00.17 kworker/1:1H
535 root 20 0 12140 7268 1456 S 0.333 0.355 0:00.34 haveged
1282 root rt 0 222284 123108 97224 S 0.333 6.017 0:01.33 corosync

ocfs2test@tb-node2:~>multiple_run.sh -i ens3 -k ~/linux-4.4.21-69.tar.gz -o
~/ocfs2mullog -C hacluster -s pcmk -n tb-node2,tb-node1,tb-node3 -d
/dev/sda1 -b 4096 -c 32768 -t multi_mmap /mnt/shared
Tests with "-b 4096 -C 32768"
Thu Dec 28 15:04:12 CST 2017
multi_mmap..................................................Passed.
Runtime 487 seconds.

Link: http://lkml.kernel.org/r/[email protected]
Fixes: 1cce4df04f37 ("ocfs2: do not lock/unlock() inode DLM lock")
Signed-off-by: Gang He <[email protected]>
Reviewed-by: Eric Ren <[email protected]>
Acked-by: alex chen <[email protected]>
Acked-by: piaojun <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Junxiao Bi <[email protected]>
Cc: Joseph Qi <[email protected]>
Cc: Changwei Ge <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---

fs/ocfs2/dlmglue.c | 9 +++++++++
1 file changed, 9 insertions(+)

diff -puN fs/ocfs2/dlmglue.c~ocfs2-try-a-blocking-lock-before-return-aop_truncated_page fs/ocfs2/dlmglue.c
--- a/fs/ocfs2/dlmglue.c~ocfs2-try-a-blocking-lock-before-return-aop_truncated_page
+++ a/fs/ocfs2/dlmglue.c
@@ -2529,6 +2529,15 @@ int ocfs2_inode_lock_with_page(struct in
ret = ocfs2_inode_lock_full(inode, ret_bh, ex, OCFS2_LOCK_NONBLOCK);
if (ret == -EAGAIN) {
unlock_page(page);
+ /*
+ * If we can't get inode lock immediately, we should not return
+ * directly here, since this will lead to a softlockup problem.
+ * The method is to get a blocking lock and immediately unlock
+ * before returning, this can avoid CPU resource waste due to
+ * lots of retries, and benefits fairness in getting lock.
+ */
+ if (ocfs2_inode_lock(inode, ret_bh, ex) == 0)
+ ocfs2_inode_unlock(inode, ex);
ret = AOP_TRUNCATED_PAGE;
}

_

2018-01-06 02:46:11

by Gang He

[permalink] [raw]
Subject: Re: [Ocfs2-devel] [PATCH v2] ocfs2: try a blocking lock before return AOP_TRUNCATED_PAGE

Hi Andrew,


>>> Andrew Morton <[email protected]> 01/06/18 4:50 AM >>>
On Thu, 04 Jan 2018 23:31:12 -0700 "Gang He" <[email protected]> wrote:

> Happy new year.
> Could you help to pick up this patch, which fixes an old patch, 1cce4df04f37.
> Without this patch, some multiple-node test cases will trigger softlockup problems,
> and also make the HA communication daemon (e.g. corosync) time out, so the node has to be fenced.

I have the below queued for 4.16-rc1.

Is the problem serious enough to push this into 4.15?

If possible, please do that, since it can cause a system crash or node
fencing in some test cases.


Should the fix be backported into -stable kernels?
Yes, I feel it can be considered a regression.
Thanks a lot.
Gang

From: Gang He <[email protected]>
Subject: ocfs2: try a blocking lock before return AOP_TRUNCATED_PAGE

If we can't get the inode lock immediately in the function
ocfs2_inode_lock_with_page() when reading a page, we should not return
directly here, since this will lead to a softlockup problem when the
kernel is configured with CONFIG_PREEMPT not set. Instead, we take a
blocking lock and immediately unlock it before returning; this avoids
wasting CPU on lots of retries, improves fairness in acquiring the lock
among multiple nodes, and increases efficiency when the same file is
modified frequently from multiple nodes.

The softlockup crash (when set /proc/sys/kernel/softlockup_panic to 1)
looks like:

Kernel panic - not syncing: softlockup: hung tasks
CPU: 0 PID: 885 Comm: multi_mmap Tainted: G L 4.12.14-6.1-default #1
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
Call Trace:
<IRQ>
dump_stack+0x5c/0x82
panic+0xd5/0x21e
watchdog_timer_fn+0x208/0x210
? watchdog_park_threads+0x70/0x70
__hrtimer_run_queues+0xcc/0x200
hrtimer_interrupt+0xa6/0x1f0
smp_apic_timer_interrupt+0x34/0x50
apic_timer_interrupt+0x96/0xa0
</IRQ>
RIP: 0010:unlock_page+0x17/0x30
RSP: 0000:ffffaf154080bc88 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
RAX: dead000000000100 RBX: fffff21e009f5300 RCX: 0000000000000004
RDX: dead0000000000ff RSI: 0000000000000202 RDI: fffff21e009f5300
RBP: 0000000000000000 R08: 0000000000000000 R09: ffffaf154080bb00
R10: ffffaf154080bc30 R11: 0000000000000040 R12: ffff993749a39518
R13: 0000000000000000 R14: fffff21e009f5300 R15: fffff21e009f5300
ocfs2_inode_lock_with_page+0x25/0x30 [ocfs2]
ocfs2_readpage+0x41/0x2d0 [ocfs2]
? pagecache_get_page+0x30/0x200
filemap_fault+0x12b/0x5c0
? recalc_sigpending+0x17/0x50
? __set_task_blocked+0x28/0x70
? __set_current_blocked+0x3d/0x60
ocfs2_fault+0x29/0xb0 [ocfs2]
__do_fault+0x1a/0xa0
__handle_mm_fault+0xbe8/0x1090
handle_mm_fault+0xaa/0x1f0
__do_page_fault+0x235/0x4b0
trace_do_page_fault+0x3c/0x110
async_page_fault+0x28/0x30
RIP: 0033:0x7fa75ded638e
RSP: 002b:00007ffd6657db18 EFLAGS: 00010287
RAX: 000055c7662fb700 RBX: 0000000000000001 RCX: 000055c7662fb700
RDX: 0000000000001770 RSI: 00007fa75e909000 RDI: 000055c7662fb700
RBP: 0000000000000003 R08: 000000000000000e R09: 0000000000000000
R10: 0000000000000483 R11: 00007fa75ded61b0 R12: 00007fa75e90a770
R13: 000000000000000e R14: 0000000000001770 R15: 0000000000000000

Regarding the performance improvement, the test time is reduced and CPU
utilization decreases; the detailed data is as follows. I ran the
multi_mmap test case from the ocfs2-test package on a three-node cluster.

Before applying this patch:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2754 ocfs2te+ 20 0 170248 6980 4856 D 80.73 0.341 0:18.71 multi_mmap
1505 root rt 0 222236 123060 97224 S 2.658 6.015 0:01.44 corosync
5 root 20 0 0 0 0 S 1.329 0.000 0:00.19 kworker/u8:0
95 root 20 0 0 0 0 S 1.329 0.000 0:00.25 kworker/u8:1
2728 root 20 0 0 0 0 S 0.997 0.000 0:00.24 jbd2/sda1-33
2721 root 20 0 0 0 0 S 0.664 0.000 0:00.07 ocfs2dc-3C8CFD4
2750 ocfs2te+ 20 0 142976 4652 3532 S 0.664 0.227 0:00.28 mpirun

ocfs2test@tb-node2:~>multiple_run.sh -i ens3 -k ~/linux-4.4.21-69.tar.gz -o
~/ocfs2mullog -C hacluster -s pcmk -n tb-node2,tb-node1,tb-node3 -d
/dev/sda1 -b 4096 -c 32768 -t multi_mmap /mnt/shared
Tests with "-b 4096 -C 32768"
Thu Dec 28 14:44:52 CST 2017
multi_mmap..................................................Passed.
Runtime 783 seconds.

After applying this patch:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2508 ocfs2te+ 20 0 170248 6804 4680 R 54.00 0.333 0:55.37 multi_mmap
155 root 20 0 0 0 0 S 2.667 0.000 0:01.20 kworker/u8:3
95 root 20 0 0 0 0 S 2.000 0.000 0:01.58 kworker/u8:1
2504 ocfs2te+ 20 0 142976 4604 3480 R 1.667 0.225 0:01.65 mpirun
5 root 20 0 0 0 0 S 1.000 0.000 0:01.36 kworker/u8:0
2482 root 20 0 0 0 0 S 1.000 0.000 0:00.86 jbd2/sda1-33
299 root 0 -20 0 0 0 S 0.333 0.000 0:00.13 kworker/2:1H
335 root 0 -20 0 0 0 S 0.333 0.000 0:00.17 kworker/1:1H
535 root 20 0 12140 7268 1456 S 0.333 0.355 0:00.34 haveged
1282 root rt 0 222284 123108 97224 S 0.333 6.017 0:01.33 corosync

ocfs2test@tb-node2:~>multiple_run.sh -i ens3 -k ~/linux-4.4.21-69.tar.gz -o
~/ocfs2mullog -C hacluster -s pcmk -n tb-node2,tb-node1,tb-node3 -d
/dev/sda1 -b 4096 -c 32768 -t multi_mmap /mnt/shared
Tests with "-b 4096 -C 32768"
Thu Dec 28 15:04:12 CST 2017
multi_mmap..................................................Passed.
Runtime 487 seconds.

Link: http://lkml.kernel.org/r/[email protected]
Fixes: 1cce4df04f37 ("ocfs2: do not lock/unlock() inode DLM lock")
Signed-off-by: Gang He <[email protected]>
Reviewed-by: Eric Ren <[email protected]>
Acked-by: alex chen <[email protected]>
Acked-by: piaojun <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Junxiao Bi <[email protected]>
Cc: Joseph Qi <[email protected]>
Cc: Changwei Ge <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---

fs/ocfs2/dlmglue.c | 9 +++++++++
1 file changed, 9 insertions(+)

diff -puN fs/ocfs2/dlmglue.c~ocfs2-try-a-blocking-lock-before-return-aop_truncated_page fs/ocfs2/dlmglue.c
--- a/fs/ocfs2/dlmglue.c~ocfs2-try-a-blocking-lock-before-return-aop_truncated_page
+++ a/fs/ocfs2/dlmglue.c
@@ -2529,6 +2529,15 @@ int ocfs2_inode_lock_with_page(struct in
ret = ocfs2_inode_lock_full(inode, ret_bh, ex, OCFS2_LOCK_NONBLOCK);
if (ret == -EAGAIN) {
unlock_page(page);
+ /*
+ * If we can't get inode lock immediately, we should not return
+ * directly here, since this will lead to a softlockup problem.
+ * The method is to get a blocking lock and immediately unlock
+ * before returning, this can avoid CPU resource waste due to
+ * lots of retries, and benefits fairness in getting lock.
+ */
+ if (ocfs2_inode_lock(inode, ret_bh, ex) == 0)
+ ocfs2_inode_unlock(inode, ex);
ret = AOP_TRUNCATED_PAGE;
}

_