2022-03-21 22:14:36

by Zi Yan

[permalink] [raw]
Subject: [RFC PATCH 4/5] mm: truncate: split huge page cache page to a non-zero order if possible.

From: Zi Yan <[email protected]>

To minimize the number of pages after a huge page truncation, we do not
need to split it all the way down to order-0. The huge page has at most
three parts, the part before offset, the part to be truncated, the part
remaining at the end. Find the greatest common power of two multiplier of
the non-zero values of them as the new order, so we can split the huge
page to this order and keep the remaining pages as large and as few as
possible.

Signed-off-by: Zi Yan <[email protected]>
---
mm/huge_memory.c | 1 +
mm/truncate.c | 33 +++++++++++++++++++++++++++++++--
2 files changed, 32 insertions(+), 2 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 3617aa3ad0b1..76db0092a1e2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2349,6 +2349,7 @@ static void __split_huge_page_tail(struct page *head, int tail,
prep_compound_page(page_tail, new_order);
prep_transhuge_page(page_tail);
}
+ VM_BUG_ON_PAGE(PageTail(page_tail), page_tail);

/* Finally unfreeze refcount. Additional reference from page cache. */
page_ref_unfreeze(page_tail, 1 + ((!PageAnon(head) ||
diff --git a/mm/truncate.c b/mm/truncate.c
index ab50d0d59a2a..4f71e67dec09 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -197,6 +197,14 @@ int truncate_inode_folio(struct address_space *mapping, struct folio *folio)
return 0;
}

+static unsigned int greatest_pow_of_two_multiplier(unsigned int num)
+{
+ if (num & 1)
+ return 0;
+ return min_t(unsigned int, ilog2(num),
+ ilog2(num - rounddown_pow_of_two(num)));
+}
+
/*
* Handle partial folios. The folio may be entirely within the
* range if a split has raced with us. If not, we zero the part of the
@@ -211,7 +219,8 @@ int truncate_inode_folio(struct address_space *mapping, struct folio *folio)
bool truncate_inode_partial_folio(struct folio *folio, loff_t start, loff_t end)
{
loff_t pos = folio_pos(folio);
- unsigned int offset, length;
+ unsigned int offset, length, remaining;
+ unsigned int new_order = folio_order(folio);

if (pos < start)
offset = start - pos;
@@ -222,6 +231,7 @@ bool truncate_inode_partial_folio(struct folio *folio, loff_t start, loff_t end)
length = length - offset;
else
length = end + 1 - pos - offset;
+ remaining = folio_size(folio) - offset - length;

folio_wait_writeback(folio);
if (length == folio_size(folio)) {
@@ -236,11 +246,30 @@ bool truncate_inode_partial_folio(struct folio *folio, loff_t start, loff_t end)
*/
folio_zero_range(folio, offset, length);

+ /*
+ * Find the greatest common power of two multiplier of the non-zero
+ * offset, length, and remaining as the new order. So we can truncate
+ * a subpage as large as possible.
+ */
+ if (offset)
+ new_order = greatest_pow_of_two_multiplier(offset / PAGE_SIZE);
+ if (length)
+ new_order = min_t(unsigned int, new_order,
+ greatest_pow_of_two_multiplier(length / PAGE_SIZE));
+ if (remaining)
+ new_order = min_t(unsigned int, new_order,
+ greatest_pow_of_two_multiplier(remaining / PAGE_SIZE));
+
+ /* order-1 THP not supported, downgrade to order-0 */
+ if (new_order == 1)
+ new_order = 0;
+
+
if (folio_has_private(folio))
folio_invalidate(folio, offset, length);
if (!folio_test_large(folio))
return true;
- if (split_huge_page(&folio->page) == 0)
+ if (split_huge_page_to_list_to_order(&folio->page, NULL, new_order) == 0)
return true;
if (folio_test_dirty(folio))
return false;
--
2.35.1


2022-03-21 23:40:47

by Roman Gushchin

[permalink] [raw]
Subject: Re: [RFC PATCH 4/5] mm: truncate: split huge page cache page to a non-zero order if possible.

On Mon, Mar 21, 2022 at 10:21:27AM -0400, Zi Yan wrote:
> From: Zi Yan <[email protected]>
>
> To minimize the number of pages after a huge page truncation, we do not
> need to split it all the way down to order-0. The huge page has at most
> three parts, the part before offset, the part to be truncated, the part
> remaining at the end. Find the greatest common power of two multiplier of
> the non-zero values of them as the new order, so we can split the huge
> page to this order and keep the remaining pages as large and as few as
> possible.

Would you mind please to describe the algorithm in more details?

Thanks!

2022-03-22 18:32:32

by Zi Yan

[permalink] [raw]
Subject: Re: [RFC PATCH 4/5] mm: truncate: split huge page cache page to a non-zero order if possible.

On 21 Mar 2022, at 18:32, Roman Gushchin wrote:

> On Mon, Mar 21, 2022 at 10:21:27AM -0400, Zi Yan wrote:
>> From: Zi Yan <[email protected]>
>>
>> To minimize the number of pages after a huge page truncation, we do not
>> need to split it all the way down to order-0. The huge page has at most
>> three parts, the part before offset, the part to be truncated, the part
>> remaining at the end. Find the greatest common power of two multiplier of
>> the non-zero values of them as the new order, so we can split the huge
>> page to this order and keep the remaining pages as large and as few as
>> possible.
>
> Would you mind please to describe the algorithm in more details?

Sure.

During truncation, there can be three parts in a huge page:
1. the _offset_ from the beginning of the huge page,
2. the _length_ of the to-be-truncated part,
3. the _remaining_ part after the to-be-truncated part.

the size of the split huge page need to be the greatest common divisor
of the non-zero ones of three after being rounded down to power of two.

OK, I actually find there is a gcd function. I think the algorithm can
simplified to

new_order = ilog2(gcd(gcd(offset, length), remaining)) - PAGE_SHIFT;

I will update the code, the commit message, and the comment in the next
version.

Thank you for the comment.

--
Best Regards,
Yan, Zi


Attachments:
signature.asc (871.00 B)
OpenPGP digital signature

2022-03-23 12:30:07

by kernel test robot

[permalink] [raw]
Subject: [mm] 2757cee2d6: UBSAN:shift-out-of-bounds_in_include/linux/log2.h



Greeting,

FYI, we noticed the following commit (built with gcc-9):

commit: 2757cee2d6c6c76f672ec6566ade2dcb8c1605dd ("[RFC PATCH 4/5] mm: truncate: split huge page cache page to a non-zero order if possible.")
url: https://github.com/0day-ci/linux/commits/Zi-Yan/Split-a-huge-page-to-any-lower-order-pages/20220321-222304
base: https://github.com/hnaz/linux-mm master
patch link: https://lore.kernel.org/linux-mm/[email protected]

in testcase: boot

on test machine: qemu-system-x86_64 -enable-kvm -cpu Icelake-Server -smp 4 -m 16G

caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):



If you fix the issue, kindly add following tag
Reported-by: kernel test robot <[email protected]>


[ 25.633813][ T275] UBSAN: shift-out-of-bounds in include/linux/log2.h:67:13
[ 25.635404][ T275] shift exponent 4294967295 is too large for 32-bit type 'long unsigned int'
[ 25.636781][ T275] CPU: 3 PID: 275 Comm: cron Not tainted 5.17.0-rc8-mm1-00485-g2757cee2d6c6 #1
[ 25.638246][ T275] Call Trace:
[ 25.638768][ T275] dump_stack_lvl (lib/dump_stack.c:107)
[ 25.651568][ T275] dump_stack (lib/dump_stack.c:114)
[ 25.652209][ T275] ubsan_epilogue (lib/ubsan.c:152)
[ 25.652926][ T275] __ubsan_handle_shift_out_of_bounds.cold (arch/x86/include/asm/smap.h:85)
[ 25.654034][ T275] ? lock_release (kernel/locking/lockdep.c:5348 kernel/locking/lockdep.c:5692)
[ 25.654781][ T275] ? __kmap_local_pfn_prot (mm/highmem.c:532)
[ 25.655650][ T275] ? kunmap_local_indexed (mm/highmem.c:600 (discriminator 3))
[ 25.656382][ T275] ? zero_user_segments (mm/highmem.c:408)
[ 25.657110][ T275] greatest_pow_of_two_multiplier.cold (include/linux/log2.h:67 mm/truncate.c:204)
[ 25.658134][ T275] truncate_inode_partial_folio (mm/truncate.c:255)
g System Logging[ 25.659902][ T275] shmem_undo_range (mm/shmem.c:966)
Service...
[ 25.660789][ T275] ? zero_user_segments (mm/highmem.c:408)
[ 25.661744][ T275] ? folio_mark_dirty (mm/page-writeback.c:2717)
[ 25.662480][ T275] ? unlock_page (mm/folio-compat.c:21)
[ 25.663179][ T275] ? __lock_acquire (kernel/locking/lockdep.c:5060)
[ 25.668235][ T275] shmem_truncate_range (mm/shmem.c:1045)
[ 25.668251][ T275] ? setattr_prepare (fs/attr.c:108)
[ 25.668256][ T275] ? mark_held_locks (kernel/locking/lockdep.c:4239)
Startin[ 25.668262][ T275] shmem_setattr (mm/shmem.c:1109)
g /etc/rc.local [ 25.675802][ T275] ? shmem_setattr (mm/shmem.c:1109)
Compatibility...[ 25.676621][ T275] ? current_time (fs/inode.c:2406)

[ 25.677351][ T275] notify_change (fs/attr.c:414)
[ 25.677989][ T275] ? shmem_evict_inode (mm/shmem.c:1077)
[ 25.678735][ T275] ? notify_change (fs/attr.c:414)
[ 25.679481][ T275] ? lock_acquire (kernel/locking/lockdep.c:467 kernel/locking/lockdep.c:5674)
[ 25.680186][ T275] do_truncate (fs/open.c:67)
[ 25.680843][ T275] ? do_truncate (fs/open.c:67)
[ 25.681520][ T275] ? do_truncate (fs/open.c:67)
[ 25.682230][ T275] do_sys_ftruncate (fs/open.c:197)
[ 25.683001][ T275] __ia32_sys_ftruncate (fs/open.c:204)
[ 25.684027][ T275] __do_fast_syscall_32 (arch/x86/entry/common.c:112 arch/x86/entry/common.c:178)
[ 25.684801][ T275] ? irqentry_exit_to_user_mode (kernel/entry/common.c:324)
[ 25.685720][ T275] do_fast_syscall_32 (arch/x86/entry/common.c:203)
[ 25.686464][ T275] do_SYSENTER_32 (arch/x86/entry/common.c:247)
[ 25.687172][ T275] entry_SYSENTER_32 (arch/x86/entry/entry_32.S:869)
[ 25.687180][ T275] EIP: 0xa7f00549
[ 25.687184][ T275] Code: 03 74 c0 01 10 05 03 74 b8 01 10 06 03 74 b4 01 10 07 03 74 b0 01 10 08 03 74 d8 01 00 00 00 00 00 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 8d 76 00 58 b8 77 00 00 00 cd 80 90 8d 76
All code
========
0: 03 74 c0 01 add 0x1(%rax,%rax,8),%esi
4: 10 05 03 74 b8 01 adc %al,0x1b87403(%rip) # 0x1b8740d
a: 10 06 adc %al,(%rsi)
c: 03 74 b4 01 add 0x1(%rsp,%rsi,4),%esi
10: 10 07 adc %al,(%rdi)
12: 03 74 b0 01 add 0x1(%rax,%rsi,4),%esi
16: 10 08 adc %cl,(%rax)
18: 03 74 d8 01 add 0x1(%rax,%rbx,8),%esi
1c: 00 00 add %al,(%rax)
1e: 00 00 add %al,(%rax)
20: 00 51 52 add %dl,0x52(%rcx)
23: 55 push %rbp
24: 89 e5 mov %esp,%ebp
26: 0f 34 sysenter
28: cd 80 int $0x80
2a:* 5d pop %rbp <-- trapping instruction
2b: 5a pop %rdx
2c: 59 pop %rcx
2d: c3 retq
2e: 90 nop
2f: 90 nop
30: 90 nop
31: 90 nop
32: 8d 76 00 lea 0x0(%rsi),%esi
35: 58 pop %rax
36: b8 77 00 00 00 mov $0x77,%eax
3b: cd 80 int $0x80
3d: 90 nop
3e: 8d .byte 0x8d
3f: 76 .byte 0x76

Code starting with the faulting instruction
===========================================
0: 5d pop %rbp
1: 5a pop %rdx
2: 59 pop %rcx
3: c3 retq
4: 90 nop
5: 90 nop
6: 90 nop
7: 90 nop
8: 8d 76 00 lea 0x0(%rsi),%esi
b: 58 pop %rax
c: b8 77 00 00 00 mov $0x77,%eax
11: cd 80 int $0x80
13: 90 nop
14: 8d .byte 0x8d
15: 76 .byte 0x76
[ 25.687187][ T275] EAX: ffffffda EBX: 00000003 ECX: 00000004 EDX: 00453000
[ 25.687190][ T275] ESI: 00000004 EDI: 00000002 EBP: 0207e168 ESP: aff7174c
[ 25.687193][ T275] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00000292
[ 25.687280][ T275] ================================================================================
Starting LKP bootstrap...
[ OK ] Started D-Bus System Message Bus.
[ 25.719619] rc.local[280]: mkdir: cannot create directory '/var/lock/lkp-bootstrap.lock': File exists
[ OK ] Started Daily Cleanup of Temporary Directories.
Starting Login Service...
[ OK ] Started Daily apt upgrade and clean activities.
[ OK ] Reached target Timers.
Starting LSB: Execute the kexec -e command to reboot system...
Starting OpenBSD Secure Shell server...
Starting Permit User Sessions...
[ OK ] Started LKP bootstrap.
[ OK ] Started Permit User Sessions.
[ OK ] Started LSB: Start and stop bmc-watchdog.
[ OK ] Started OpenBSD Secure Shell server.
[ OK ] Started LSB: Execute the kexec -e command to reboot system.
[ OK ] Started Login Service.
Starting LSB: Load kernel image with kexec...
LKP: ttyS0: 290: Kernel tests: Boot OK!
LKP: ttyS0: 290: HOSTNAME vm-icl-74, MAC 52:54:00:12:34:56, kernel 5.17.0-rc8-mm1-00485-g2757cee2d6c6 1
[ OK ] Reached target Sound Card.
[ OK ] Reached target Printer.
[ OK ] Started LSB: Load kernel image with kexec.
LKP: ttyS0: 290: /lkp/lkp/src/bin/run-lkp /lkp/jobs/scheduled/vm-icl-74/boot-1-debian-i386-20191205.cgz-2757cee2d6c6c76f672ec6566ade2dcb8c1605dd-20220322-39641-na3mp9-2.yaml
[ 32.682503][ T516] wget (516) used greatest stack depth: 5812 bytes left
[ OK ] Started System Logging Service.
[ 36.389049][ T556] rsync (556) used greatest stack depth: 5772 bytes left
[ 37.653297][ T315] LKP: stdout: 290: Kernel tests: Boot OK!
[ 37.653316][ T315]
[ 40.012523][ T605] wget (605) used greatest stack depth: 5732 bytes left
LKP: ttyS0: 290: LKP: rebooting forcely
[ 42.275963][ T315] LKP: stdout: 290: HOSTNAME vm-icl-74, MAC 52:54:00:12:34:56, kernel 5.17.0-rc8-mm1-00485-g2757cee2d6c6 1
[ 42.275991][ T315]
[ 42.291539][ T315] install debs round one: dpkg -i --force-confdef --force-depends /opt/deb/gawk_1%3a4.1.4+dfsg-1_i386.deb
[ 42.291562][ T315]
[ 42.297409][ T315] Selecting previously unselected package gawk.
[ 42.297426][ T315]
[ 42.314957][ T315] (Reading database ... 16210 files and directories currently installed.)
[ 42.314974][ T315]
[ 42.319689][ T315] Preparing to unpack .../gawk_1%3a4.1.4+dfsg-1_i386.deb ...
[ 42.319710][ T315]
[ 42.323192][ T315] Unpacking gawk (1:4.1.4+dfsg-1) ...
[ 42.323208][ T315]
[ 42.326270][ T315] Setting up gawk (1:4.1.4+dfsg-1) ...
[ 42.326285][ T315]
[ 42.941438][ T290] sysrq: Emergency Sync
[ 42.942418][ T36] Emergency Sync complete
[ 42.943081][ T290] sysrq: Resetting

Kboot worker: lkp-worker04
Elapsed time: 60

kvm=(
qemu-system-x86_64
-enable-kvm


To reproduce:

# build kernel
cd linux
cp config-5.17.0-rc8-mm1-00485-g2757cee2d6c6 .config
make HOSTCC=gcc-9 CC=gcc-9 ARCH=i386 olddefconfig prepare modules_prepare bzImage modules
make HOSTCC=gcc-9 CC=gcc-9 ARCH=i386 INSTALL_MOD_PATH=<mod-install-dir> modules_install
cd <mod-install-dir>
find lib/ | cpio -o -H newc --quiet | gzip > modules.cgz


git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp qemu -k <bzImage> -m modules.cgz job-script # job-script is attached in this email

# if come across any failure that blocks the test,
# please remove ~/.lkp and /lkp dir to run from a clean state.



--
0-DAY CI Kernel Test Service
https://01.org/lkp



Attachments:
(No filename) (9.40 kB)
config-5.17.0-rc8-mm1-00485-g2757cee2d6c6 (128.51 kB)
job-script (4.80 kB)
dmesg.xz (15.56 kB)
Download all attachments