2022-08-11 23:18:50

by Zi Yan

Subject: [RFC PATCH v2 00/12] Make MAX_ORDER adjustable as a kernel boot time parameter.

From: Zi Yan <[email protected]>

Hi all,

This patchset adds support for a kernel boot time adjustable MAX_ORDER, so that
users can change the largest page size the buddy allocator can allocate.
It is on top of mm-everything-2022-08-11-02-10.

Motivation
===

This enables the kernel to allocate 1GB pages and is necessary for my ongoing work
on adding support for 1GB PUD THP[1]. This is also the conclusion I came to
after some discussion with David Hildenbrand on what methods should be used for
allocating gigantic pages[2], since other approaches like using the CMA allocator or
alloc_contig_pages() are regarded as suboptimal.

In addition, making MAX_ORDER a kernel boot time parameter lets users
adjust the buddy allocator for their own needs without recompiling the kernel, so
that one can still have a small MAX_ORDER if there is no need to allocate
gigantic pages like 1GB PUD THPs.

Background
===

At the moment, the kernel imposes the MAX_ORDER - 1 + PAGE_SHIFT < SECTION_SIZE_BITS
restriction. This prevents the buddy allocator from merging pages across memory
sections, as PFNs might not be contiguous and code like page++ would fail. But this
would not be an issue when SPARSEMEM_VMEMMAP is set, since all struct pages are
then virtually contiguous. So boot time adjustable MAX_ORDER depends on
SPARSEMEM_VMEMMAP.
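
As a rough illustration of the size arithmetic behind this restriction, the
sketch below computes how many bytes one buddy block of a given order covers
and whether that block is larger than a memory section. The helper names are
hypothetical, and the values used in the note after it (PAGE_SHIFT = 12,
SECTION_SIZE_BITS = 27) are assumptions matching x86_64 with 4KB base pages;
other architectures and configurations differ.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: bytes covered by one buddy block of the given order. */
uint64_t block_bytes(unsigned int order, unsigned int page_shift)
{
	return (uint64_t)1 << (order + page_shift);
}

/* Hypothetical helper: true if a block of this order is larger than a section. */
bool block_exceeds_section(unsigned int order, unsigned int page_shift,
			   unsigned int section_size_bits)
{
	return order + page_shift > section_size_bits;
}
```

Under these assumed values, an order-10 block is 4MB and stays inside a 128MB
section, while an order-18 block is 1GB and spans eight sections, which is
the case that needs virtually contiguous struct pages.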

Description
===

I tested the patchset on both x86_64 and ARM64 at 4KB base pages. The systems
boot and run. It definitely needs more tests and reviews.

Regarding concerns about performance degradation when MAX_ORDER is increased,
I ran vm-scalability from lkp comparing the current system, my patchset with
MAX_ORDER=11 and my patchset with MAX_ORDER=20 on an x86_64 VM and saw
almost no performance difference; please see the vm-scalability reports below.

Patch 1 renames FORCE_MAX_ZONEORDER to ARCH_FORCE_MAX_ORDER for a more
precise description.

Patch 2 changes MAX_ORDER to represent the max order of pages allocated
by the buddy allocator. Right now MAX_ORDER - 1 represents that, which is
confusing. Suggested by Vlastimil Babka.

Patch 3 replaces MAX_ORDER with MAX_PHYS_CONTIG_ORDER when it is used to
indicate the maximum number of physically contiguous pages.

Patch 4 fixes deferred struct page initialization when MAX_ORDER is
bigger than a memory section size.

Patches 5-8 convert uses of MAX_ORDER to pageblock_order, since
pageblock_order stays constant when MAX_ORDER can be changed at boot time
and is close to the current MAX_ORDER value. I split the changes into separate
patches for easy review and can merge them into a single one if that works better.

Patch 9 adds a new Kconfig option SET_MAX_ORDER to allow specifying MAX_ORDER
when ARCH_FORCE_MAX_ORDER is not used by the arch, like x86_64.

Patch 10 converts statically allocated arrays with MAX_ORDER length to dynamic
ones if possible and prepares for making MAX_ORDER a boot time parameter.

Patch 11 adds a new MIN_MAX_ORDER constant to replace the soon-to-be-dynamic
MAX_ORDER in places where converting a static array to a dynamic one is a
hassle and not necessary, i.e., ARM64 hypervisor page allocation and SLAB.

Patch 12 changes MAX_ORDER to be a kernel boot time parameter and it is
opt-in as an mm/Kconfig option.
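
To make the static-to-dynamic array conversion described for Patch 10 more
concrete, here is a minimal userspace sketch. It is not code from the
patchset: struct zone_sketch, struct order_stats and both functions are
hypothetical names, and calloc()/free() stand in for whatever kernel
allocator the real conversion uses.

```c
#include <stdlib.h>

/* Stand-in for a per-order bookkeeping entry; not the kernel's struct free_area. */
struct order_stats {
	unsigned long nr_free;
};

/* Hypothetical container whose per-order array is sized at run time. */
struct zone_sketch {
	unsigned int max_order;        /* largest order, chosen at "boot" */
	struct order_stats *per_order; /* max_order + 1 entries: orders 0..max_order */
};

/* Allocate one zeroed entry per order; returns 0 on success, -1 on failure. */
int zone_sketch_init(struct zone_sketch *z, unsigned int max_order)
{
	z->per_order = calloc((size_t)max_order + 1, sizeof(*z->per_order));
	if (!z->per_order)
		return -1;
	z->max_order = max_order;
	return 0;
}

/* Release the per-order array and clear the pointer. */
void zone_sketch_destroy(struct zone_sketch *z)
{
	free(z->per_order);
	z->per_order = NULL;
}
```

The trade-off this sketch shows is that the array length is no longer a
compile time constant, so every user has to go through a pointer and the
allocation can fail, which is presumably why a compile time MIN_MAX_ORDER is
kept for places where that is not worth it.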


Any suggestions and/or comments are welcome. Thanks.


[1] https://lore.kernel.org/linux-mm/[email protected]/
[2] https://lore.kernel.org/linux-mm/[email protected]/

Performance comparison
====

Only the changed stats are shown below. If a stat is not listed,
it is the same across all three kernels.
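
As a reading aid for the tables, the sketch below shows how the %change
columns appear to be derived. This is inferred from the numbers in the
tables, not taken from the tool's documentation: rows holding raw values
seem to show a relative change against the first kernel, while rows that are
already percentages (such as stddev%) seem to show a plain difference in
percentage points. The function names are hypothetical.

```c
/* Relative change of a measured value against the baseline, in percent. */
double pct_change(double base, double value)
{
	return (value - base) / base * 100.0;
}

/* Change of a quantity that is already a percentage, in percentage points. */
double point_change(double base_pct, double value_pct)
{
	return value_pct - base_pct;
}
```

For example, pct_change(1266004, 1262674) is about -0.26, in line with the
-0.3% shown for small-allocs, and point_change(8.58, 11.95) is about 3.37, in
line with the +3.4 shown for stddev% in mmap-pread-rand.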

=========================================================================================
compiler/kconfig/rootfs/runtime/tbox_group/test/testcase:
gcc-11/defconfig/debian/300s/qemu-vm/small-allocs/vm-scalability

commit:
5.19.0-rc4-mm-everything+
5.19.0-rc4-boot-time-max-order-10+
5.19.0-rc4-boot-time-max-order-20+

5.19.0-rc4-mm-ev 5.19.0-rc4-boot-time-max-or 5.19.0-rc4-boot-time-max-or
---------------- --------------------------- ---------------------------
%stddev %change %stddev %change %stddev
\ | \ | \
1266004 -0.3% 1262674 +1.1% 1279441 vm-scalability.median

=========================================================================================
compiler/kconfig/rootfs/runtime/tbox_group/test/testcase:
gcc-11/defconfig/debian/300s/qemu-vm/small-allocs-mt/vm-scalability

commit:
5.19.0-rc4-mm-everything+
5.19.0-rc4-boot-time-max-order-10+
5.19.0-rc4-boot-time-max-order-20+

5.19.0-rc4-mm-ev 5.19.0-rc4-boot-time-max-or 5.19.0-rc4-boot-time-max-or
---------------- --------------------------- ---------------------------
%stddev %change %stddev %change %stddev
\ | \ | \
76016 +0.2% 76178 +1.2% 76936 vm-scalability.median
1216312 +0.2% 1218252 +1.2% 1231465 vm-scalability.throughput
3.653e+08 +0.2% 3.659e+08 +1.3% 3.701e+08 vm-scalability.workload

=========================================================================================
compiler/kconfig/rootfs/runtime/tbox_group/test/testcase:
gcc-11/defconfig/debian/300s/qemu-vm/mmap-xread-seq-mt/vm-scalability

commit:
5.19.0-rc4-mm-everything+
5.19.0-rc4-boot-time-max-order-10+
5.19.0-rc4-boot-time-max-order-20+

5.19.0-rc4-mm-ev 5.19.0-rc4-boot-time-max-or 5.19.0-rc4-boot-time-max-or
---------------- --------------------------- ---------------------------
%stddev %change %stddev %change %stddev
\ | \ | \
1574020 ± 2% +0.1% 1576169 -2.3% 1537232 ± 2% vm-scalability.median
25184277 ± 2% +0.1% 25218477 -2.3% 24595646 ± 2% vm-scalability.throughput
7.567e+09 ± 2% +0.1% 7.575e+09 -2.3% 7.395e+09 ± 2% vm-scalability.workload

=========================================================================================
compiler/kconfig/rootfs/runtime/tbox_group/test/testcase:
gcc-11/defconfig/debian/300s/qemu-vm/mmap-pread-rand/vm-scalability

commit:
5.19.0-rc4-mm-everything+
5.19.0-rc4-boot-time-max-order-10+
5.19.0-rc4-boot-time-max-order-20+

5.19.0-rc4-mm-ev 5.19.0-rc4-boot-time-max-or 5.19.0-rc4-boot-time-max-or
---------------- --------------------------- ---------------------------
%stddev %change %stddev %change %stddev
\ | \ | \
2.28 ± 11% -21.0% 1.80 ± 11% -18.2% 1.87 ± 11% vm-scalability.free_time
8.58 ± 9% +3.4 11.95 ± 7% +1.1 9.69 ± 13% vm-scalability.stddev%
1541489 -0.2% 1539102 +1.3% 1561678 vm-scalability.throughput

=========================================================================================
compiler/kconfig/rootfs/runtime/tbox_group/test/testcase:
gcc-11/defconfig/debian/300s/qemu-vm/mmap-pread-rand-mt/vm-scalability

commit:
5.19.0-rc4-mm-everything+
5.19.0-rc4-boot-time-max-order-10+
5.19.0-rc4-boot-time-max-order-20+

5.19.0-rc4-mm-ev 5.19.0-rc4-boot-time-max-or 5.19.0-rc4-boot-time-max-or
---------------- --------------------------- ---------------------------
%stddev %change %stddev %change %stddev
\ | \ | \
94376 +0.4% 94716 +1.8% 96103 vm-scalability.median
12.96 ± 3% +11.9 24.88 ± 80% +0.3 13.30 ± 5% vm-scalability.stddev%
1509455 +0.8% 1522093 +1.8% 1536886 vm-scalability.throughput

=========================================================================================
compiler/kconfig/rootfs/runtime/tbox_group/test/testcase:
gcc-11/defconfig/debian/300s/qemu-vm/lru-file-readtwice/vm-scalability

commit:
5.19.0-rc4-mm-everything+
5.19.0-rc4-boot-time-max-order-10+
5.19.0-rc4-boot-time-max-order-20+

5.19.0-rc4-mm-ev 5.19.0-rc4-boot-time-max-or 5.19.0-rc4-boot-time-max-or
---------------- --------------------------- ---------------------------
%stddev %change %stddev %change %stddev
\ | \ | \
433656 -5.3% 410460 ± 2% -4.8% 412737 vm-scalability.median
13879867 -5.5% 13118050 ± 2% -4.8% 13212361 vm-scalability.throughput
4.164e+09 -5.5% 3.935e+09 ± 2% -4.8% 3.964e+09 vm-scalability.workload

=========================================================================================
compiler/kconfig/rootfs/runtime/tbox_group/test/testcase:
gcc-11/defconfig/debian/300s/qemu-vm/lru-file-mmap-read/vm-scalability

commit:
5.19.0-rc4-mm-everything+
5.19.0-rc4-boot-time-max-order-10+
5.19.0-rc4-boot-time-max-order-20+

5.19.0-rc4-mm-ev 5.19.0-rc4-boot-time-max-or 5.19.0-rc4-boot-time-max-or
---------------- --------------------------- ---------------------------
%stddev %change %stddev %change %stddev
\ | \ | \
488915 ± 3% -2.3% 477658 ± 6% -11.7% 431771 ± 6% vm-scalability.median
120.69 ± 35% -39.0 81.65 ± 84% -89.1 31.56 ±154% vm-scalability.stddev%
8106774 ± 4% -3.3% 7835670 ± 7% -13.9% 6981078 ± 8% vm-scalability.throughput
2.435e+09 ± 4% -3.4% 2.353e+09 ± 7% -13.8% 2.099e+09 ± 8% vm-scalability.workload

=========================================================================================
compiler/kconfig/rootfs/runtime/tbox_group/test/testcase:
gcc-11/defconfig/debian/300s/qemu-vm/anon-rx-rand-mt/vm-scalability

commit:
5.19.0-rc4-mm-everything+
5.19.0-rc4-boot-time-max-order-10+
5.19.0-rc4-boot-time-max-order-20+

5.19.0-rc4-mm-ev 5.19.0-rc4-boot-time-max-or 5.19.0-rc4-boot-time-max-or
---------------- --------------------------- ---------------------------
%stddev %change %stddev %change %stddev
\ | \ | \
196783 -0.8% 195189 ± 2% -2.8% 191323 vm-scalability.median
53.88 ± 3% -4.9 48.96 ± 2% -43.9 9.95 ± 35% vm-scalability.stddev%

=========================================================================================
compiler/kconfig/rootfs/runtime/tbox_group/test/testcase:
gcc-11/defconfig/debian/300s/qemu-vm/anon-r-seq-mt/vm-scalability

commit:
5.19.0-rc4-mm-everything+
5.19.0-rc4-boot-time-max-order-10+
5.19.0-rc4-boot-time-max-order-20+

5.19.0-rc4-mm-ev 5.19.0-rc4-boot-time-max-or 5.19.0-rc4-boot-time-max-or
---------------- --------------------------- ---------------------------
%stddev %change %stddev %change %stddev
\ | \ | \
50.03 ± 29% -15.9 34.08 ± 12% -2.4 47.66 ± 32% vm-scalability.stddev%

=========================================================================================
compiler/kconfig/rootfs/runtime/tbox_group/test/testcase:
gcc-11/defconfig/debian/300s/qemu-vm/anon-r-rand/vm-scalability

commit:
5.19.0-rc4-mm-everything+
5.19.0-rc4-boot-time-max-order-10+
5.19.0-rc4-boot-time-max-order-20+

5.19.0-rc4-mm-ev 5.19.0-rc4-boot-time-max-or 5.19.0-rc4-boot-time-max-or
---------------- --------------------------- ---------------------------
%stddev %change %stddev %change %stddev
\ | \ | \
3.82 -0.8% 3.79 -0.6% 3.79 ± 3% vm-scalability.free_time
172116 +0.3% 172685 -2.1% 168557 vm-scalability.median
75.53 ± 12% -15.6 59.88 ± 13% -60.9 14.64 ± 17% vm-scalability.stddev%

=========================================================================================
compiler/kconfig/rootfs/runtime/size/tbox_group/test/testcase:
gcc-11/defconfig/debian/300s/8T/qemu-vm/anon-wx-seq-mt/vm-scalability

commit:
5.19.0-rc4-mm-everything+
5.19.0-rc4-boot-time-max-order-10+
5.19.0-rc4-boot-time-max-order-20+

5.19.0-rc4-mm-ev 5.19.0-rc4-boot-time-max-or 5.19.0-rc4-boot-time-max-or
---------------- --------------------------- ---------------------------
%stddev %change %stddev %change %stddev
\ | \ | \
798340 +0.8% 804861 -3.0% 774082 vm-scalability.median
1.55 ± 28% -0.1 1.47 ± 32% +0.6 2.19 ± 14% vm-scalability.median_stddev%
1.55 ± 28% -0.1 1.47 ± 32% +0.6 2.19 ± 14% vm-scalability.stddev%
12773455 +0.8% 12877783 -3.0% 12385319 vm-scalability.throughput

=========================================================================================
compiler/kconfig/rootfs/runtime/size/tbox_group/test/testcase:
gcc-11/defconfig/debian/300s/8T/qemu-vm/anon-w-seq/vm-scalability

commit:
5.19.0-rc4-mm-everything+
5.19.0-rc4-boot-time-max-order-10+
5.19.0-rc4-boot-time-max-order-20+

5.19.0-rc4-mm-ev 5.19.0-rc4-boot-time-max-or 5.19.0-rc4-boot-time-max-or
---------------- --------------------------- ---------------------------
%stddev %change %stddev %change %stddev
\ | \ | \
0.13 +4.1% 0.14 ± 3% +37.0% 0.18 vm-scalability.free_time
923091 -0.6% 917275 ± 2% +3.7% 957298 vm-scalability.median
4.57 ± 2% -1.7 2.89 ± 15% -3.9 0.68 ± 7% vm-scalability.median_stddev%
14811265 -0.5% 14731710 +1.8% 15079698 vm-scalability.throughput
3.173e+09 -0.5% 3.156e+09 ± 2% -1.2% 3.134e+09 vm-scalability.workload

=========================================================================================
compiler/kconfig/rootfs/runtime/size/tbox_group/test/testcase:
gcc-11/defconfig/debian/300s/8T/qemu-vm/anon-w-seq-mt/vm-scalability

commit:
5.19.0-rc4-mm-everything+
5.19.0-rc4-boot-time-max-order-10+
5.19.0-rc4-boot-time-max-order-20+

5.19.0-rc4-mm-ev 5.19.0-rc4-boot-time-max-or 5.19.0-rc4-boot-time-max-or
---------------- --------------------------- ---------------------------
%stddev %change %stddev %change %stddev
\ | \ | \
0.07 +1.1% 0.07 +3.6% 0.07 vm-scalability.free_time
667055 -1.6% 656481 -1.4% 657861 vm-scalability.median
2.22 ± 4% -0.1 2.12 ± 6% +0.4 2.60 ± 14% vm-scalability.median_stddev%
10817276 -1.3% 10673638 -2.3% 10568517 vm-scalability.throughput
2.022e+09 -1.0% 2.002e+09 -1.5% 1.991e+09 vm-scalability.workload

=========================================================================================
compiler/kconfig/rootfs/runtime/size/tbox_group/test/testcase:
gcc-11/defconfig/debian/300s/8T/qemu-vm/anon-cow-seq/vm-scalability

commit:
5.19.0-rc4-mm-everything+
5.19.0-rc4-boot-time-max-order-10+
5.19.0-rc4-boot-time-max-order-20+

5.19.0-rc4-mm-ev 5.19.0-rc4-boot-time-max-or 5.19.0-rc4-boot-time-max-or
---------------- --------------------------- ---------------------------
%stddev %change %stddev %change %stddev
\ | \ | \
475554 -0.5% 473278 +2.6% 487908 vm-scalability.median
4.20 ± 2% -1.5 2.73 ± 6% -3.3 0.89 ± 6% vm-scalability.median_stddev%
3.58 ± 3% -1.0 2.58 ± 5% -1.7 1.88 ± 9% vm-scalability.stddev%
7533010 +0.4% 7559545 +1.7% 7663820 vm-scalability.throughput
1.764e+09 +1.8% 1.795e+09 +1.2% 1.785e+09 vm-scalability.workload

=========================================================================================
compiler/kconfig/rootfs/runtime/size/tbox_group/test/testcase:
gcc-11/defconfig/debian/300s/8T/qemu-vm/anon-cow-seq-mt/vm-scalability

commit:
5.19.0-rc4-mm-everything+
5.19.0-rc4-boot-time-max-order-10+
5.19.0-rc4-boot-time-max-order-20+

5.19.0-rc4-mm-ev 5.19.0-rc4-boot-time-max-or 5.19.0-rc4-boot-time-max-or
---------------- --------------------------- ---------------------------
%stddev %change %stddev %change %stddev
\ | \ | \
1.13 ± 14% -0.3 0.85 ± 15% -0.5 0.66 ± 32% vm-scalability.median_stddev%
1.13 ± 14% -0.3 0.85 ± 15% -0.5 0.66 ± 32% vm-scalability.stddev%

=========================================================================================
compiler/kconfig/rootfs/runtime/size/tbox_group/test/testcase:
gcc-11/defconfig/debian/300s/512G/qemu-vm/anon-wx-rand-mt/vm-scalability

commit:
5.19.0-rc4-mm-everything+
5.19.0-rc4-boot-time-max-order-10+
5.19.0-rc4-boot-time-max-order-20+

5.19.0-rc4-mm-ev 5.19.0-rc4-boot-time-max-or 5.19.0-rc4-boot-time-max-or
---------------- --------------------------- ---------------------------
%stddev %change %stddev %change %stddev
\ | \ | \
72308 -1.4% 71294 ± 3% -7.9% 66569 vm-scalability.median
0.96 ± 11% -0.0 0.94 ± 14% -0.5 0.44 ± 5% vm-scalability.stddev%
2.743e+08 -0.0% 2.743e+08 +12.7% 3.09e+08 vm-scalability.workload

=========================================================================================
compiler/kconfig/rootfs/runtime/size/tbox_group/test/testcase:
gcc-11/defconfig/debian/300s/512G/qemu-vm/anon-w-rand/vm-scalability

commit:
5.19.0-rc4-mm-everything+
5.19.0-rc4-boot-time-max-order-10+
5.19.0-rc4-boot-time-max-order-20+

5.19.0-rc4-mm-ev 5.19.0-rc4-boot-time-max-or 5.19.0-rc4-boot-time-max-or
---------------- --------------------------- ---------------------------
%stddev %change %stddev %change %stddev
\ | \ | \
0.09 +0.8% 0.09 ± 3% +5.9% 0.10 ± 3% vm-scalability.free_time
67458 ± 2% -6.0% 63414 ± 2% -11.3% 59805 vm-scalability.median
4.66 ± 36% +4.7 9.38 ± 34% -2.2 2.50 ± 23% vm-scalability.median_stddev%
971866 -1.3% 959227 -2.3% 949434 vm-scalability.throughput
2.469e+08 -0.0% 2.469e+08 +11.1% 2.743e+08 vm-scalability.workload

=========================================================================================
compiler/kconfig/rootfs/runtime/size/tbox_group/test/testcase:
gcc-11/defconfig/debian/300s/512G/qemu-vm/anon-w-rand-mt/vm-scalability

commit:
5.19.0-rc4-mm-everything+
5.19.0-rc4-boot-time-max-order-10+
5.19.0-rc4-boot-time-max-order-20+

5.19.0-rc4-mm-ev 5.19.0-rc4-boot-time-max-or 5.19.0-rc4-boot-time-max-or
---------------- --------------------------- ---------------------------
%stddev %change %stddev %change %stddev
\ | \ | \
0.12 ± 2% +2.8% 0.13 +10.5% 0.14 ± 3% vm-scalability.free_time
65926 ± 3% -1.8% 64711 ± 4% -9.3% 59770 vm-scalability.median
4.51 ± 38% +1.3 5.83 ± 48% -3.1 1.44 ± 31% vm-scalability.median_stddev%
1.24 ± 24% -0.3 0.93 ± 25% -0.8 0.48 ± 17% vm-scalability.stddev%
2.395e+08 +1.5% 2.432e+08 +11.5% 2.67e+08 vm-scalability.workload

=========================================================================================
compiler/kconfig/rootfs/runtime/size/tbox_group/test/testcase:
gcc-11/defconfig/debian/300s/512G/qemu-vm/anon-cow-rand/vm-scalability

commit:
5.19.0-rc4-mm-everything+
5.19.0-rc4-boot-time-max-order-10+
5.19.0-rc4-boot-time-max-order-20+

5.19.0-rc4-mm-ev 5.19.0-rc4-boot-time-max-or 5.19.0-rc4-boot-time-max-or
---------------- --------------------------- ---------------------------
%stddev %change %stddev %change %stddev
\ | \ | \
63519 ± 3% -2.3% 62074 ± 2% -12.2% 55775 vm-scalability.median
914972 -1.2% 904135 -2.4% 893097 vm-scalability.throughput
2.323e+08 -0.0% 2.323e+08 +11.1% 2.582e+08 vm-scalability.workload

=========================================================================================
compiler/kconfig/rootfs/runtime/size/tbox_group/test/testcase:
gcc-11/defconfig/debian/300s/512G/qemu-vm/anon-cow-rand-mt/vm-scalability

commit:
5.19.0-rc4-mm-everything+
5.19.0-rc4-boot-time-max-order-10+
5.19.0-rc4-boot-time-max-order-20+

5.19.0-rc4-mm-ev 5.19.0-rc4-boot-time-max-or 5.19.0-rc4-boot-time-max-or
---------------- --------------------------- ---------------------------
%stddev %change %stddev %change %stddev
\ | \ | \
64719 ± 2% -1.7% 63626 ± 2% -7.4% 59953 vm-scalability.median
3.32 ± 77% +1.3 4.64 ± 64% -2.3 1.02 ± 60% vm-scalability.median_stddev%
0.83 ± 27% -0.1 0.74 ± 53% -0.7 0.18 ± 43% vm-scalability.stddev%

=========================================================================================
compiler/kconfig/rootfs/runtime/size/tbox_group/test/testcase:
gcc-11/defconfig/debian/300s/2T/qemu-vm/shm-xread-seq/vm-scalability

commit:
5.19.0-rc4-mm-everything+
5.19.0-rc4-boot-time-max-order-10+
5.19.0-rc4-boot-time-max-order-20+

5.19.0-rc4-mm-ev 5.19.0-rc4-boot-time-max-or 5.19.0-rc4-boot-time-max-or
---------------- --------------------------- ---------------------------
%stddev %change %stddev %change %stddev
\ | \ | \
346505 +2.5% 355073 +1.8% 352797 vm-scalability.median
1.29 ± 26% +0.4 1.73 ± 11% +0.2 1.47 ± 22% vm-scalability.median_stddev%
1.29 ± 26% +0.4 1.73 ± 11% +0.2 1.47 ± 22% vm-scalability.stddev%
5544053 +2.5% 5681145 +1.8% 5644734 vm-scalability.throughput

=========================================================================================
compiler/kconfig/rootfs/runtime/size/tbox_group/test/testcase:
gcc-11/defconfig/debian/300s/2T/qemu-vm/shm-pread-seq/vm-scalability

commit:
5.19.0-rc4-mm-everything+
5.19.0-rc4-boot-time-max-order-10+
5.19.0-rc4-boot-time-max-order-20+

5.19.0-rc4-mm-ev 5.19.0-rc4-boot-time-max-or 5.19.0-rc4-boot-time-max-or
---------------- --------------------------- ---------------------------
%stddev %change %stddev %change %stddev
\ | \ | \
3.06 +10.8% 3.40 +11.5% 3.42 vm-scalability.free_time
344737 +3.5% 356824 +2.0% 351766 vm-scalability.median
5515773 +3.5% 5709150 +2.0% 5628245 vm-scalability.throughput

=========================================================================================
compiler/kconfig/rootfs/runtime/size/tbox_group/test/testcase:
gcc-11/defconfig/debian/300s/2T/qemu-vm/shm-pread-seq-mt/vm-scalability

commit:
5.19.0-rc4-mm-everything+
5.19.0-rc4-boot-time-max-order-10+
5.19.0-rc4-boot-time-max-order-20+

5.19.0-rc4-mm-ev 5.19.0-rc4-boot-time-max-or 5.19.0-rc4-boot-time-max-or
---------------- --------------------------- ---------------------------
%stddev %change %stddev %change %stddev
\ | \ | \
363265 +2.4% 371881 +2.0% 370384 vm-scalability.median
5807625 +2.4% 5948137 +2.0% 5922313 vm-scalability.throughput

=========================================================================================
compiler/kconfig/rootfs/runtime/size/tbox_group/test/testcase:
gcc-11/defconfig/debian/300s/256G/qemu-vm/msync/vm-scalability

commit:
5.19.0-rc4-mm-everything+
5.19.0-rc4-boot-time-max-order-10+
5.19.0-rc4-boot-time-max-order-20+

5.19.0-rc4-mm-ev 5.19.0-rc4-boot-time-max-or 5.19.0-rc4-boot-time-max-or
---------------- --------------------------- ---------------------------
%stddev %change %stddev %change %stddev
\ | \ | \
124686 ± 4% +2.9% 128345 +9.0% 135953 vm-scalability.median
19.68 ± 9% -1.2 18.47 ± 6% -5.0 14.67 ± 3% vm-scalability.median_stddev%
18.87 ± 9% -1.5 17.38 ± 9% -6.1 12.76 ± 4% vm-scalability.stddev%
2047903 ± 2% +2.1% 2090681 +4.4% 2138545 vm-scalability.throughput

=========================================================================================
compiler/kconfig/rootfs/runtime/size/tbox_group/test/testcase:
gcc-11/defconfig/debian/300s/256G/qemu-vm/lru-shm-rand/vm-scalability

commit:
5.19.0-rc4-mm-everything+
5.19.0-rc4-boot-time-max-order-10+
5.19.0-rc4-boot-time-max-order-20+

5.19.0-rc4-mm-ev 5.19.0-rc4-boot-time-max-or 5.19.0-rc4-boot-time-max-or
---------------- --------------------------- ---------------------------
%stddev %change %stddev %change %stddev
\ | \ | \
0.03 -0.4% 0.03 -2.0% 0.02 vm-scalability.free_time
7.02 ± 18% +0.7 7.68 ± 12% -5.4 1.65 ± 13% vm-scalability.median_stddev%

=========================================================================================
compiler/kconfig/rootfs/runtime/size/tbox_group/test/testcase:
gcc-11/defconfig/debian/300s/1T/qemu-vm/lru-shm/vm-scalability

commit:
5.19.0-rc4-mm-everything+
5.19.0-rc4-boot-time-max-order-10+
5.19.0-rc4-boot-time-max-order-20+

5.19.0-rc4-mm-ev 5.19.0-rc4-boot-time-max-or 5.19.0-rc4-boot-time-max-or
---------------- --------------------------- ---------------------------
%stddev %change %stddev %change %stddev
\ | \ | \
2.36 ± 4% -0.5 1.90 ± 13% -1.6 0.80 ± 20% vm-scalability.median_stddev%

=========================================================================================
compiler/kconfig/rootfs/runtime/size/tbox_group/test/testcase:
gcc-11/defconfig/debian/300s/16G/qemu-vm/shm-xread-rand/vm-scalability

commit:
5.19.0-rc4-mm-everything+
5.19.0-rc4-boot-time-max-order-10+
5.19.0-rc4-boot-time-max-order-20+

5.19.0-rc4-mm-ev 5.19.0-rc4-boot-time-max-or 5.19.0-rc4-boot-time-max-or
---------------- --------------------------- ---------------------------
%stddev %change %stddev %change %stddev
\ | \ | \
168.32 ± 6% -0.1 168.24 ± 17% -114.4 53.94 ± 59% vm-scalability.stddev%

=========================================================================================
compiler/kconfig/rootfs/runtime/size/tbox_group/test/testcase:
gcc-11/defconfig/debian/300s/16G/qemu-vm/shm-pread-rand/vm-scalability

commit:
5.19.0-rc4-mm-everything+
5.19.0-rc4-boot-time-max-order-10+
5.19.0-rc4-boot-time-max-order-20+

5.19.0-rc4-mm-ev 5.19.0-rc4-boot-time-max-or 5.19.0-rc4-boot-time-max-or
---------------- --------------------------- ---------------------------
%stddev %change %stddev %change %stddev
\ | \ | \
172.46 ± 8% -14.2 158.30 ± 20% -106.4 66.04 ± 74% vm-scalability.stddev%

=========================================================================================
compiler/kconfig/rootfs/runtime/size/tbox_group/test/testcase/unit_size:
gcc-11/defconfig/debian/300s/16G/qemu-vm/shm-pread-rand-mt/vm-scalability/1G

commit:
5.19.0-rc4-mm-everything+
5.19.0-rc4-boot-time-max-order-10+
5.19.0-rc4-boot-time-max-order-20+

5.19.0-rc4-mm-ev 5.19.0-rc4-boot-time-max-or 5.19.0-rc4-boot-time-max-or
---------------- --------------------------- ---------------------------
%stddev %change %stddev %change %stddev
\ | \ | \
0.06 ± 4% +5.0% 0.07 ± 5% +38.6% 0.09 ± 6% vm-scalability.free_time

=========================================================================================
compiler/kconfig/rootfs/runtime/size/tbox_group/test/testcase:
gcc-11/defconfig/debian/300s/128G/qemu-vm/truncate-seq/vm-scalability

commit:
5.19.0-rc4-mm-everything+
5.19.0-rc4-boot-time-max-order-10+
5.19.0-rc4-boot-time-max-order-20+

5.19.0-rc4-mm-ev 5.19.0-rc4-boot-time-max-or 5.19.0-rc4-boot-time-max-or
---------------- --------------------------- ---------------------------
%stddev %change %stddev %change %stddev
\ | \ | \
9.00 ± 15% +1.6 10.59 ± 11% -2.6 6.35 ± 8% vm-scalability.median_fault_stddev%
9.00 ± 15% +1.6 10.59 ± 11% -2.6 6.35 ± 8% vm-scalability.stddev_fault%


Zi Yan (12):
arch: mm: rename FORCE_MAX_ZONEORDER to ARCH_FORCE_MAX_ORDER
mm: rectify MAX_ORDER semantics to be the largest page order from
buddy allocator
mm: replace MAX_ORDER when it is used to indicate max physical
contiguity.
mm: adapt deferred struct page init to new MAX_ORDER.
mm: prevent pageblock size being larger than section size.
fs: proc: use pageblock_nr_pages for reschedule period in read_kcore()
virtio: virtio_balloon: use pageblock_order instead of MAX_ORDER
mm/page_reporting: set page_reporting_order to -1 to prevent it
running
mm: Make MAX_ORDER of buddy allocator configurable via Kconfig
SET_MAX_ORDER.
mm: convert MAX_ORDER sized static arrays to dynamic ones.
mm: introduce MIN_MAX_ORDER to replace MAX_ORDER as compile time
constant.
mm: make MAX_ORDER a kernel boot time parameter.

.../admin-guide/kdump/vmcoreinfo.rst | 4 +-
.../admin-guide/kernel-parameters.txt | 9 +-
arch/Kconfig | 4 +
arch/arc/Kconfig | 6 +-
arch/arm/Kconfig | 14 +-
arch/arm/configs/imx_v6_v7_defconfig | 2 +-
arch/arm/configs/milbeaut_m10v_defconfig | 2 +-
arch/arm/configs/oxnas_v6_defconfig | 2 +-
arch/arm/configs/sama7_defconfig | 2 +-
arch/arm64/Kconfig | 18 ++-
arch/arm64/include/asm/sparsemem.h | 2 +-
arch/arm64/kvm/hyp/include/nvhe/gfp.h | 2 +-
arch/arm64/kvm/hyp/nvhe/page_alloc.c | 2 +-
arch/csky/Kconfig | 4 +-
arch/ia64/Kconfig | 10 +-
arch/ia64/include/asm/sparsemem.h | 6 +-
arch/ia64/mm/hugetlbpage.c | 2 +-
arch/m68k/Kconfig.cpu | 10 +-
arch/mips/Kconfig | 24 ++--
arch/nios2/Kconfig | 12 +-
arch/powerpc/Kconfig | 32 ++---
arch/powerpc/configs/85xx/ge_imp3a_defconfig | 2 +-
arch/powerpc/configs/fsl-emb-nonhw.config | 2 +-
arch/powerpc/mm/book3s64/iommu_api.c | 2 +-
arch/powerpc/mm/hugetlbpage.c | 2 +-
arch/powerpc/platforms/powernv/pci-ioda.c | 2 +-
arch/sh/configs/ecovec24_defconfig | 2 +-
arch/sh/mm/Kconfig | 22 ++-
arch/sparc/Kconfig | 10 +-
arch/sparc/kernel/pci_sun4v.c | 2 +-
arch/sparc/kernel/traps_64.c | 2 +-
arch/sparc/mm/tsb.c | 4 +-
arch/um/kernel/um_arch.c | 4 +-
arch/xtensa/Kconfig | 10 +-
drivers/base/regmap/regmap-debugfs.c | 8 +-
drivers/crypto/hisilicon/sgl.c | 6 +-
.../gpu/drm/i915/gem/selftests/huge_pages.c | 2 +-
drivers/gpu/drm/ttm/ttm_device.c | 7 +-
drivers/gpu/drm/ttm/ttm_pool.c | 72 ++++++++--
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 2 +-
drivers/irqchip/irq-gic-v3-its.c | 4 +-
drivers/md/dm-bufio.c | 2 +-
drivers/misc/genwqe/card_utils.c | 2 +-
drivers/net/ethernet/ibm/ibmvnic.h | 2 +-
drivers/video/fbdev/hyperv_fb.c | 6 +-
drivers/virtio/virtio_balloon.c | 2 +-
drivers/virtio/virtio_mem.c | 8 +-
fs/proc/kcore.c | 2 +-
fs/ramfs/file-nommu.c | 2 +-
include/drm/ttm/ttm_pool.h | 4 +-
include/linux/hugetlb.h | 2 +-
include/linux/mmzone.h | 36 ++++-
include/linux/pageblock-flags.h | 21 ++-
include/linux/slab.h | 8 +-
kernel/crash_core.c | 2 +-
kernel/dma/pool.c | 8 +-
mm/Kconfig | 33 ++++-
mm/compaction.c | 8 +-
mm/debug_vm_pgtable.c | 4 +-
mm/huge_memory.c | 2 +-
mm/hugetlb.c | 4 +-
mm/internal.h | 10 +-
mm/memblock.c | 8 +-
mm/memory.c | 4 +-
mm/memory_hotplug.c | 6 +-
mm/page_alloc.c | 128 +++++++++++++-----
mm/page_isolation.c | 14 +-
mm/page_owner.c | 6 +-
mm/page_reporting.c | 8 +-
mm/shuffle.h | 2 +-
mm/slab.c | 2 +-
mm/slub.c | 6 +-
mm/vmscan.c | 1 -
mm/vmstat.c | 14 +-
net/smc/smc_ib.c | 2 +-
scripts/checkpatch.pl | 8 ++
security/integrity/ima/ima_crypto.c | 2 +-
tools/testing/memblock/linux/mmzone.h | 6 +-
78 files changed, 451 insertions(+), 270 deletions(-)

--
2.35.1


2022-08-11 23:19:16

by Zi Yan

Subject: [RFC PATCH v2 06/12] fs: proc: use pageblock_nr_pages for reschedule period in read_kcore()

From: Zi Yan <[email protected]>

MAX_ORDER_NR_PAGES can be increased when it becomes a boot time parameter
in later commits. To make sure read_kcore() reschedules its work at a
constant period, use pageblock_nr_pages for the reschedule period instead,
since pageblock_nr_pages is a constant and either the same as or half of
MAX_ORDER_NR_PAGES.
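
The effect of the period on how often the loop yields can be modelled with a
small userspace sketch. It is only a model: count_yields() is a hypothetical
name, the counter starts at 0 here regardless of how the kernel initializes
page_offline_frozen, and the periods mentioned in the note after it are
assumed example values, not read from any particular configuration.

```c
/*
 * Hypothetical model of a loop that yields once every `period` iterations,
 * using the same post-increment-and-modulo test as the patched condition.
 * Returns how many times the loop would yield over nr_iters iterations.
 */
unsigned long count_yields(unsigned long nr_iters, unsigned long period)
{
	unsigned long counter = 0;
	unsigned long yields = 0;
	unsigned long i;

	for (i = 0; i < nr_iters; i++) {
		if (counter++ % period == 0)
			yields++; /* stands in for thaw / cond_resched() / freeze */
	}
	return yields;
}
```

Over 4096 iterations, a period of 512 gives 8 yields and a period of 1024
gives 4, but a period of 262144 (an order-18 block of 4KB pages) gives only
1, so a period tied to a large boot time MAX_ORDER would yield far less
often.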

Signed-off-by: Zi Yan <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Ying Chen <[email protected]>
Cc: Feng Zhou <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
fs/proc/kcore.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
index dff921f7ca33..7dc09d211b48 100644
--- a/fs/proc/kcore.c
+++ b/fs/proc/kcore.c
@@ -491,7 +491,7 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
}
}

- if (page_offline_frozen++ % MAX_ORDER_NR_PAGES == 0) {
+ if (page_offline_frozen++ % pageblock_nr_pages == 0) {
page_offline_thaw();
cond_resched();
page_offline_freeze();
--
2.35.1

2022-08-11 23:20:52

by Zi Yan

Subject: [RFC PATCH v2 02/12] mm: rectify MAX_ORDER semantics to be the largest page order from buddy allocator

From: Zi Yan <[email protected]>

MAX_ORDER used to denote the largest page order + 1, but that was
confusing and caused several off-by-1 errors in the code. Fix it by
setting MAX_ORDER to the largest page order from the buddy allocator, as
its name says.

Add a warning in checkpatch.pl about the semantics change.
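
A small sketch of what the semantics change means for array sizes and loop
bounds follows. The macro and function names are hypothetical and the values
11 and 10 are illustrative only; the point is that the number of per-order
slots and the largest order visited stay the same while the expressions
change from MAX_ORDER and "< MAX_ORDER" to MAX_ORDER + 1 and "<= MAX_ORDER".

```c
/*
 * Illustrative values only: with free lists for orders 0..10, the old
 * definition gives MAX_ORDER = 11 and the new one gives MAX_ORDER = 10.
 */
#define OLD_MAX_ORDER 11 /* old meaning: largest order + 1 */
#define NEW_MAX_ORDER 10 /* new meaning: largest order */

/* Number of per-order slots needed under each definition. */
unsigned int nr_orders_old(void) { return OLD_MAX_ORDER; }
unsigned int nr_orders_new(void) { return NEW_MAX_ORDER + 1; }

/* Largest order visited by a loop over all orders, old style. */
unsigned int last_order_old(void)
{
	unsigned int order, last = 0;

	for (order = 0; order < OLD_MAX_ORDER; order++)
		last = order;
	return last;
}

/* Same loop, new style: the upper bound becomes inclusive. */
unsigned int last_order_new(void)
{
	unsigned int order, last = 0;

	for (order = 0; order <= NEW_MAX_ORDER; order++)
		last = order;
	return last;
}
```

Code that keeps an exclusive bound or a MAX_ORDER sized array after the
change would silently drop the largest order, which is the kind of mistake
the checkpatch.pl warning is meant to catch.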

Signed-off-by: Zi Yan <[email protected]>
---
.../admin-guide/kdump/vmcoreinfo.rst | 4 +-
.../admin-guide/kernel-parameters.txt | 4 +-
arch/arc/Kconfig | 4 +-
arch/arm/Kconfig | 12 +++---
arch/arm/configs/imx_v6_v7_defconfig | 2 +-
arch/arm/configs/milbeaut_m10v_defconfig | 2 +-
arch/arm/configs/oxnas_v6_defconfig | 2 +-
arch/arm/configs/sama7_defconfig | 2 +-
arch/arm64/Kconfig | 16 ++++----
arch/arm64/include/asm/sparsemem.h | 2 +-
arch/arm64/kvm/hyp/include/nvhe/gfp.h | 2 +-
arch/csky/Kconfig | 2 +-
arch/ia64/Kconfig | 8 ++--
arch/ia64/include/asm/sparsemem.h | 4 +-
arch/ia64/mm/hugetlbpage.c | 2 +-
arch/m68k/Kconfig.cpu | 8 ++--
arch/mips/Kconfig | 22 +++++-----
arch/nios2/Kconfig | 10 ++---
arch/powerpc/Kconfig | 30 +++++++-------
arch/powerpc/configs/85xx/ge_imp3a_defconfig | 2 +-
arch/powerpc/configs/fsl-emb-nonhw.config | 2 +-
arch/powerpc/mm/book3s64/iommu_api.c | 2 +-
arch/powerpc/mm/hugetlbpage.c | 2 +-
arch/powerpc/platforms/powernv/pci-ioda.c | 2 +-
arch/sh/configs/ecovec24_defconfig | 2 +-
arch/sh/mm/Kconfig | 20 +++++-----
arch/sparc/Kconfig | 8 ++--
arch/sparc/kernel/pci_sun4v.c | 2 +-
arch/sparc/kernel/traps_64.c | 2 +-
arch/xtensa/Kconfig | 8 ++--
drivers/base/regmap/regmap-debugfs.c | 8 ++--
drivers/crypto/hisilicon/sgl.c | 6 +--
.../gpu/drm/i915/gem/selftests/huge_pages.c | 2 +-
drivers/gpu/drm/ttm/ttm_pool.c | 22 +++++-----
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 2 +-
drivers/irqchip/irq-gic-v3-its.c | 4 +-
drivers/md/dm-bufio.c | 2 +-
drivers/misc/genwqe/card_utils.c | 2 +-
drivers/net/ethernet/ibm/ibmvnic.h | 2 +-
drivers/video/fbdev/hyperv_fb.c | 6 +--
drivers/virtio/virtio_balloon.c | 2 +-
drivers/virtio/virtio_mem.c | 8 ++--
fs/ramfs/file-nommu.c | 2 +-
include/drm/ttm/ttm_pool.h | 2 +-
include/linux/hugetlb.h | 2 +-
include/linux/mmzone.h | 10 ++---
include/linux/pageblock-flags.h | 4 +-
include/linux/slab.h | 8 ++--
kernel/crash_core.c | 2 +-
kernel/dma/pool.c | 6 +--
mm/Kconfig | 6 +--
mm/compaction.c | 8 ++--
mm/debug_vm_pgtable.c | 4 +-
mm/huge_memory.c | 2 +-
mm/hugetlb.c | 4 +-
mm/memblock.c | 2 +-
mm/memory_hotplug.c | 4 +-
mm/page_alloc.c | 40 +++++++++----------
mm/page_isolation.c | 14 +++----
mm/page_owner.c | 6 +--
mm/page_reporting.c | 4 +-
mm/shuffle.h | 2 +-
mm/slab.c | 2 +-
mm/slub.c | 4 +-
mm/vmstat.c | 14 +++----
net/smc/smc_ib.c | 2 +-
scripts/checkpatch.pl | 8 ++++
security/integrity/ima/ima_crypto.c | 2 +-
tools/testing/memblock/linux/mmzone.h | 6 +--
69 files changed, 208 insertions(+), 218 deletions(-)

diff --git a/Documentation/admin-guide/kdump/vmcoreinfo.rst b/Documentation/admin-guide/kdump/vmcoreinfo.rst
index 8419019b6a88..c572b5230fe0 100644
--- a/Documentation/admin-guide/kdump/vmcoreinfo.rst
+++ b/Documentation/admin-guide/kdump/vmcoreinfo.rst
@@ -172,7 +172,7 @@ variables.
Offset of the free_list's member. This value is used to compute the number
of free pages.

-Each zone has a free_area structure array called free_area[MAX_ORDER].
+Each zone has a free_area structure array called free_area[MAX_ORDER + 1].
The free_list represents a linked list of free page blocks.

(list_head, next|prev)
@@ -189,7 +189,7 @@ Offsets of the vmap_area's members. They carry vmalloc-specific
information. Makedumpfile gets the start address of the vmalloc region
from this.

-(zone.free_area, MAX_ORDER)
+(zone.free_area, MAX_ORDER + 1)
---------------------------

Free areas descriptor. User-space tools use this value to iterate the
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index db5de5f0b9d3..ff33971e1630 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -928,7 +928,7 @@
buddy allocator. Bigger value increase the probability
of catching random memory corruption, but reduce the
amount of memory for normal system use. The maximum
- possible value is MAX_ORDER/2. Setting this parameter
+ possible value is (MAX_ORDER + 1)/2. Setting this parameter
to 1 or 2 should be enough to identify most random
memory corruption problems caused by bugs in kernel or
driver code when a CPU writes to (or reads from) a
@@ -3899,7 +3899,7 @@
[KNL] Minimal page reporting order
Format: <integer>
Adjust the minimal page reporting order. The page
- reporting is disabled when it exceeds (MAX_ORDER-1).
+ reporting is disabled when it exceeds MAX_ORDER.

panic= [KNL] Kernel behaviour on panic: delay <timeout>
timeout > 0: seconds before rebooting
diff --git a/arch/arc/Kconfig b/arch/arc/Kconfig
index d9a13ccf89a3..ab6d701365bb 100644
--- a/arch/arc/Kconfig
+++ b/arch/arc/Kconfig
@@ -556,7 +556,7 @@ endmenu # "ARC Architecture Configuration"

config ARCH_FORCE_MAX_ORDER
int "Maximum zone order"
- default "12" if ARC_HUGEPAGE_16M
- default "11"
+ default "11" if ARC_HUGEPAGE_16M
+ default "10"

source "kernel/power/Kconfig"
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index e6c8ee56ac52..c8f2e46cc8c4 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -1436,19 +1436,17 @@ config ARM_MODULE_PLTS

config ARCH_FORCE_MAX_ORDER
int "Maximum zone order"
- default "12" if SOC_AM33XX
- default "9" if SA1111
- default "11"
+ default "11" if SOC_AM33XX
+ default "8" if SA1111
+ default "10"
help
The kernel memory allocator divides physically contiguous memory
blocks into "zones", where each zone is a power of two number of
pages. This option selects the largest power of two that the kernel
keeps in the memory allocator. If you need to allocate very large
blocks of physically contiguous memory, then you may need to
- increase this value.
-
- This config option is actually maximum order plus one. For example,
- a value of 11 means that the largest free memory block is 2^10 pages.
+ increase this value. A value of 10 means that the largest free memory
+ block is 2^10 pages.

config ALIGNMENT_TRAP
def_bool CPU_CP15_MMU
diff --git a/arch/arm/configs/imx_v6_v7_defconfig b/arch/arm/configs/imx_v6_v7_defconfig
index fb283059daa0..eeb14499479d 100644
--- a/arch/arm/configs/imx_v6_v7_defconfig
+++ b/arch/arm/configs/imx_v6_v7_defconfig
@@ -31,7 +31,7 @@ CONFIG_SOC_VF610=y
CONFIG_SMP=y
CONFIG_ARM_PSCI=y
CONFIG_HIGHMEM=y
-CONFIG_ARCH_FORCE_MAX_ORDER=14
+CONFIG_ARCH_FORCE_MAX_ORDER=13
CONFIG_CMDLINE="noinitrd console=ttymxc0,115200"
CONFIG_KEXEC=y
CONFIG_CPU_FREQ=y
diff --git a/arch/arm/configs/milbeaut_m10v_defconfig b/arch/arm/configs/milbeaut_m10v_defconfig
index 8620061e19a8..22732f19e79b 100644
--- a/arch/arm/configs/milbeaut_m10v_defconfig
+++ b/arch/arm/configs/milbeaut_m10v_defconfig
@@ -26,7 +26,7 @@ CONFIG_THUMB2_KERNEL=y
# CONFIG_THUMB2_AVOID_R_ARM_THM_JUMP11 is not set
# CONFIG_ARM_PATCH_IDIV is not set
CONFIG_HIGHMEM=y
-CONFIG_ARCH_FORCE_MAX_ORDER=12
+CONFIG_ARCH_FORCE_MAX_ORDER=11
CONFIG_SECCOMP=y
CONFIG_KEXEC=y
CONFIG_EFI=y
diff --git a/arch/arm/configs/oxnas_v6_defconfig b/arch/arm/configs/oxnas_v6_defconfig
index 5c163a9d1429..7e43aa355467 100644
--- a/arch/arm/configs/oxnas_v6_defconfig
+++ b/arch/arm/configs/oxnas_v6_defconfig
@@ -12,7 +12,7 @@ CONFIG_ARCH_OXNAS=y
CONFIG_MACH_OX820=y
CONFIG_SMP=y
CONFIG_NR_CPUS=16
-CONFIG_ARCH_FORCE_MAX_ORDER=12
+CONFIG_ARCH_FORCE_MAX_ORDER=11
CONFIG_SECCOMP=y
CONFIG_ARM_APPENDED_DTB=y
CONFIG_ARM_ATAG_DTB_COMPAT=y
diff --git a/arch/arm/configs/sama7_defconfig b/arch/arm/configs/sama7_defconfig
index 8b2cf6ddd568..c200de3947e3 100644
--- a/arch/arm/configs/sama7_defconfig
+++ b/arch/arm/configs/sama7_defconfig
@@ -19,7 +19,7 @@ CONFIG_ATMEL_CLOCKSOURCE_TCB=y
# CONFIG_CACHE_L2X0 is not set
# CONFIG_ARM_PATCH_IDIV is not set
# CONFIG_CPU_SW_DOMAIN_PAN is not set
-CONFIG_ARCH_FORCE_MAX_ORDER=15
+CONFIG_ARCH_FORCE_MAX_ORDER=14
CONFIG_UACCESS_WITH_MEMCPY=y
# CONFIG_ATAGS is not set
CONFIG_CMDLINE="console=ttyS0,115200 earlyprintk ignore_loglevel"
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index c6fcd8746f60..1afcfc9d2dc0 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -1403,25 +1403,23 @@ config XEN

config ARCH_FORCE_MAX_ORDER
int
- default "14" if ARM64_64K_PAGES
- default "12" if ARM64_16K_PAGES
- default "11"
+ default "13" if ARM64_64K_PAGES
+ default "11" if ARM64_16K_PAGES
+ default "10"
help
The kernel memory allocator divides physically contiguous memory
blocks into "zones", where each zone is a power of two number of
pages. This option selects the largest power of two that the kernel
keeps in the memory allocator. If you need to allocate very large
blocks of physically contiguous memory, then you may need to
- increase this value.
-
- This config option is actually maximum order plus one. For example,
- a value of 11 means that the largest free memory block is 2^10 pages.
+ increase this value. A value of 10 means that the largest free memory
+ block is 2^10 pages.

We make sure that we can allocate upto a HugePage size for each configuration.
Hence we have :
- MAX_ORDER = (PMD_SHIFT - PAGE_SHIFT) + 1 => PAGE_SHIFT - 2
+ MAX_ORDER = PMD_SHIFT - PAGE_SHIFT = PAGE_SHIFT - 3

- However for 4K, we choose a higher default value, 11 as opposed to 10, giving us
+ However for 4K, we choose a higher default value, 10 as opposed to 9, giving us
4M allocations matching the default size used by generic code.

config UNMAP_KERNEL_AT_EL0
diff --git a/arch/arm64/include/asm/sparsemem.h b/arch/arm64/include/asm/sparsemem.h
index 4b73463423c3..5f5437621029 100644
--- a/arch/arm64/include/asm/sparsemem.h
+++ b/arch/arm64/include/asm/sparsemem.h
@@ -10,7 +10,7 @@
/*
* Section size must be at least 512MB for 64K base
* page size config. Otherwise it will be less than
- * (MAX_ORDER - 1) and the build process will fail.
+ * MAX_ORDER and the build process will fail.
*/
#ifdef CONFIG_ARM64_64K_PAGES
#define SECTION_SIZE_BITS 29
diff --git a/arch/arm64/kvm/hyp/include/nvhe/gfp.h b/arch/arm64/kvm/hyp/include/nvhe/gfp.h
index 0a048dc06a7d..fe5472a184a3 100644
--- a/arch/arm64/kvm/hyp/include/nvhe/gfp.h
+++ b/arch/arm64/kvm/hyp/include/nvhe/gfp.h
@@ -16,7 +16,7 @@ struct hyp_pool {
* API at EL2.
*/
hyp_spinlock_t lock;
- struct list_head free_area[MAX_ORDER];
+ struct list_head free_area[MAX_ORDER + 1];
phys_addr_t range_start;
phys_addr_t range_end;
unsigned short max_order;
diff --git a/arch/csky/Kconfig b/arch/csky/Kconfig
index adee6ab36862..a35fc882e97e 100644
--- a/arch/csky/Kconfig
+++ b/arch/csky/Kconfig
@@ -334,7 +334,7 @@ config HIGHMEM

config ARCH_FORCE_MAX_ORDER
int "Maximum zone order"
- default "11"
+ default "10"

config DRAM_BASE
hex "DRAM start addr (the same with memory-section in dts)"
diff --git a/arch/ia64/Kconfig b/arch/ia64/Kconfig
index c6e06cdc738f..d85f6fbd0746 100644
--- a/arch/ia64/Kconfig
+++ b/arch/ia64/Kconfig
@@ -201,10 +201,10 @@ config IA64_CYCLONE
If you're unsure, answer N.

config ARCH_FORCE_MAX_ORDER
- int "MAX_ORDER (11 - 17)" if !HUGETLB_PAGE
- range 11 17 if !HUGETLB_PAGE
- default "17" if HUGETLB_PAGE
- default "11"
+ int "MAX_ORDER (10 - 16)" if !HUGETLB_PAGE
+ range 10 16 if !HUGETLB_PAGE
+ default "16" if HUGETLB_PAGE
+ default "10"

config SMP
bool "Symmetric multi-processing support"
diff --git a/arch/ia64/include/asm/sparsemem.h b/arch/ia64/include/asm/sparsemem.h
index 84e8ce387b69..04f03a56c166 100644
--- a/arch/ia64/include/asm/sparsemem.h
+++ b/arch/ia64/include/asm/sparsemem.h
@@ -12,9 +12,9 @@
#define SECTION_SIZE_BITS (30)
#define MAX_PHYSMEM_BITS (50)
#ifdef CONFIG_ARCH_FORCE_MAX_ORDER
-#if ((CONFIG_ARCH_FORCE_MAX_ORDER - 1 + PAGE_SHIFT) > SECTION_SIZE_BITS)
+#if ((CONFIG_ARCH_FORCE_MAX_ORDER + PAGE_SHIFT) > SECTION_SIZE_BITS)
#undef SECTION_SIZE_BITS
-#define SECTION_SIZE_BITS (CONFIG_ARCH_FORCE_MAX_ORDER - 1 + PAGE_SHIFT)
+#define SECTION_SIZE_BITS (CONFIG_ARCH_FORCE_MAX_ORDER + PAGE_SHIFT)
#endif
#endif

diff --git a/arch/ia64/mm/hugetlbpage.c b/arch/ia64/mm/hugetlbpage.c
index f993cb36c062..87cc2e8908b4 100644
--- a/arch/ia64/mm/hugetlbpage.c
+++ b/arch/ia64/mm/hugetlbpage.c
@@ -185,7 +185,7 @@ static int __init hugetlb_setup_sz(char *str)
size = memparse(str, &str);
if (*str || !is_power_of_2(size) || !(tr_pages & size) ||
size <= PAGE_SIZE ||
- size >= (1UL << PAGE_SHIFT << MAX_ORDER)) {
+ size > (1UL << PAGE_SHIFT << MAX_ORDER)) {
printk(KERN_WARNING "Invalid huge page size specified\n");
return 1;
}
diff --git a/arch/m68k/Kconfig.cpu b/arch/m68k/Kconfig.cpu
index 3b2f39508524..d3832e1ca7df 100644
--- a/arch/m68k/Kconfig.cpu
+++ b/arch/m68k/Kconfig.cpu
@@ -402,22 +402,20 @@ config SINGLE_MEMORY_CHUNK
config ARCH_FORCE_MAX_ORDER
int "Maximum zone order" if ADVANCED
depends on !SINGLE_MEMORY_CHUNK
- default "11"
+ default "10"
help
The kernel memory allocator divides physically contiguous memory
blocks into "zones", where each zone is a power of two number of
pages. This option selects the largest power of two that the kernel
keeps in the memory allocator. If you need to allocate very large
blocks of physically contiguous memory, then you may need to
- increase this value.
+ increase this value. A value of 10 means that the largest free memory
+ block is 2^10 pages.

For systems that have holes in their physical address space this
value also defines the minimal size of the hole that allows
freeing unused memory map.

- This config option is actually maximum order plus one. For example,
- a value of 11 means that the largest free memory block is 2^10 pages.
-
config 060_WRITETHROUGH
bool "Use write-through caching for 68060 supervisor accesses"
depends on ADVANCED && M68060
diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
index 70d28976a40d..37116c811e60 100644
--- a/arch/mips/Kconfig
+++ b/arch/mips/Kconfig
@@ -2142,24 +2142,22 @@ endchoice

config ARCH_FORCE_MAX_ORDER
int "Maximum zone order"
- range 14 64 if MIPS_HUGE_TLB_SUPPORT && PAGE_SIZE_64KB
- default "14" if MIPS_HUGE_TLB_SUPPORT && PAGE_SIZE_64KB
- range 13 64 if MIPS_HUGE_TLB_SUPPORT && PAGE_SIZE_32KB
- default "13" if MIPS_HUGE_TLB_SUPPORT && PAGE_SIZE_32KB
- range 12 64 if MIPS_HUGE_TLB_SUPPORT && PAGE_SIZE_16KB
- default "12" if MIPS_HUGE_TLB_SUPPORT && PAGE_SIZE_16KB
- range 0 64
- default "11"
+ range 13 63 if MIPS_HUGE_TLB_SUPPORT && PAGE_SIZE_64KB
+ default "13" if MIPS_HUGE_TLB_SUPPORT && PAGE_SIZE_64KB
+ range 12 63 if MIPS_HUGE_TLB_SUPPORT && PAGE_SIZE_32KB
+ default "12" if MIPS_HUGE_TLB_SUPPORT && PAGE_SIZE_32KB
+ range 11 63 if MIPS_HUGE_TLB_SUPPORT && PAGE_SIZE_16KB
+ default "11" if MIPS_HUGE_TLB_SUPPORT && PAGE_SIZE_16KB
+ range 0 63
+ default "10"
help
The kernel memory allocator divides physically contiguous memory
blocks into "zones", where each zone is a power of two number of
pages. This option selects the largest power of two that the kernel
keeps in the memory allocator. If you need to allocate very large
blocks of physically contiguous memory, then you may need to
- increase this value.
-
- This config option is actually maximum order plus one. For example,
- a value of 11 means that the largest free memory block is 2^10 pages.
+ increase this value. A value of 10 means that the largest free memory
+ block is 2^10 pages.

The page size is not necessarily 4KB. Keep this in mind
when choosing a value for this option.
diff --git a/arch/nios2/Kconfig b/arch/nios2/Kconfig
index a582f72104f3..0cccaf8b7fdf 100644
--- a/arch/nios2/Kconfig
+++ b/arch/nios2/Kconfig
@@ -46,18 +46,16 @@ source "kernel/Kconfig.hz"

config ARCH_FORCE_MAX_ORDER
int "Maximum zone order"
- range 9 20
- default "11"
+ range 8 19
+ default "10"
help
The kernel memory allocator divides physically contiguous memory
blocks into "zones", where each zone is a power of two number of
pages. This option selects the largest power of two that the kernel
keeps in the memory allocator. If you need to allocate very large
blocks of physically contiguous memory, then you may need to
- increase this value.
-
- This config option is actually maximum order plus one. For example,
- a value of 11 means that the largest free memory block is 2^10 pages.
+ increase this value. A value of 10 means that the largest free memory
+ block is 2^10 pages.

endmenu

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 39d71d7701bd..d052cf27883e 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -847,28 +847,26 @@ config DATA_SHIFT

config ARCH_FORCE_MAX_ORDER
int "Maximum zone order"
- range 8 9 if PPC64 && PPC_64K_PAGES
- default "9" if PPC64 && PPC_64K_PAGES
- range 13 13 if PPC64 && !PPC_64K_PAGES
- default "13" if PPC64 && !PPC_64K_PAGES
- range 9 64 if PPC32 && PPC_16K_PAGES
- default "9" if PPC32 && PPC_16K_PAGES
- range 7 64 if PPC32 && PPC_64K_PAGES
- default "7" if PPC32 && PPC_64K_PAGES
- range 5 64 if PPC32 && PPC_256K_PAGES
- default "5" if PPC32 && PPC_256K_PAGES
- range 11 64
- default "11"
+ range 7 8 if PPC64 && PPC_64K_PAGES
+ default "8" if PPC64 && PPC_64K_PAGES
+ range 12 12 if PPC64 && !PPC_64K_PAGES
+ default "12" if PPC64 && !PPC_64K_PAGES
+ range 8 63 if PPC32 && PPC_16K_PAGES
+ default "8" if PPC32 && PPC_16K_PAGES
+ range 6 63 if PPC32 && PPC_64K_PAGES
+ default "6" if PPC32 && PPC_64K_PAGES
+ range 4 63 if PPC32 && PPC_256K_PAGES
+ default "4" if PPC32 && PPC_256K_PAGES
+ range 10 63
+ default "10"
help
The kernel memory allocator divides physically contiguous memory
blocks into "zones", where each zone is a power of two number of
pages. This option selects the largest power of two that the kernel
keeps in the memory allocator. If you need to allocate very large
blocks of physically contiguous memory, then you may need to
- increase this value.
-
- This config option is actually maximum order plus one. For example,
- a value of 11 means that the largest free memory block is 2^10 pages.
+ increase this value. A value of 10 means that the largest free memory
+ block is 2^10 pages.

The page size is not necessarily 4KB. For example, on 64-bit
systems, 64KB pages can be enabled via CONFIG_PPC_64K_PAGES. Keep
diff --git a/arch/powerpc/configs/85xx/ge_imp3a_defconfig b/arch/powerpc/configs/85xx/ge_imp3a_defconfig
index e7672c186325..b8be8280a200 100644
--- a/arch/powerpc/configs/85xx/ge_imp3a_defconfig
+++ b/arch/powerpc/configs/85xx/ge_imp3a_defconfig
@@ -30,7 +30,7 @@ CONFIG_PREEMPT=y
# CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS is not set
CONFIG_BINFMT_MISC=m
CONFIG_MATH_EMULATION=y
-CONFIG_ARCH_FORCE_MAX_ORDER=17
+CONFIG_ARCH_FORCE_MAX_ORDER=16
CONFIG_PCI=y
CONFIG_PCIEPORTBUS=y
CONFIG_PCI_MSI=y
diff --git a/arch/powerpc/configs/fsl-emb-nonhw.config b/arch/powerpc/configs/fsl-emb-nonhw.config
index ab8a8c4530d9..3009b0efaf34 100644
--- a/arch/powerpc/configs/fsl-emb-nonhw.config
+++ b/arch/powerpc/configs/fsl-emb-nonhw.config
@@ -41,7 +41,7 @@ CONFIG_FIXED_PHY=y
CONFIG_FONT_8x16=y
CONFIG_FONT_8x8=y
CONFIG_FONTS=y
-CONFIG_ARCH_FORCE_MAX_ORDER=13
+CONFIG_ARCH_FORCE_MAX_ORDER=12
CONFIG_FRAMEBUFFER_CONSOLE=y
CONFIG_FRAME_WARN=1024
CONFIG_FTL=y
diff --git a/arch/powerpc/mm/book3s64/iommu_api.c b/arch/powerpc/mm/book3s64/iommu_api.c
index 7fcfba162e0d..81d7185e2ae8 100644
--- a/arch/powerpc/mm/book3s64/iommu_api.c
+++ b/arch/powerpc/mm/book3s64/iommu_api.c
@@ -97,7 +97,7 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
}

mmap_read_lock(mm);
- chunk = (1UL << (PAGE_SHIFT + MAX_ORDER - 1)) /
+ chunk = (1UL << (PAGE_SHIFT + MAX_ORDER)) /
sizeof(struct vm_area_struct *);
chunk = min(chunk, entries);
for (entry = 0; entry < entries; entry += chunk) {
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index bc84a594ca62..8d63934783dc 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -652,7 +652,7 @@ void __init gigantic_hugetlb_cma_reserve(void)
order = mmu_psize_to_shift(MMU_PAGE_16G) - PAGE_SHIFT;

if (order) {
- VM_WARN_ON(order < MAX_ORDER);
+ VM_WARN_ON(order <= MAX_ORDER);
hugetlb_cma_reserve(order);
}
}
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 9de9b2fb163d..8e29a57924ef 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1740,7 +1740,7 @@ static long pnv_pci_ioda2_setup_default_config(struct pnv_ioda_pe *pe)
* DMA window can be larger than available memory, which will
* cause errors later.
*/
- const u64 maxblock = 1UL << (PAGE_SHIFT + MAX_ORDER - 1);
+ const u64 maxblock = 1UL << (PAGE_SHIFT + MAX_ORDER);

/*
* We create the default window as big as we can. The constraint is
diff --git a/arch/sh/configs/ecovec24_defconfig b/arch/sh/configs/ecovec24_defconfig
index b52e14ccb450..4d655e8d4d74 100644
--- a/arch/sh/configs/ecovec24_defconfig
+++ b/arch/sh/configs/ecovec24_defconfig
@@ -8,7 +8,7 @@ CONFIG_MODULES=y
CONFIG_MODULE_UNLOAD=y
# CONFIG_BLK_DEV_BSG is not set
CONFIG_CPU_SUBTYPE_SH7724=y
-CONFIG_ARCH_FORCE_MAX_ORDER=12
+CONFIG_ARCH_FORCE_MAX_ORDER=11
CONFIG_MEMORY_SIZE=0x10000000
CONFIG_FLATMEM_MANUAL=y
CONFIG_SH_ECOVEC=y
diff --git a/arch/sh/mm/Kconfig b/arch/sh/mm/Kconfig
index 411fdc0901f7..e60e77c6edca 100644
--- a/arch/sh/mm/Kconfig
+++ b/arch/sh/mm/Kconfig
@@ -20,23 +20,21 @@ config PAGE_OFFSET

config ARCH_FORCE_MAX_ORDER
int "Maximum zone order"
- range 9 64 if PAGE_SIZE_16KB
- default "9" if PAGE_SIZE_16KB
- range 7 64 if PAGE_SIZE_64KB
- default "7" if PAGE_SIZE_64KB
- range 11 64
- default "14" if !MMU
- default "11"
+ range 8 63 if PAGE_SIZE_16KB
+ default "8" if PAGE_SIZE_16KB
+ range 6 63 if PAGE_SIZE_64KB
+ default "6" if PAGE_SIZE_64KB
+ range 10 63
+ default "13" if !MMU
+ default "10"
help
The kernel memory allocator divides physically contiguous memory
blocks into "zones", where each zone is a power of two number of
pages. This option selects the largest power of two that the kernel
keeps in the memory allocator. If you need to allocate very large
blocks of physically contiguous memory, then you may need to
- increase this value.
-
- This config option is actually maximum order plus one. For example,
- a value of 11 means that the largest free memory block is 2^10 pages.
+ increase this value. A value of 10 means that the largest free memory
+ block is 2^10 pages.

The page size is not necessarily 4KB. Keep this in mind when
choosing a value for this option.
diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index 4d3d1af90d52..099d0b31ea69 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -271,17 +271,15 @@ config ARCH_SPARSEMEM_DEFAULT

config ARCH_FORCE_MAX_ORDER
int "Maximum zone order"
- default "13"
+ default "12"
help
The kernel memory allocator divides physically contiguous memory
blocks into "zones", where each zone is a power of two number of
pages. This option selects the largest power of two that the kernel
keeps in the memory allocator. If you need to allocate very large
blocks of physically contiguous memory, then you may need to
- increase this value.
-
- This config option is actually maximum order plus one. For example,
- a value of 13 means that the largest free memory block is 2^12 pages.
+ increase this value. A value of 12 means that the largest free memory
+ block is 2^12 pages.

if SPARC64
source "kernel/power/Kconfig"
diff --git a/arch/sparc/kernel/pci_sun4v.c b/arch/sparc/kernel/pci_sun4v.c
index 384480971805..7d91ca6aa675 100644
--- a/arch/sparc/kernel/pci_sun4v.c
+++ b/arch/sparc/kernel/pci_sun4v.c
@@ -193,7 +193,7 @@ static void *dma_4v_alloc_coherent(struct device *dev, size_t size,

size = IO_PAGE_ALIGN(size);
order = get_order(size);
- if (unlikely(order >= MAX_ORDER))
+ if (unlikely(order > MAX_ORDER))
return NULL;

npages = size >> IO_PAGE_SHIFT;
diff --git a/arch/sparc/kernel/traps_64.c b/arch/sparc/kernel/traps_64.c
index 5b4de4a89dec..08ffd17d5ec3 100644
--- a/arch/sparc/kernel/traps_64.c
+++ b/arch/sparc/kernel/traps_64.c
@@ -897,7 +897,7 @@ void __init cheetah_ecache_flush_init(void)

/* Now allocate error trap reporting scoreboard. */
sz = NR_CPUS * (2 * sizeof(struct cheetah_err_info));
- for (order = 0; order < MAX_ORDER; order++) {
+ for (order = 0; order <= MAX_ORDER; order++) {
if ((PAGE_SIZE << order) >= sz)
break;
}
diff --git a/arch/xtensa/Kconfig b/arch/xtensa/Kconfig
index bcb0c5d2abc2..2d1d91718263 100644
--- a/arch/xtensa/Kconfig
+++ b/arch/xtensa/Kconfig
@@ -773,17 +773,15 @@ config HIGHMEM

config ARCH_FORCE_MAX_ORDER
int "Maximum zone order"
- default "11"
+ default "10"
help
The kernel memory allocator divides physically contiguous memory
blocks into "zones", where each zone is a power of two number of
pages. This option selects the largest power of two that the kernel
keeps in the memory allocator. If you need to allocate very large
blocks of physically contiguous memory, then you may need to
- increase this value.
-
- This config option is actually maximum order plus one. For example,
- a value of 11 means that the largest free memory block is 2^10 pages.
+ increase this value. A value of 10 means that the largest free memory
+ block is 2^10 pages.

endmenu

diff --git a/drivers/base/regmap/regmap-debugfs.c b/drivers/base/regmap/regmap-debugfs.c
index 817eda2075aa..c491fabe3617 100644
--- a/drivers/base/regmap/regmap-debugfs.c
+++ b/drivers/base/regmap/regmap-debugfs.c
@@ -226,8 +226,8 @@ static ssize_t regmap_read_debugfs(struct regmap *map, unsigned int from,
if (*ppos < 0 || !count)
return -EINVAL;

- if (count > (PAGE_SIZE << (MAX_ORDER - 1)))
- count = PAGE_SIZE << (MAX_ORDER - 1);
+ if (count > (PAGE_SIZE << MAX_ORDER))
+ count = PAGE_SIZE << MAX_ORDER;

buf = kmalloc(count, GFP_KERNEL);
if (!buf)
@@ -373,8 +373,8 @@ static ssize_t regmap_reg_ranges_read_file(struct file *file,
if (*ppos < 0 || !count)
return -EINVAL;

- if (count > (PAGE_SIZE << (MAX_ORDER - 1)))
- count = PAGE_SIZE << (MAX_ORDER - 1);
+ if (count > (PAGE_SIZE << MAX_ORDER))
+ count = PAGE_SIZE << MAX_ORDER;

buf = kmalloc(count, GFP_KERNEL);
if (!buf)
diff --git a/drivers/crypto/hisilicon/sgl.c b/drivers/crypto/hisilicon/sgl.c
index 2b6f2281cfd6..f30cf96b0a41 100644
--- a/drivers/crypto/hisilicon/sgl.c
+++ b/drivers/crypto/hisilicon/sgl.c
@@ -70,11 +70,11 @@ struct hisi_acc_sgl_pool *hisi_acc_create_sgl_pool(struct device *dev,
HISI_ACC_SGL_ALIGN_SIZE);

/*
- * the pool may allocate a block of memory of size PAGE_SIZE * 2^(MAX_ORDER - 1),
+ * the pool may allocate a block of memory of size PAGE_SIZE * 2^MAX_ORDER,
* block size may exceed 2^31 on ia64, so the max of block size is 2^31
*/
- block_size = 1 << (PAGE_SHIFT + MAX_ORDER <= 32 ?
- PAGE_SHIFT + MAX_ORDER - 1 : 31);
+ block_size = 1 << (PAGE_SHIFT + MAX_ORDER <= 31 ?
+ PAGE_SHIFT + MAX_ORDER : 31);
sgl_num_per_block = block_size / sgl_size;
block_num = count / sgl_num_per_block;
remain_sgl = count % sgl_num_per_block;
diff --git a/drivers/gpu/drm/i915/gem/selftests/huge_pages.c b/drivers/gpu/drm/i915/gem/selftests/huge_pages.c
index 72ce2c9f42fd..84498c7f845d 100644
--- a/drivers/gpu/drm/i915/gem/selftests/huge_pages.c
+++ b/drivers/gpu/drm/i915/gem/selftests/huge_pages.c
@@ -111,7 +111,7 @@ static int get_huge_pages(struct drm_i915_gem_object *obj)
do {
struct page *page;

- GEM_BUG_ON(order >= MAX_ORDER);
+ GEM_BUG_ON(order > MAX_ORDER);
page = alloc_pages(GFP | __GFP_ZERO, order);
if (!page)
goto err;
diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
index 21b61631f73a..85d19f425af6 100644
--- a/drivers/gpu/drm/ttm/ttm_pool.c
+++ b/drivers/gpu/drm/ttm/ttm_pool.c
@@ -64,11 +64,11 @@ module_param(page_pool_size, ulong, 0644);

static atomic_long_t allocated_pages;

-static struct ttm_pool_type global_write_combined[MAX_ORDER];
-static struct ttm_pool_type global_uncached[MAX_ORDER];
+static struct ttm_pool_type global_write_combined[MAX_ORDER + 1];
+static struct ttm_pool_type global_uncached[MAX_ORDER + 1];

-static struct ttm_pool_type global_dma32_write_combined[MAX_ORDER];
-static struct ttm_pool_type global_dma32_uncached[MAX_ORDER];
+static struct ttm_pool_type global_dma32_write_combined[MAX_ORDER + 1];
+static struct ttm_pool_type global_dma32_uncached[MAX_ORDER + 1];

static spinlock_t shrinker_lock;
static struct list_head shrinker_list;
@@ -382,7 +382,7 @@ int ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
else
gfp_flags |= GFP_HIGHUSER;

- for (order = min_t(unsigned int, MAX_ORDER - 1, __fls(num_pages));
+ for (order = min_t(unsigned int, MAX_ORDER, __fls(num_pages));
num_pages;
order = min_t(unsigned int, order, __fls(num_pages))) {
bool apply_caching = false;
@@ -507,7 +507,7 @@ void ttm_pool_init(struct ttm_pool *pool, struct device *dev,

if (use_dma_alloc) {
for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i)
- for (j = 0; j < MAX_ORDER; ++j)
+ for (j = 0; j <= MAX_ORDER; ++j)
ttm_pool_type_init(&pool->caching[i].orders[j],
pool, i, j);
}
@@ -527,7 +527,7 @@ void ttm_pool_fini(struct ttm_pool *pool)

if (pool->use_dma_alloc) {
for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i)
- for (j = 0; j < MAX_ORDER; ++j)
+ for (j = 0; j <= MAX_ORDER; ++j)
ttm_pool_type_fini(&pool->caching[i].orders[j]);
}

@@ -581,7 +581,7 @@ static void ttm_pool_debugfs_header(struct seq_file *m)
unsigned int i;

seq_puts(m, "\t ");
- for (i = 0; i < MAX_ORDER; ++i)
+ for (i = 0; i <= MAX_ORDER; ++i)
seq_printf(m, " ---%2u---", i);
seq_puts(m, "\n");
}
@@ -592,7 +592,7 @@ static void ttm_pool_debugfs_orders(struct ttm_pool_type *pt,
{
unsigned int i;

- for (i = 0; i < MAX_ORDER; ++i)
+ for (i = 0; i <= MAX_ORDER; ++i)
seq_printf(m, " %8u", ttm_pool_type_count(&pt[i]));
seq_puts(m, "\n");
}
@@ -701,7 +701,7 @@ int ttm_pool_mgr_init(unsigned long num_pages)
spin_lock_init(&shrinker_lock);
INIT_LIST_HEAD(&shrinker_list);

- for (i = 0; i < MAX_ORDER; ++i) {
+ for (i = 0; i <= MAX_ORDER; ++i) {
ttm_pool_type_init(&global_write_combined[i], NULL,
ttm_write_combined, i);
ttm_pool_type_init(&global_uncached[i], NULL, ttm_uncached, i);
@@ -734,7 +734,7 @@ void ttm_pool_mgr_fini(void)
{
unsigned int i;

- for (i = 0; i < MAX_ORDER; ++i) {
+ for (i = 0; i <= MAX_ORDER; ++i) {
ttm_pool_type_fini(&global_write_combined[i]);
ttm_pool_type_fini(&global_uncached[i]);

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index cd48590ada30..c5ea361bf757 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -182,7 +182,7 @@
#ifdef CONFIG_CMA_ALIGNMENT
#define Q_MAX_SZ_SHIFT (PAGE_SHIFT + CONFIG_CMA_ALIGNMENT)
#else
-#define Q_MAX_SZ_SHIFT (PAGE_SHIFT + MAX_ORDER - 1)
+#define Q_MAX_SZ_SHIFT (PAGE_SHIFT + MAX_ORDER)
#endif

/*
diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c
index 5ff09de6c48f..c867432919d8 100644
--- a/drivers/irqchip/irq-gic-v3-its.c
+++ b/drivers/irqchip/irq-gic-v3-its.c
@@ -2438,8 +2438,8 @@ static bool its_parse_indirect_baser(struct its_node *its,
* feature is not supported by hardware.
*/
new_order = max_t(u32, get_order(esz << ids), new_order);
- if (new_order >= MAX_ORDER) {
- new_order = MAX_ORDER - 1;
+ if (new_order > MAX_ORDER) {
+ new_order = MAX_ORDER;
ids = ilog2(PAGE_ORDER_TO_SIZE(new_order) / (int)esz);
pr_warn("ITS@%pa: %s Table too large, reduce ids %llu->%u\n",
&its->phys_base, its_base_type_string[type],
diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
index acd6d6b47434..eee05abbc0be 100644
--- a/drivers/md/dm-bufio.c
+++ b/drivers/md/dm-bufio.c
@@ -407,7 +407,7 @@ static void __cache_size_refresh(void)
* If the allocation may fail we use __get_free_pages. Memory fragmentation
* won't have a fatal effect here, but it just causes flushes of some other
* buffers and more I/O will be performed. Don't use __get_free_pages if it
- * always fails (i.e. order >= MAX_ORDER).
+ * always fails (i.e. order > MAX_ORDER).
*
* If the allocation shouldn't fail we use __vmalloc. This is only for the
* initial reserve allocation, so there's no risk of wasting all vmalloc
diff --git a/drivers/misc/genwqe/card_utils.c b/drivers/misc/genwqe/card_utils.c
index 1167463f26fb..361514cd575c 100644
--- a/drivers/misc/genwqe/card_utils.c
+++ b/drivers/misc/genwqe/card_utils.c
@@ -210,7 +210,7 @@ u32 genwqe_crc32(u8 *buff, size_t len, u32 init)
void *__genwqe_alloc_consistent(struct genwqe_dev *cd, size_t size,
dma_addr_t *dma_handle)
{
- if (get_order(size) >= MAX_ORDER)
+ if (get_order(size) > MAX_ORDER)
return NULL;

return dma_alloc_coherent(&cd->pci_dev->dev, size, dma_handle,
diff --git a/drivers/net/ethernet/ibm/ibmvnic.h b/drivers/net/ethernet/ibm/ibmvnic.h
index e5c6ff3d0c47..608f9df67eb8 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.h
+++ b/drivers/net/ethernet/ibm/ibmvnic.h
@@ -75,7 +75,7 @@
* pool for the 4MB. Thus the 16 Rx and Tx queues require 32 * 5 = 160
* plus 16 for the TSO pools for a total of 176 LTB mappings per VNIC.
*/
-#define IBMVNIC_ONE_LTB_MAX ((u32)((1 << (MAX_ORDER - 1)) * PAGE_SIZE))
+#define IBMVNIC_ONE_LTB_MAX ((u32)((1 << MAX_ORDER) * PAGE_SIZE))
#define IBMVNIC_ONE_LTB_SIZE min((u32)(8 << 20), IBMVNIC_ONE_LTB_MAX)
#define IBMVNIC_LTB_SET_SIZE (38 << 20)

diff --git a/drivers/video/fbdev/hyperv_fb.c b/drivers/video/fbdev/hyperv_fb.c
index 886c564787f1..a852ab6c1f52 100644
--- a/drivers/video/fbdev/hyperv_fb.c
+++ b/drivers/video/fbdev/hyperv_fb.c
@@ -944,8 +944,8 @@ static phys_addr_t hvfb_get_phymem(struct hv_device *hdev,
if (request_size == 0)
return -1;

- if (order < MAX_ORDER) {
- /* Call alloc_pages if the size is less than 2^MAX_ORDER */
+ if (order <= MAX_ORDER) {
+ /* Call alloc_pages if the size is no greater than 2^MAX_ORDER */
page = alloc_pages(GFP_KERNEL | __GFP_ZERO, order);
if (!page)
return -1;
@@ -975,7 +975,7 @@ static void hvfb_release_phymem(struct hv_device *hdev,
{
unsigned int order = get_order(size);

- if (order < MAX_ORDER)
+ if (order <= MAX_ORDER)
__free_pages(pfn_to_page(paddr >> PAGE_SHIFT), order);
else
dma_free_coherent(&hdev->device,
diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 3f78a3a1eb75..5b15936a5214 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -33,7 +33,7 @@
#define VIRTIO_BALLOON_FREE_PAGE_ALLOC_FLAG (__GFP_NORETRY | __GFP_NOWARN | \
__GFP_NOMEMALLOC)
/* The order of free page blocks to report to host */
-#define VIRTIO_BALLOON_HINT_BLOCK_ORDER (MAX_ORDER - 1)
+#define VIRTIO_BALLOON_HINT_BLOCK_ORDER MAX_ORDER
/* The size of a free page block in bytes */
#define VIRTIO_BALLOON_HINT_BLOCK_BYTES \
(1 << (VIRTIO_BALLOON_HINT_BLOCK_ORDER + PAGE_SHIFT))
diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index 0c2892ec6817..0e1253e3423a 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -1120,13 +1120,13 @@ static void virtio_mem_clear_fake_offline(unsigned long pfn,
*/
static void virtio_mem_fake_online(unsigned long pfn, unsigned long nr_pages)
{
- unsigned long order = MAX_ORDER - 1;
+ unsigned long order = MAX_ORDER;
unsigned long i;

/*
* We might get called for ranges that don't cover properly aligned
- * MAX_ORDER - 1 pages; however, we can only online properly aligned
- * pages with an order of MAX_ORDER - 1 at maximum.
+ * MAX_ORDER pages; however, we can only online properly aligned
+ * pages with an order of MAX_ORDER at maximum.
*/
while (!IS_ALIGNED(pfn | nr_pages, 1 << order))
order--;
@@ -1237,7 +1237,7 @@ static void virtio_mem_online_page(struct virtio_mem *vm,
bool do_online;

/*
- * We can get called with any order up to MAX_ORDER - 1. If our
+ * We can get called with any order up to MAX_ORDER. If our
* subblock size is smaller than that and we have a mixture of plugged
* and unplugged subblocks within such a page, we have to process in
* smaller granularity. In that case we'll adjust the order exactly once
diff --git a/fs/ramfs/file-nommu.c b/fs/ramfs/file-nommu.c
index ba3525ccc27e..b3b7519a6519 100644
--- a/fs/ramfs/file-nommu.c
+++ b/fs/ramfs/file-nommu.c
@@ -70,7 +70,7 @@ int ramfs_nommu_expand_for_mapping(struct inode *inode, size_t newsize)

/* make various checks */
order = get_order(newsize);
- if (unlikely(order >= MAX_ORDER))
+ if (unlikely(order > MAX_ORDER))
return -EFBIG;

ret = inode_newsize_ok(inode, newsize);
diff --git a/include/drm/ttm/ttm_pool.h b/include/drm/ttm/ttm_pool.h
index ef09b23d29e3..8ce14f9d202a 100644
--- a/include/drm/ttm/ttm_pool.h
+++ b/include/drm/ttm/ttm_pool.h
@@ -72,7 +72,7 @@ struct ttm_pool {
bool use_dma32;

struct {
- struct ttm_pool_type orders[MAX_ORDER];
+ struct ttm_pool_type orders[MAX_ORDER + 1];
} caching[TTM_NUM_CACHING_TYPES];
};

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 3ec981a0d8b3..68485a264865 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -746,7 +746,7 @@ static inline unsigned huge_page_shift(struct hstate *h)

static inline bool hstate_is_gigantic(struct hstate *h)
{
- return huge_page_order(h) >= MAX_ORDER;
+ return huge_page_order(h) > MAX_ORDER;
}

static inline unsigned int pages_per_huge_page(const struct hstate *h)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index ca285ed3c6e0..e93faa3d7f1d 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -25,11 +25,11 @@

/* Free memory management - zoned buddy allocator. */
#ifndef CONFIG_ARCH_FORCE_MAX_ORDER
-#define MAX_ORDER 11
+#define MAX_ORDER 10
#else
#define MAX_ORDER CONFIG_ARCH_FORCE_MAX_ORDER
#endif
-#define MAX_ORDER_NR_PAGES (1 << (MAX_ORDER - 1))
+#define MAX_ORDER_NR_PAGES (1 << MAX_ORDER)

/*
* PAGE_ALLOC_COSTLY_ORDER is the order at which allocations are deemed
@@ -92,7 +92,7 @@ static inline bool migratetype_is_mergeable(int mt)
}

#define for_each_migratetype_order(order, type) \
- for (order = 0; order < MAX_ORDER; order++) \
+ for (order = 0; order <= MAX_ORDER; order++) \
for (type = 0; type < MIGRATE_TYPES; type++)

extern int page_group_by_mobility_disabled;
@@ -632,7 +632,7 @@ struct zone {
ZONE_PADDING(_pad1_)

/* free areas of different sizes */
- struct free_area free_area[MAX_ORDER];
+ struct free_area free_area[MAX_ORDER + 1];

/* zone flags, see below */
unsigned long flags;
@@ -1379,7 +1379,7 @@ static inline bool movable_only_nodes(nodemask_t *nodes)
#define SECTION_BLOCKFLAGS_BITS \
((1UL << (PFN_SECTION_SHIFT - pageblock_order)) * NR_PAGEBLOCK_BITS)

-#if (MAX_ORDER - 1 + PAGE_SHIFT) > SECTION_SIZE_BITS
+#if (MAX_ORDER + PAGE_SHIFT) > SECTION_SIZE_BITS
#error Allocator MAX_ORDER exceeds SECTION_SIZE
#endif

diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h
index 83c7248053a1..940efcffd374 100644
--- a/include/linux/pageblock-flags.h
+++ b/include/linux/pageblock-flags.h
@@ -41,14 +41,14 @@ extern unsigned int pageblock_order;
* Huge pages are a constant size, but don't exceed the maximum allocation
* granularity.
*/
-#define pageblock_order min_t(unsigned int, HUGETLB_PAGE_ORDER, MAX_ORDER - 1)
+#define pageblock_order min_t(unsigned int, HUGETLB_PAGE_ORDER, MAX_ORDER)

#endif /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */

#else /* CONFIG_HUGETLB_PAGE */

/* If huge pages are not used, group by MAX_ORDER_NR_PAGES */
-#define pageblock_order (MAX_ORDER-1)
+#define pageblock_order MAX_ORDER

#endif /* CONFIG_HUGETLB_PAGE */

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 0fefdf528e0d..568b5dfb3bd9 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -251,8 +251,8 @@ static inline unsigned int arch_slab_minalign(void)
* to do various tricks to work around compiler limitations in order to
* ensure proper constant folding.
*/
-#define KMALLOC_SHIFT_HIGH ((MAX_ORDER + PAGE_SHIFT - 1) <= 25 ? \
- (MAX_ORDER + PAGE_SHIFT - 1) : 25)
+#define KMALLOC_SHIFT_HIGH ((MAX_ORDER + PAGE_SHIFT) <= 25 ? \
+ (MAX_ORDER + PAGE_SHIFT) : 25)
#define KMALLOC_SHIFT_MAX KMALLOC_SHIFT_HIGH
#ifndef KMALLOC_SHIFT_LOW
#define KMALLOC_SHIFT_LOW 5
@@ -265,7 +265,7 @@ static inline unsigned int arch_slab_minalign(void)
* (PAGE_SIZE*2). Larger requests are passed to the page allocator.
*/
#define KMALLOC_SHIFT_HIGH (PAGE_SHIFT + 1)
-#define KMALLOC_SHIFT_MAX (MAX_ORDER + PAGE_SHIFT - 1)
+#define KMALLOC_SHIFT_MAX (MAX_ORDER + PAGE_SHIFT)
#ifndef KMALLOC_SHIFT_LOW
#define KMALLOC_SHIFT_LOW 3
#endif
@@ -278,7 +278,7 @@ static inline unsigned int arch_slab_minalign(void)
* be allocated from the same page.
*/
#define KMALLOC_SHIFT_HIGH PAGE_SHIFT
-#define KMALLOC_SHIFT_MAX (MAX_ORDER + PAGE_SHIFT - 1)
+#define KMALLOC_SHIFT_MAX (MAX_ORDER + PAGE_SHIFT)
#ifndef KMALLOC_SHIFT_LOW
#define KMALLOC_SHIFT_LOW 3
#endif
diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index a0eb4d5cf557..245e2ee20718 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -471,7 +471,7 @@ static int __init crash_save_vmcoreinfo_init(void)
VMCOREINFO_OFFSET(list_head, prev);
VMCOREINFO_OFFSET(vmap_area, va_start);
VMCOREINFO_OFFSET(vmap_area, list);
- VMCOREINFO_LENGTH(zone.free_area, MAX_ORDER);
+ VMCOREINFO_LENGTH(zone.free_area, MAX_ORDER + 1);
log_buf_vmcoreinfo_setup();
VMCOREINFO_LENGTH(free_area.free_list, MIGRATE_TYPES);
VMCOREINFO_NUMBER(NR_FREE_PAGES);
diff --git a/kernel/dma/pool.c b/kernel/dma/pool.c
index 1bf6de398986..e20f168a34c7 100644
--- a/kernel/dma/pool.c
+++ b/kernel/dma/pool.c
@@ -84,8 +84,8 @@ static int atomic_pool_expand(struct gen_pool *pool, size_t pool_size,
void *addr;
int ret = -ENOMEM;

- /* Cannot allocate larger than MAX_ORDER-1 */
- order = min(get_order(pool_size), MAX_ORDER-1);
+ /* Cannot allocate larger than MAX_ORDER */
+ order = min(get_order(pool_size), MAX_ORDER);

do {
pool_size = 1 << (PAGE_SHIFT + order);
@@ -190,7 +190,7 @@ static int __init dma_atomic_pool_init(void)

/*
* If coherent_pool was not used on the command line, default the pool
- * sizes to 128KB per 1GB of memory, min 128KB, max MAX_ORDER-1.
+ * sizes to 128KB per 1GB of memory, min 128KB, max MAX_ORDER.
*/
if (!atomic_pool_size) {
unsigned long pages = totalram_pages() / (SZ_1G / SZ_128K);
diff --git a/mm/Kconfig b/mm/Kconfig
index 0331f1461f81..bbe31e85afee 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -307,7 +307,7 @@ config SHUFFLE_PAGE_ALLOCATOR
the presence of a memory-side-cache. There are also incidental
security benefits as it reduces the predictability of page
allocations to compliment SLAB_FREELIST_RANDOM, but the
- default granularity of shuffling on the "MAX_ORDER - 1" i.e,
+ default granularity of shuffling on the "MAX_ORDER" i.e,
10th order of pages is selected based on cache utilization
benefits on x86.

@@ -621,8 +621,8 @@ config HUGETLB_PAGE_SIZE_VARIABLE
HUGETLB_PAGE_ORDER when there are multiple HugeTLB page sizes available
on a platform.

- Note that the pageblock_order cannot exceed MAX_ORDER - 1 and will be
- clamped down to MAX_ORDER - 1.
+ Note that the pageblock_order cannot exceed MAX_ORDER and will be
+ clamped down to MAX_ORDER.

config CONTIG_ALLOC
def_bool (MEMORY_ISOLATION && COMPACTION) || CMA
diff --git a/mm/compaction.c b/mm/compaction.c
index 640fa76228dd..4a282c658ac4 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -586,7 +586,7 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
if (PageCompound(page)) {
const unsigned int order = compound_order(page);

- if (likely(order < MAX_ORDER)) {
+ if (likely(order <= MAX_ORDER)) {
blockpfn += (1UL << order) - 1;
cursor += (1UL << order) - 1;
}
@@ -941,7 +941,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
* a valid page order. Consider only values in the
* valid order range to prevent low_pfn overflow.
*/
- if (freepage_order > 0 && freepage_order < MAX_ORDER)
+ if (freepage_order > 0 && freepage_order <= MAX_ORDER)
low_pfn += (1UL << freepage_order) - 1;
continue;
}
@@ -957,7 +957,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
if (PageCompound(page) && !cc->alloc_contig) {
const unsigned int order = compound_order(page);

- if (likely(order < MAX_ORDER))
+ if (likely(order <= MAX_ORDER))
low_pfn += (1UL << order) - 1;
goto isolate_fail;
}
@@ -2118,7 +2118,7 @@ static enum compact_result __compact_finished(struct compact_control *cc)

/* Direct compactor: Is a suitable page free? */
ret = COMPACT_NO_SUITABLE_PAGE;
- for (order = cc->order; order < MAX_ORDER; order++) {
+ for (order = cc->order; order <= MAX_ORDER; order++) {
struct free_area *area = &cc->zone->free_area[order];
bool can_steal;

diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index dc7df1254f0a..7e53c4a42047 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -1094,7 +1094,7 @@ debug_vm_pgtable_alloc_huge_page(struct pgtable_debug_args *args, int order)
struct page *page = NULL;

#ifdef CONFIG_CONTIG_ALLOC
- if (order >= MAX_ORDER) {
+ if (order > MAX_ORDER) {
page = alloc_contig_pages((1 << order), GFP_KERNEL,
first_online_node, NULL);
if (page) {
@@ -1104,7 +1104,7 @@ debug_vm_pgtable_alloc_huge_page(struct pgtable_debug_args *args, int order)
}
#endif

- if (order < MAX_ORDER)
+ if (order <= MAX_ORDER)
page = alloc_pages(GFP_KERNEL, order);

return page;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 3222b40a0f6d..9b1655950049 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -469,7 +469,7 @@ static int __init hugepage_init(void)
/*
* hugepages can't be allocated by the buddy allocator
*/
- MAYBE_BUILD_BUG_ON(HPAGE_PMD_ORDER >= MAX_ORDER);
+ MAYBE_BUILD_BUG_ON(HPAGE_PMD_ORDER > MAX_ORDER);
/*
* we use page->mapping and page->index in second tail page
* as list_head: assuming THP order >= 2
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 28516881a1b2..15ff582687a3 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1903,7 +1903,7 @@ pgoff_t hugetlb_basepage_index(struct page *page)
pgoff_t index = page_index(page_head);
unsigned long compound_idx;

- if (compound_order(page_head) >= MAX_ORDER)
+ if (compound_order(page_head) > MAX_ORDER)
compound_idx = page_to_pfn(page) - page_to_pfn(page_head);
else
compound_idx = page - page_head;
@@ -4313,7 +4313,7 @@ static int __init default_hugepagesz_setup(char *s)
* The number of default huge pages (for this size) could have been
* specified as the first hugetlb parameter: hugepages=X. If so,
* then default_hstate_max_huge_pages is set. If the default huge
- * page size is gigantic (>= MAX_ORDER), then the pages must be
+ * page size is gigantic (> MAX_ORDER), then the pages must be
* allocated here from bootmem allocator.
*/
if (default_hstate_max_huge_pages) {
diff --git a/mm/memblock.c b/mm/memblock.c
index b5d3026979fc..d1525463c05e 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -2030,7 +2030,7 @@ static void __init __free_pages_memory(unsigned long start, unsigned long end)
int order;

while (start < end) {
- order = min(MAX_ORDER - 1UL, __ffs(start));
+ order = min_t(unsigned long, MAX_ORDER, __ffs(start));

while (start + (1UL << order) > end)
order--;
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index fad6d1f2262a..5540499007ae 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -596,7 +596,7 @@ static void online_pages_range(unsigned long start_pfn, unsigned long nr_pages)
unsigned long pfn;

/*
- * Online the pages in MAX_ORDER - 1 aligned chunks. The callback might
+ * Online the pages in MAX_ORDER aligned chunks. The callback might
* decide to not expose all pages to the buddy (e.g., expose them
* later). We account all pages as being online and belonging to this
* zone ("present").
@@ -605,7 +605,7 @@ static void online_pages_range(unsigned long start_pfn, unsigned long nr_pages)
* this and the first chunk to online will be pageblock_nr_pages.
*/
for (pfn = start_pfn; pfn < end_pfn;) {
- int order = min(MAX_ORDER - 1UL, __ffs(pfn));
+ int order = min_t(unsigned long, MAX_ORDER, __ffs(pfn));

(*online_page_callback)(pfn_to_page(pfn), order);
pfn += (1UL << order);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7e030d7cac81..07ad8074950f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -847,7 +847,7 @@ static int __init debug_guardpage_minorder_setup(char *buf)
{
unsigned long res;

- if (kstrtoul(buf, 10, &res) < 0 || res > MAX_ORDER / 2) {
+ if (kstrtoul(buf, 10, &res) < 0 || res > (MAX_ORDER + 1) / 2) {
pr_err("Bad debug_guardpage_minorder value\n");
return 0;
}
@@ -1065,7 +1065,7 @@ buddy_merge_likely(unsigned long pfn, unsigned long buddy_pfn,
unsigned long higher_page_pfn;
struct page *higher_page;

- if (order >= MAX_ORDER - 2)
+ if (order >= MAX_ORDER - 1)
return false;

higher_page_pfn = buddy_pfn & pfn;
@@ -1120,7 +1120,7 @@ static inline void __free_one_page(struct page *page,
VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page);
VM_BUG_ON_PAGE(bad_range(zone, page), page);

- while (order < MAX_ORDER - 1) {
+ while (order < MAX_ORDER) {
if (compaction_capture(capc, page, order, migratetype)) {
__mod_zone_freepage_state(zone, -(1 << order),
migratetype);
@@ -2559,7 +2559,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
struct page *page;

/* Find a page of the appropriate size in the preferred list */
- for (current_order = order; current_order < MAX_ORDER; ++current_order) {
+ for (current_order = order; current_order <= MAX_ORDER; ++current_order) {
area = &(zone->free_area[current_order]);
page = get_page_from_free_area(area, migratetype);
if (!page)
@@ -2934,7 +2934,7 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
continue;

spin_lock_irqsave(&zone->lock, flags);
- for (order = 0; order < MAX_ORDER; order++) {
+ for (order = 0; order <= MAX_ORDER; order++) {
struct free_area *area = &(zone->free_area[order]);

page = get_page_from_free_area(area, MIGRATE_HIGHATOMIC);
@@ -3018,7 +3018,7 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype,
* approximates finding the pageblock with the most free pages, which
* would be too costly to do exactly.
*/
- for (current_order = MAX_ORDER - 1; current_order >= min_order;
+ for (current_order = MAX_ORDER; current_order >= min_order;
--current_order) {
area = &(zone->free_area[current_order]);
fallback_mt = find_suitable_fallback(area, current_order,
@@ -3044,7 +3044,7 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype,
return false;

find_smallest:
- for (current_order = order; current_order < MAX_ORDER;
+ for (current_order = order; current_order <= MAX_ORDER;
current_order++) {
area = &(zone->free_area[current_order]);
fallback_mt = find_suitable_fallback(area, current_order,
@@ -3057,7 +3057,7 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype,
* This should not happen - we already found a suitable fallback
* when looking for the largest page.
*/
- VM_BUG_ON(current_order == MAX_ORDER);
+ VM_BUG_ON(current_order == MAX_ORDER + 1);

do_steal:
page = get_page_from_free_area(area, fallback_mt);
@@ -4005,7 +4005,7 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
return true;

/* For a high-order request, check at least one suitable page is free */
- for (o = order; o < MAX_ORDER; o++) {
+ for (o = order; o <= MAX_ORDER; o++) {
struct free_area *area = &z->free_area[o];
int mt;

@@ -5480,7 +5480,7 @@ struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid,
* There are several places where we assume that the order value is sane
* so bail out early if the request is out of bound.
*/
- if (WARN_ON_ONCE_GFP(order >= MAX_ORDER, gfp))
+ if (WARN_ON_ONCE_GFP(order > MAX_ORDER, gfp))
return NULL;

gfp &= gfp_allowed_mask;
@@ -6183,8 +6183,8 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)

for_each_populated_zone(zone) {
unsigned int order;
- unsigned long nr[MAX_ORDER], flags, total = 0;
- unsigned char types[MAX_ORDER];
+ unsigned long nr[MAX_ORDER + 1], flags, total = 0;
+ unsigned char types[MAX_ORDER + 1];

if (show_mem_node_skip(filter, zone_to_nid(zone), nodemask))
continue;
@@ -6192,7 +6192,7 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
printk(KERN_CONT "%s: ", zone->name);

spin_lock_irqsave(&zone->lock, flags);
- for (order = 0; order < MAX_ORDER; order++) {
+ for (order = 0; order <= MAX_ORDER; order++) {
struct free_area *area = &zone->free_area[order];
int type;

@@ -6206,7 +6206,7 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
}
}
spin_unlock_irqrestore(&zone->lock, flags);
- for (order = 0; order < MAX_ORDER; order++) {
+ for (order = 0; order <= MAX_ORDER; order++) {
printk(KERN_CONT "%lu*%lukB ",
nr[order], K(1UL) << order);
if (nr[order])
@@ -7545,7 +7545,7 @@ static inline void setup_usemap(struct zone *zone) {}
/* Initialise the number of pages represented by NR_PAGEBLOCK_BITS */
void __init set_pageblock_order(void)
{
- unsigned int order = MAX_ORDER - 1;
+ unsigned int order = MAX_ORDER;

/* Check that pageblock_nr_pages has not already been setup */
if (pageblock_order)
@@ -9051,7 +9051,7 @@ void *__init alloc_large_system_hash(const char *tablename,
else
table = memblock_alloc_raw(size,
SMP_CACHE_BYTES);
- } else if (get_order(size) >= MAX_ORDER || hashdist) {
+ } else if (get_order(size) > MAX_ORDER || hashdist) {
table = vmalloc_huge(size, gfp_flags);
virt = true;
if (table)
@@ -9265,7 +9265,7 @@ int alloc_contig_range(unsigned long start, unsigned long end,
order = 0;
outer_start = start;
while (!PageBuddy(pfn_to_page(outer_start))) {
- if (++order >= MAX_ORDER) {
+ if (++order > MAX_ORDER) {
outer_start = start;
break;
}
@@ -9524,7 +9524,7 @@ bool is_free_buddy_page(struct page *page)
unsigned long pfn = page_to_pfn(page);
unsigned int order;

- for (order = 0; order < MAX_ORDER; order++) {
+ for (order = 0; order <= MAX_ORDER; order++) {
struct page *page_head = page - (pfn & ((1 << order) - 1));

if (PageBuddy(page_head) &&
@@ -9532,7 +9532,7 @@ bool is_free_buddy_page(struct page *page)
break;
}

- return order < MAX_ORDER;
+ return order <= MAX_ORDER;
}
EXPORT_SYMBOL(is_free_buddy_page);

@@ -9583,7 +9583,7 @@ bool take_page_off_buddy(struct page *page)
bool ret = false;

spin_lock_irqsave(&zone->lock, flags);
- for (order = 0; order < MAX_ORDER; order++) {
+ for (order = 0; order <= MAX_ORDER; order++) {
struct page *page_head = page - (pfn & ((1 << order) - 1));
int page_order = buddy_order(page_head);

diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index 9d73dc38e3d7..8d33120a81b2 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -226,7 +226,7 @@ static void unset_migratetype_isolate(struct page *page, int migratetype)
*/
if (PageBuddy(page)) {
order = buddy_order(page);
- if (order >= pageblock_order && order < MAX_ORDER - 1) {
+ if (order >= pageblock_order && order < MAX_ORDER) {
buddy = find_buddy_page_pfn(page, page_to_pfn(page),
order, NULL);
if (buddy && !is_migrate_isolate_page(buddy)) {
@@ -289,11 +289,11 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
* @skip_isolation: the flag to skip the pageblock isolation in second
* isolate_single_pageblock()
*
- * Free and in-use pages can be as big as MAX_ORDER-1 and contain more than one
+ * Free and in-use pages can be as big as MAX_ORDER and contain more than one
* pageblock. When not all pageblocks within a page are isolated at the same
* time, free page accounting can go wrong. For example, in the case of
- * MAX_ORDER-1 = pageblock_order + 1, a MAX_ORDER-1 page has two pagelbocks.
- * [ MAX_ORDER-1 ]
+ * MAX_ORDER = pageblock_order + 1, a MAX_ORDER page has two pageblocks.
+ * [ MAX_ORDER ]
* [ pageblock0 | pageblock1 ]
* When either pageblock is isolated, if it is a free page, the page is not
* split into separate migratetype lists, which is supposed to; if it is an
@@ -450,7 +450,7 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
* the free page to the right migratetype list.
*
* head_pfn is not used here as a hugetlb page order
- * can be bigger than MAX_ORDER-1, but after it is
+ * can be bigger than MAX_ORDER, but after it is
* freed, the free page order is not. Use pfn within
* the range to find the head of the free page.
*/
@@ -458,7 +458,7 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
outer_pfn = pfn;
while (!PageBuddy(pfn_to_page(outer_pfn))) {
/* stop if we cannot find the free page */
- if (++order >= MAX_ORDER)
+ if (++order > MAX_ORDER)
goto failed;
outer_pfn &= ~0UL << order;
}
@@ -639,7 +639,7 @@ int test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn,
int ret;

/*
- * Note: pageblock_nr_pages != MAX_ORDER. Then, chunks of free pages
+ * Note: pageblock_order != MAX_ORDER. Then, chunks of free pages
* are not aligned to pageblock_nr_pages.
* Then we just check migratetype first.
*/
diff --git a/mm/page_owner.c b/mm/page_owner.c
index 223bbf8674ec..80cf367362c3 100644
--- a/mm/page_owner.c
+++ b/mm/page_owner.c
@@ -318,7 +318,7 @@ void pagetypeinfo_showmixedcount_print(struct seq_file *m,
unsigned long freepage_order;

freepage_order = buddy_order_unsafe(page);
- if (freepage_order < MAX_ORDER)
+ if (freepage_order <= MAX_ORDER)
pfn += (1UL << freepage_order) - 1;
continue;
}
@@ -552,7 +552,7 @@ read_page_owner(struct file *file, char __user *buf, size_t count, loff_t *ppos)
if (PageBuddy(page)) {
unsigned long freepage_order = buddy_order_unsafe(page);

- if (freepage_order < MAX_ORDER)
+ if (freepage_order <= MAX_ORDER)
pfn += (1UL << freepage_order) - 1;
continue;
}
@@ -645,7 +645,7 @@ static void init_pages_in_zone(pg_data_t *pgdat, struct zone *zone)
if (PageBuddy(page)) {
unsigned long order = buddy_order_unsafe(page);

- if (order > 0 && order < MAX_ORDER)
+ if (order > 0 && order <= MAX_ORDER)
pfn += (1UL << order) - 1;
continue;
}
diff --git a/mm/page_reporting.c b/mm/page_reporting.c
index 382958eef8a9..d52a55bca6d5 100644
--- a/mm/page_reporting.c
+++ b/mm/page_reporting.c
@@ -11,7 +11,7 @@
#include "page_reporting.h"
#include "internal.h"

-unsigned int page_reporting_order = MAX_ORDER;
+unsigned int page_reporting_order = MAX_ORDER + 1;
module_param(page_reporting_order, uint, 0644);
MODULE_PARM_DESC(page_reporting_order, "Set page reporting order");

@@ -244,7 +244,7 @@ page_reporting_process_zone(struct page_reporting_dev_info *prdev,
return err;

/* Process each free list starting from lowest order/mt */
- for (order = page_reporting_order; order < MAX_ORDER; order++) {
+ for (order = page_reporting_order; order <= MAX_ORDER; order++) {
for (mt = 0; mt < MIGRATE_TYPES; mt++) {
/* We do not pull pages from the isolate free list */
if (is_migrate_isolate(mt))
diff --git a/mm/shuffle.h b/mm/shuffle.h
index cec62984f7d3..a6bdf54f96f1 100644
--- a/mm/shuffle.h
+++ b/mm/shuffle.h
@@ -4,7 +4,7 @@
#define _MM_SHUFFLE_H
#include <linux/jump_label.h>

-#define SHUFFLE_ORDER (MAX_ORDER-1)
+#define SHUFFLE_ORDER MAX_ORDER

#ifdef CONFIG_SHUFFLE_PAGE_ALLOCATOR
DECLARE_STATIC_KEY_FALSE(page_alloc_shuffle_key);
diff --git a/mm/slab.c b/mm/slab.c
index 10e96137b44f..530f418a4930 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -466,7 +466,7 @@ static int __init slab_max_order_setup(char *str)
{
get_option(&str, &slab_max_order);
slab_max_order = slab_max_order < 0 ? 0 :
- min(slab_max_order, MAX_ORDER - 1);
+ min(slab_max_order, MAX_ORDER);
slab_max_order_set = true;

return 1;
diff --git a/mm/slub.c b/mm/slub.c
index 862dbd9af4f5..5acf5407cbc6 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3877,7 +3877,7 @@ static inline int calculate_order(unsigned int size)
* Doh this slab cannot be placed using slub_max_order.
*/
order = calc_slab_order(size, 1, MAX_ORDER, 1);
- if (order < MAX_ORDER)
+ if (order <= MAX_ORDER)
return order;
return -ENOSYS;
}
@@ -4388,7 +4388,7 @@ __setup("slub_min_order=", setup_slub_min_order);
static int __init setup_slub_max_order(char *str)
{
get_option(&str, (int *)&slub_max_order);
- slub_max_order = min(slub_max_order, (unsigned int)MAX_ORDER - 1);
+ slub_max_order = min_t(unsigned int, slub_max_order, MAX_ORDER);

return 1;
}
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 90af9a8572f5..9fc206477fb7 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1068,7 +1068,7 @@ static void fill_contig_page_info(struct zone *zone,
info->free_blocks_total = 0;
info->free_blocks_suitable = 0;

- for (order = 0; order < MAX_ORDER; order++) {
+ for (order = 0; order <= MAX_ORDER; order++) {
unsigned long blocks;

/*
@@ -1101,7 +1101,7 @@ static int __fragmentation_index(unsigned int order, struct contig_page_info *in
{
unsigned long requested = 1UL << order;

- if (WARN_ON_ONCE(order >= MAX_ORDER))
+ if (WARN_ON_ONCE(order > MAX_ORDER))
return 0;

if (!info->free_blocks_total)
@@ -1474,7 +1474,7 @@ static void frag_show_print(struct seq_file *m, pg_data_t *pgdat,
int order;

seq_printf(m, "Node %d, zone %8s ", pgdat->node_id, zone->name);
- for (order = 0; order < MAX_ORDER; ++order)
+ for (order = 0; order <= MAX_ORDER; ++order)
/*
* Access to nr_free is lockless as nr_free is used only for
* printing purposes. Use data_race to avoid KCSAN warning.
@@ -1503,7 +1503,7 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
pgdat->node_id,
zone->name,
migratetype_names[mtype]);
- for (order = 0; order < MAX_ORDER; ++order) {
+ for (order = 0; order <= MAX_ORDER; ++order) {
unsigned long freecount = 0;
struct free_area *area;
struct list_head *curr;
@@ -1543,7 +1543,7 @@ static void pagetypeinfo_showfree(struct seq_file *m, void *arg)

/* Print header */
seq_printf(m, "%-43s ", "Free pages count per migrate type at order");
- for (order = 0; order < MAX_ORDER; ++order)
+ for (order = 0; order <= MAX_ORDER; ++order)
seq_printf(m, "%6d ", order);
seq_putc(m, '\n');

@@ -2168,7 +2168,7 @@ static void unusable_show_print(struct seq_file *m,
seq_printf(m, "Node %d, zone %8s ",
pgdat->node_id,
zone->name);
- for (order = 0; order < MAX_ORDER; ++order) {
+ for (order = 0; order <= MAX_ORDER; ++order) {
fill_contig_page_info(zone, order, &info);
index = unusable_free_index(order, &info);
seq_printf(m, "%d.%03d ", index / 1000, index % 1000);
@@ -2220,7 +2220,7 @@ static void extfrag_show_print(struct seq_file *m,
seq_printf(m, "Node %d, zone %8s ",
pgdat->node_id,
zone->name);
- for (order = 0; order < MAX_ORDER; ++order) {
+ for (order = 0; order <= MAX_ORDER; ++order) {
fill_contig_page_info(zone, order, &info);
index = __fragmentation_index(order, &info);
seq_printf(m, "%2d.%03d ", index / 1000, index % 1000);
diff --git a/net/smc/smc_ib.c b/net/smc/smc_ib.c
index 854772dd52fd..9b66d6aeeb1a 100644
--- a/net/smc/smc_ib.c
+++ b/net/smc/smc_ib.c
@@ -843,7 +843,7 @@ long smc_ib_setup_per_ibdev(struct smc_ib_device *smcibdev)
goto out;
/* the calculated number of cq entries fits to mlx5 cq allocation */
cqe_size_order = cache_line_size() == 128 ? 7 : 6;
- smc_order = MAX_ORDER - cqe_size_order - 1;
+ smc_order = MAX_ORDER - cqe_size_order;
if (SMC_MAX_CQE + 2 > (0x00000001 << smc_order) * PAGE_SIZE)
cqattr.cqe = (0x00000001 << smc_order) * PAGE_SIZE - 2;
smcibdev->roce_cq_send = ib_create_cq(smcibdev->ibdev,
diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
index 79e759aac543..e736847ef3ac 100755
--- a/scripts/checkpatch.pl
+++ b/scripts/checkpatch.pl
@@ -7368,6 +7368,14 @@ sub process {
}
}

+# check for MAX_ORDER uses as its semantics has changed.
+# MAX_ORDER now really means the max order of a page that can come out of
+# kernel buddy allocator
+ if ($line =~ /MAX_ORDER/) {
+ WARN("MAX_ORDER",
+ "MAX_ORDER has changed its semantics. The max order of a page that can be allocated from buddy allocator is MAX_ORDER instead of MAX_ORDER - 1.")
+ }
+
# Mode permission misuses where it seems decimal should be octal
# This uses a shortcut match to avoid unnecessary uses of a slow foreach loop
# o Ignore module_param*(...) uses with a decimal 0 permission as that has a
diff --git a/security/integrity/ima/ima_crypto.c b/security/integrity/ima/ima_crypto.c
index 64499056648a..51ad29940f05 100644
--- a/security/integrity/ima/ima_crypto.c
+++ b/security/integrity/ima/ima_crypto.c
@@ -38,7 +38,7 @@ static int param_set_bufsize(const char *val, const struct kernel_param *kp)

size = memparse(val, NULL);
order = get_order(size);
- if (order >= MAX_ORDER)
+ if (order > MAX_ORDER)
return -EINVAL;
ima_maxorder = order;
ima_bufsize = PAGE_SIZE << order;
diff --git a/tools/testing/memblock/linux/mmzone.h b/tools/testing/memblock/linux/mmzone.h
index 7c2eb5c9bb54..d79748b263e7 100644
--- a/tools/testing/memblock/linux/mmzone.h
+++ b/tools/testing/memblock/linux/mmzone.h
@@ -17,10 +17,10 @@ enum zone_type {
};

#define MAX_NR_ZONES __MAX_NR_ZONES
-#define MAX_ORDER 11
-#define MAX_ORDER_NR_PAGES (1 << (MAX_ORDER - 1))
+#define MAX_ORDER 10
+#define MAX_ORDER_NR_PAGES (1 << MAX_ORDER)

-#define pageblock_order (MAX_ORDER - 1)
+#define pageblock_order MAX_ORDER
#define pageblock_nr_pages BIT(pageblock_order)

struct zone {
--
2.35.1
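
The off-by-one conversion applied throughout the diff above can be summarized with a small userspace sketch. It is only an illustration, not kernel code: `OLD_MAX_ORDER`, `NEW_MAX_ORDER`, `struct toy_free_area` and every function name are hypothetical stand-ins, and the values 11 and 10 are simply the defaults shown in the include/linux/mmzone.h hunk. Under the old meaning MAX_ORDER is one past the largest order, under the new meaning it is the largest order itself, so array sizes gain a "+ 1", loops change from "<" to "<=", and the largest block size drops its "- 1".

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative values only, taken from the defaults in the mmzone.h hunk. */
#define OLD_MAX_ORDER 11	/* old meaning: one past the largest buddy order */
#define NEW_MAX_ORDER 10	/* new meaning: the largest buddy order itself */

/* Hypothetical stand-in for struct free_area: one slot per valid order. */
struct toy_free_area {
	unsigned long nr_free;
};

struct toy_free_area old_areas[OLD_MAX_ORDER];		/* orders 0..10 */
struct toy_free_area new_areas[NEW_MAX_ORDER + 1];	/* orders 0..10 */

/* Number of pages in the largest block under each convention. */
unsigned long old_max_order_nr_pages(void)
{
	return 1UL << (OLD_MAX_ORDER - 1);
}

unsigned long new_max_order_nr_pages(void)
{
	return 1UL << NEW_MAX_ORDER;
}

/* Validity check for a requested order under each convention. */
int old_order_ok(unsigned int order)
{
	return order < OLD_MAX_ORDER;
}

int new_order_ok(unsigned int order)
{
	return order <= NEW_MAX_ORDER;
}

/* Walk every valid order; note the exclusive vs. inclusive upper bound. */
unsigned long old_count_free(void)
{
	unsigned long total = 0;

	for (unsigned int order = 0; order < OLD_MAX_ORDER; order++)
		total += old_areas[order].nr_free;
	return total;
}

unsigned long new_count_free(void)
{
	unsigned long total = 0;

	for (unsigned int order = 0; order <= NEW_MAX_ORDER; order++)
		total += new_areas[order].nr_free;
	return total;
}
```

Both conventions describe the same allocator here: eleven free lists and a largest block of 1024 pages. Only the spelling of the bound changes.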

2022-08-11 23:21:26

by Zi Yan

Subject: [RFC PATCH v2 11/12] mm: introduce MIN_MAX_ORDER to replace MAX_ORDER as compile time constant.

From: Zi Yan <[email protected]>

For the remaining MAX_ORDER uses (described below), converting static arrays
to dynamic ones is either unnecessary or too much hassle. Add MIN_MAX_ORDER
to serve as a compile-time constant in place of MAX_ORDER.

The ARM64 hypervisor maintains its own free page list and does not import
any core kernel symbols, so the soon-to-be runtime variable MAX_ORDER is not
accessible in ARM64 hypervisor code. There is also no need for it to allocate
very large pages.

In SLAB/SLOB/SLUB, the 2-D array kmalloc_caches uses MAX_ORDER in its second
dimension. It is too much hassle to allocate memory for kmalloc_caches
before any proper memory allocator is set up.
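
A minimal userspace sketch of the constraint described above, under stated assumptions: a statically sized array needs a compile-time bound, so code that keeps such an array clamps the order it uses to that bound. `struct toy_pool`, `struct toy_list`, `toy_min()` and `toy_pool_init()` are hypothetical names loosely modeled on the hyp_pool hunk in this patch, and the value 10 is just the default from the mmzone.h hunk. The loop bound is this sketch's own choice, not the kernel's.

```c
#include <assert.h>
#include <stddef.h>

/* Compile-time constant; 10 mirrors the default branch in mmzone.h. */
#define MIN_MAX_ORDER 10

/* Hypothetical stand-in for struct list_head. */
struct toy_list {
	struct toy_list *next, *prev;
};

/* The array dimension must be a constant expression, hence MIN_MAX_ORDER. */
struct toy_pool {
	struct toy_list free_area[MIN_MAX_ORDER + 1];
	unsigned short max_order;
};

static unsigned int toy_min(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/*
 * Clamp the requested order to the compile-time bound so that every
 * index used later stays inside the statically sized free_area array,
 * whatever order the caller asks for at run time.
 */
void toy_pool_init(struct toy_pool *pool, unsigned int wanted_order)
{
	unsigned int i;

	pool->max_order = (unsigned short)toy_min(MIN_MAX_ORDER, wanted_order);
	for (i = 0; i <= pool->max_order; i++) {
		/* An empty list points back at itself. */
		pool->free_area[i].next = &pool->free_area[i];
		pool->free_area[i].prev = &pool->free_area[i];
	}
}
```

With this shape, a caller that passes an order of 20 still ends up with max_order equal to 10, so no access goes past the eleventh slot.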

Signed-off-by: Zi Yan <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Quentin Perret <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
arch/arm64/kvm/hyp/include/nvhe/gfp.h | 2 +-
arch/arm64/kvm/hyp/nvhe/page_alloc.c | 2 +-
include/linux/mmzone.h | 3 +++
include/linux/slab.h | 8 ++++----
mm/slab.c | 2 +-
mm/slub.c | 6 +++---
6 files changed, 13 insertions(+), 10 deletions(-)

diff --git a/arch/arm64/kvm/hyp/include/nvhe/gfp.h b/arch/arm64/kvm/hyp/include/nvhe/gfp.h
index fe5472a184a3..29b92f68ab69 100644
--- a/arch/arm64/kvm/hyp/include/nvhe/gfp.h
+++ b/arch/arm64/kvm/hyp/include/nvhe/gfp.h
@@ -16,7 +16,7 @@ struct hyp_pool {
* API at EL2.
*/
hyp_spinlock_t lock;
- struct list_head free_area[MAX_ORDER + 1];
+ struct list_head free_area[MIN_MAX_ORDER + 1];
phys_addr_t range_start;
phys_addr_t range_end;
unsigned short max_order;
diff --git a/arch/arm64/kvm/hyp/nvhe/page_alloc.c b/arch/arm64/kvm/hyp/nvhe/page_alloc.c
index d40f0b30b534..7ebbac3e2e76 100644
--- a/arch/arm64/kvm/hyp/nvhe/page_alloc.c
+++ b/arch/arm64/kvm/hyp/nvhe/page_alloc.c
@@ -241,7 +241,7 @@ int hyp_pool_init(struct hyp_pool *pool, u64 pfn, unsigned int nr_pages,
int i;

hyp_spin_lock_init(&pool->lock);
- pool->max_order = min(MAX_ORDER, get_order((nr_pages + 1) << PAGE_SHIFT));
+ pool->max_order = min(MIN_MAX_ORDER, get_order((nr_pages + 1) << PAGE_SHIFT));
for (i = 0; i < pool->max_order; i++)
INIT_LIST_HEAD(&pool->free_area[i]);
pool->range_start = phys;
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 60d8cce2aed8..b5774e4c2700 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -26,10 +26,13 @@
/* Free memory management - zoned buddy allocator. */
#ifdef CONFIG_SET_MAX_ORDER
#define MAX_ORDER CONFIG_SET_MAX_ORDER
+#define MIN_MAX_ORDER CONFIG_SET_MAX_ORDER
#elif CONFIG_ARCH_FORCE_MAX_ORDER != 0
#define MAX_ORDER CONFIG_ARCH_FORCE_MAX_ORDER
+#define MIN_MAX_ORDER CONFIG_ARCH_FORCE_MAX_ORDER
#else
#define MAX_ORDER 10
+#define MIN_MAX_ORDER MAX_ORDER
#endif

#define MAX_ORDER_NR_PAGES (1 << MAX_ORDER)
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 568b5dfb3bd9..e34b2c9bda09 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -251,8 +251,8 @@ static inline unsigned int arch_slab_minalign(void)
* to do various tricks to work around compiler limitations in order to
* ensure proper constant folding.
*/
-#define KMALLOC_SHIFT_HIGH ((MAX_ORDER + PAGE_SHIFT) <= 25 ? \
- (MAX_ORDER + PAGE_SHIFT) : 25)
+#define KMALLOC_SHIFT_HIGH ((MIN_MAX_ORDER + PAGE_SHIFT) <= 25 ? \
+ (MIN_MAX_ORDER + PAGE_SHIFT) : 25)
#define KMALLOC_SHIFT_MAX KMALLOC_SHIFT_HIGH
#ifndef KMALLOC_SHIFT_LOW
#define KMALLOC_SHIFT_LOW 5
@@ -265,7 +265,7 @@ static inline unsigned int arch_slab_minalign(void)
* (PAGE_SIZE*2). Larger requests are passed to the page allocator.
*/
#define KMALLOC_SHIFT_HIGH (PAGE_SHIFT + 1)
-#define KMALLOC_SHIFT_MAX (MAX_ORDER + PAGE_SHIFT)
+#define KMALLOC_SHIFT_MAX (MIN_MAX_ORDER + PAGE_SHIFT)
#ifndef KMALLOC_SHIFT_LOW
#define KMALLOC_SHIFT_LOW 3
#endif
@@ -278,7 +278,7 @@ static inline unsigned int arch_slab_minalign(void)
* be allocated from the same page.
*/
#define KMALLOC_SHIFT_HIGH PAGE_SHIFT
-#define KMALLOC_SHIFT_MAX (MAX_ORDER + PAGE_SHIFT)
+#define KMALLOC_SHIFT_MAX (MIN_MAX_ORDER + PAGE_SHIFT)
#ifndef KMALLOC_SHIFT_LOW
#define KMALLOC_SHIFT_LOW 3
#endif
diff --git a/mm/slab.c b/mm/slab.c
index 530f418a4930..23798c32bb38 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -466,7 +466,7 @@ static int __init slab_max_order_setup(char *str)
{
get_option(&str, &slab_max_order);
slab_max_order = slab_max_order < 0 ? 0 :
- min(slab_max_order, MAX_ORDER);
+ min(slab_max_order, MIN_MAX_ORDER);
slab_max_order_set = true;

return 1;
diff --git a/mm/slub.c b/mm/slub.c
index 5acf5407cbc6..940fe48ea298 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3876,8 +3876,8 @@ static inline int calculate_order(unsigned int size)
/*
* Doh this slab cannot be placed using slub_max_order.
*/
- order = calc_slab_order(size, 1, MAX_ORDER, 1);
- if (order <= MAX_ORDER)
+ order = calc_slab_order(size, 1, MIN_MAX_ORDER, 1);
+ if (order <= MIN_MAX_ORDER)
return order;
return -ENOSYS;
}
@@ -4388,7 +4388,7 @@ __setup("slub_min_order=", setup_slub_min_order);
static int __init setup_slub_max_order(char *str)
{
get_option(&str, (int *)&slub_max_order);
- slub_max_order = min_t(unsigned int, slub_max_order, MAX_ORDER);
+ slub_max_order = min_t(unsigned int, slub_max_order, MIN_MAX_ORDER);

return 1;
}
--
2.35.1

2022-08-11 23:23:03

by Zi Yan

Subject: [RFC PATCH v2 12/12] mm: make MAX_ORDER a kernel boot time parameter.

From: Zi Yan <[email protected]>

With the new buddy_alloc_max_order boot parameter, users can specify a larger
MAX_ORDER than the one set by CONFIG_ARCH_FORCE_MAX_ORDER or
CONFIG_SET_MAX_ORDER. It can be set to any value >= CONFIG_ARCH_FORCE_MAX_ORDER
or CONFIG_SET_MAX_ORDER, but < 256 (limited by vmscan scan_control and the
per-cpu free page list).

Signed-off-by: Zi Yan <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
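Reader note (not part of the commit message): a minimal userspace sketch of
the parse-and-clamp logic in buddy_alloc_set() below, together with the
"2^buddy_alloc_max_order * PAGE_SIZE" formula from the documentation hunk.
It uses strtoul() instead of kstrtoul(), so the error handling is only an
approximation. The demo_* names, the floor of 10 and NR_PCP_ORDER_WIDTH == 8
are assumptions. Note that the check as posted accepts only values above the
floor and at most S8_MAX (127).

```c
#include <errno.h>
#include <stdlib.h>

#define DEMO_MIN_MAX_ORDER   10UL        /* assumed compile time floor */
#define DEMO_S8_MAX          127UL       /* s8 order field in scan_control */
#define DEMO_PCP_ORDER_LIMIT (1UL << 8)  /* assumes NR_PCP_ORDER_WIDTH == 8 */

/* Parse a decimal order; out-of-range values fall back to the floor. */
static int demo_parse_max_order(const char *val, int *out)
{
	char *end;
	unsigned long order;

	errno = 0;
	order = strtoul(val, &end, 10);
	if (errno || end == val || *end != '\0')
		return -EINVAL;

	if (order > DEMO_MIN_MAX_ORDER && order <= DEMO_S8_MAX &&
	    order <= DEMO_PCP_ORDER_LIMIT)
		*out = (int)order;
	else
		*out = (int)DEMO_MIN_MAX_ORDER;

	return 0;
}

/* Largest buddy page in bytes: 2^order * PAGE_SIZE. */
static unsigned long long demo_largest_page_bytes(int order, int page_shift)
{
	return 1ULL << (order + page_shift);
}
```

With 4KB base pages (page_shift 12), an order of 18 gives a largest page of
2^30 bytes, i.e. 1GB.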
.../admin-guide/kernel-parameters.txt | 5 +++
include/linux/mmzone.h | 8 +++++
mm/Kconfig | 13 +++++++
mm/page_alloc.c | 34 ++++++++++++++++++-
mm/vmscan.c | 1 -
5 files changed, 59 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index ec519225b671..0f71233ae396 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -494,6 +494,11 @@
bttv.pll= See Documentation/admin-guide/media/bttv.rst
bttv.tuner=

+ buddy_alloc_max_order= [KNL] This parameter adjusts the size of largest
+ pages that can be allocated from kernel buddy allocator. The largest
+ page size is 2^buddy_alloc_max_order * PAGE_SIZE.
+ Format: integer
+
bulk_remove=off [PPC] This parameter disables the use of the pSeries
firmware feature for flushing multiple hpte entries
at a time.
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index b5774e4c2700..90121d25d660 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -35,6 +35,14 @@
#define MIN_MAX_ORDER MAX_ORDER
#endif

+/* remap MAX_ORDER to buddy_alloc_max_order for boot time adjustment */
+#ifdef CONFIG_BOOT_TIME_MAX_ORDER
+/* Defined in mm/page_alloc.c */
+extern int buddy_alloc_max_order;
+#undef MAX_ORDER
+#define MAX_ORDER buddy_alloc_max_order
+#endif /* CONFIG_BOOT_TIME_MAX_ORDER */
+
#define MAX_ORDER_NR_PAGES (1 << MAX_ORDER)

/*
diff --git a/mm/Kconfig b/mm/Kconfig
index e558f5679707..acccb919d72d 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -455,6 +455,19 @@ config SET_MAX_ORDER
increase this value. A value of 10 means that the largest free memory
block is 2^10 pages.

+config BOOT_TIME_MAX_ORDER
+ bool "Set maximum order of buddy allocator at boot time"
+ depends on SPARSEMEM_VMEMMAP && (ARCH_FORCE_MAX_ORDER != 0 || SET_MAX_ORDER != 0)
+ help
+ It enables users to set the maximum order of buddy allocator at system
+ boot time instead of a static MACRO set at compilation time. Systems with
+ a lot of memory might want to allocate large pages whereas it is much
+ less feasible and desirable for systems with less memory. This option
+ allows different systems to control the largest page they want to
+ allocate. By default, MAX_ORDER will be set to ARCH_FORCE_MAX_ORDER or
+ SET_MAX_ORDER, whichever is non-zero, when the boot time parameter is not
+ set. The maximum of MAX_ORDER is currently limited at 256.
+
config HAVE_MEMBLOCK_PHYS_MAP
bool

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 941a94bb8cf0..4c4d68da1922 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1581,7 +1581,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,

order = pindex_to_order(pindex);
nr_pages = 1 << order;
- BUILD_BUG_ON(MAX_ORDER >= (1<<NR_PCP_ORDER_WIDTH));
+ BUILD_BUG_ON(MIN_MAX_ORDER >= (1<<NR_PCP_ORDER_WIDTH));
do {
int mt;

@@ -9679,3 +9679,35 @@ bool has_managed_dma(void)
return false;
}
#endif /* CONFIG_ZONE_DMA */
+
+#ifdef CONFIG_BOOT_TIME_MAX_ORDER
+int buddy_alloc_max_order = MIN_MAX_ORDER;
+EXPORT_SYMBOL(buddy_alloc_max_order);
+
+static int __init buddy_alloc_set(char *val)
+{
+ int ret;
+ unsigned long max_order;
+
+ ret = kstrtoul(val, 10, &max_order);
+
+ if (ret < 0)
+ return -EINVAL;
+
+ /*
+ * max_order is also limited at below locations:
+ * 1. scan_control in mm/vmscan.c uses s8 field for order, max_order cannot
+ * be bigger than S8_MAX before the field is changed.
+ * 2. free_pcppages_bulk has max_order upper limit.
+ */
+ if (max_order > MIN_MAX_ORDER && max_order <= S8_MAX &&
+ max_order <= (1<<NR_PCP_ORDER_WIDTH))
+ buddy_alloc_max_order = max_order;
+ else
+ buddy_alloc_max_order = MIN_MAX_ORDER;
+
+ return 0;
+}
+
+early_param("buddy_alloc_max_order", buddy_alloc_set);
+#endif /* CONFIG_BOOT_TIME_MAX_ORDER */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 06eeeae038dd..9d4fde8705d9 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3816,7 +3816,6 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
* scan_control uses s8 fields for order, priority, and reclaim_idx.
* Confirm they are large enough for max values.
*/
- BUILD_BUG_ON(MAX_ORDER > S8_MAX);
BUILD_BUG_ON(DEF_PRIORITY > S8_MAX);
BUILD_BUG_ON(MAX_NR_ZONES > S8_MAX);

--
2.35.1

2022-08-11 23:30:11

by Zi Yan

Subject: [RFC PATCH v2 07/12] virtio: virtio_balloon: use pageblock_order instead of MAX_ORDER

From: Zi Yan <[email protected]>

virtio_balloon uses MAX_ORDER as the order of the free page blocks it reports
to the host. Once MAX_ORDER becomes modifiable in later commits, the reported
free size might be too big. pageblock_order is either 1/2 of or the same as MAX_ORDER
currently. Use pageblock_order instead to make virtio_balloon have a
constant free page block report size when MAX_ORDER is changed in the later
commits.

Signed-off-by: Zi Yan <[email protected]>
Cc: "Michael S. Tsirkin" <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Jason Wang <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
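Reader note (not part of the commit message): a small userspace sketch of how
the hint block size, 1 << (order + PAGE_SHIFT), grows with the order. The
function name is hypothetical, and the orders used in the usage note
(pageblock_order 9, MAX_ORDER 10 or 20, PAGE_SHIFT 12) are assumptions for an
x86_64-like configuration.

```c
#include <stdint.h>

/* Size in bytes of a free page block of the given order. */
static uint64_t demo_hint_block_bytes(unsigned int order,
				      unsigned int page_shift)
{
	/* A 64-bit shift avoids overflow once order + page_shift reaches 31. */
	return UINT64_C(1) << (order + page_shift);
}
```

Under these assumptions, order 9 gives 2MB blocks and order 10 gives 4MB
blocks, while order 20 would give 4GB blocks, which no longer fits in a
32-bit value.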
drivers/virtio/virtio_balloon.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 5b15936a5214..51447737538b 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -33,7 +33,7 @@
#define VIRTIO_BALLOON_FREE_PAGE_ALLOC_FLAG (__GFP_NORETRY | __GFP_NOWARN | \
__GFP_NOMEMALLOC)
/* The order of free page blocks to report to host */
-#define VIRTIO_BALLOON_HINT_BLOCK_ORDER MAX_ORDER
+#define VIRTIO_BALLOON_HINT_BLOCK_ORDER pageblock_order
/* The size of a free page block in bytes */
#define VIRTIO_BALLOON_HINT_BLOCK_BYTES \
(1 << (VIRTIO_BALLOON_HINT_BLOCK_ORDER + PAGE_SHIFT))
--
2.35.1

2022-08-11 23:45:33

by Zi Yan

Subject: [RFC PATCH v2 03/12] mm: replace MAX_ORDER when it is used to indicate max physical contiguity.

From: Zi Yan <[email protected]>

MAX_ORDER is limited by the memory section size, and is thus widely used as
a variable to indicate the maximum physically contiguous page size. But this
limitation is no longer necessary as the kernel only supports the sparse memory
model. Add a new variable MAX_PHYS_CONTIG_ORDER to replace such uses of
MAX_ORDER.

Signed-off-by: Zi Yan <[email protected]>
---
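Reader note (not part of the commit message): a userspace sketch of the
SPARSEMEM definition added below, MAX_PHYS_CONTIG_ORDER =
min(PFN_SECTION_SHIFT, MAX_ORDER), and of the kind of check mem_map_offset()
performs with it. The demo_* names are hypothetical; SECTION_SIZE_BITS 27 and
PAGE_SHIFT 12 are assumed x86_64-like values.

```c
#define DEMO_PAGE_SHIFT        12U
#define DEMO_SECTION_SIZE_BITS 27U  /* assumed value */
#define DEMO_PFN_SECTION_SHIFT (DEMO_SECTION_SIZE_BITS - DEMO_PAGE_SHIFT)

/* Largest order whose pages are guaranteed physically contiguous. */
static unsigned int demo_max_phys_contig_order(unsigned int max_order)
{
	return max_order < DEMO_PFN_SECTION_SHIFT ?
	       max_order : DEMO_PFN_SECTION_SHIFT;
}

static unsigned long demo_max_phys_contig_nr_pages(unsigned int max_order)
{
	return 1UL << demo_max_phys_contig_order(max_order);
}

/*
 * Nonzero when "base + offset" pointer arithmetic on struct page would be
 * taken as safe; otherwise a PFN-based lookup would be used instead.
 */
static int demo_can_use_pointer_math(unsigned long offset,
				     unsigned int max_order)
{
	return offset < demo_max_phys_contig_nr_pages(max_order);
}
```

With these values a MAX_ORDER of 10 is left unchanged, while a MAX_ORDER of
20 is capped at 15, i.e. one 128MB section.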
Documentation/admin-guide/kernel-parameters.txt | 2 +-
arch/sparc/mm/tsb.c | 4 ++--
arch/um/kernel/um_arch.c | 4 ++--
include/linux/pageblock-flags.h | 12 ++++++++++++
kernel/dma/pool.c | 8 ++++----
mm/hugetlb.c | 2 +-
mm/internal.h | 8 ++++----
mm/memory.c | 4 ++--
mm/memory_hotplug.c | 6 +++---
mm/page_isolation.c | 2 +-
mm/page_reporting.c | 4 ++--
11 files changed, 34 insertions(+), 22 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index ff33971e1630..ec519225b671 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3899,7 +3899,7 @@
[KNL] Minimal page reporting order
Format: <integer>
Adjust the minimal page reporting order. The page
- reporting is disabled when it exceeds MAX_ORDER.
+ reporting is disabled when it exceeds MAX_PHYS_CONTIG_ORDER.

panic= [KNL] Kernel behaviour on panic: delay <timeout>
timeout > 0: seconds before rebooting
diff --git a/arch/sparc/mm/tsb.c b/arch/sparc/mm/tsb.c
index 912205787161..15c31d050dab 100644
--- a/arch/sparc/mm/tsb.c
+++ b/arch/sparc/mm/tsb.c
@@ -402,8 +402,8 @@ void tsb_grow(struct mm_struct *mm, unsigned long tsb_index, unsigned long rss)
unsigned long new_rss_limit;
gfp_t gfp_flags;

- if (max_tsb_size > (PAGE_SIZE << MAX_ORDER))
- max_tsb_size = (PAGE_SIZE << MAX_ORDER);
+ if (max_tsb_size > (PAGE_SIZE << MAX_PHYS_CONTIG_ORDER))
+ max_tsb_size = (PAGE_SIZE << MAX_PHYS_CONTIG_ORDER);

new_cache_index = 0;
for (new_size = 8192; new_size < max_tsb_size; new_size <<= 1UL) {
diff --git a/arch/um/kernel/um_arch.c b/arch/um/kernel/um_arch.c
index e0de60e503b9..52a474f4f1c7 100644
--- a/arch/um/kernel/um_arch.c
+++ b/arch/um/kernel/um_arch.c
@@ -368,10 +368,10 @@ int __init linux_main(int argc, char **argv)
max_physmem = TASK_SIZE - uml_physmem - iomem_size - MIN_VMALLOC;

/*
- * Zones have to begin on a 1 << MAX_ORDER page boundary,
+ * Zones have to begin on a 1 << MAX_PHYS_CONTIG_ORDER page boundary,
* so this makes sure that's true for highmem
*/
- max_physmem &= ~((1 << (PAGE_SHIFT + MAX_ORDER)) - 1);
+ max_physmem &= ~((1 << (PAGE_SHIFT + MAX_PHYS_CONTIG_ORDER)) - 1);
if (physmem_size + iomem_size > max_physmem) {
highmem = physmem_size + iomem_size - max_physmem;
physmem_size -= highmem;
diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h
index 940efcffd374..358b871b07ca 100644
--- a/include/linux/pageblock-flags.h
+++ b/include/linux/pageblock-flags.h
@@ -54,6 +54,18 @@ extern unsigned int pageblock_order;

#define pageblock_nr_pages (1UL << pageblock_order)

+/*
+ * memory section is only defined in sparsemem and in flatmem, pages are always
+ * physically contiguous, but we use MAX_ORDER since all users assume so.
+ */
+#ifdef CONFIG_FLATMEM
+#define MAX_PHYS_CONTIG_ORDER MAX_ORDER
+#else /* SPARSEMEM */
+#define MAX_PHYS_CONTIG_ORDER (min(PFN_SECTION_SHIFT, MAX_ORDER))
+#endif /* CONFIG_FLATMEM */
+
+#define MAX_PHYS_CONTIG_NR_PAGES (1UL << MAX_PHYS_CONTIG_ORDER)
+
/* Forward declaration */
struct page;

diff --git a/kernel/dma/pool.c b/kernel/dma/pool.c
index e20f168a34c7..b10f1dd52871 100644
--- a/kernel/dma/pool.c
+++ b/kernel/dma/pool.c
@@ -84,8 +84,8 @@ static int atomic_pool_expand(struct gen_pool *pool, size_t pool_size,
void *addr;
int ret = -ENOMEM;

- /* Cannot allocate larger than MAX_ORDER */
- order = min(get_order(pool_size), MAX_ORDER);
+ /* Cannot allocate larger than MAX_PHYS_CONTIG_ORDER */
+ order = min(get_order(pool_size), MAX_PHYS_CONTIG_ORDER);

do {
pool_size = 1 << (PAGE_SHIFT + order);
@@ -190,11 +190,11 @@ static int __init dma_atomic_pool_init(void)

/*
* If coherent_pool was not used on the command line, default the pool
- * sizes to 128KB per 1GB of memory, min 128KB, max MAX_ORDER.
+ * sizes to 128KB per 1GB of memory, min 128KB, max MAX_PHYS_CONTIG_ORDER.
*/
if (!atomic_pool_size) {
unsigned long pages = totalram_pages() / (SZ_1G / SZ_128K);
- pages = min_t(unsigned long, pages, MAX_ORDER_NR_PAGES);
+ pages = min_t(unsigned long, pages, MAX_PHYS_CONTIG_NR_PAGES);
atomic_pool_size = max_t(size_t, pages << PAGE_SHIFT, SZ_128K);
}
INIT_WORK(&atomic_pool_work, atomic_pool_work_fn);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 15ff582687a3..36eedeed1b22 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1903,7 +1903,7 @@ pgoff_t hugetlb_basepage_index(struct page *page)
pgoff_t index = page_index(page_head);
unsigned long compound_idx;

- if (compound_order(page_head) > MAX_ORDER)
+ if (compound_order(page_head) > MAX_PHYS_CONTIG_ORDER)
compound_idx = page_to_pfn(page) - page_to_pfn(page_head);
else
compound_idx = page - page_head;
diff --git a/mm/internal.h b/mm/internal.h
index 4df67b6b8cce..1433e3a6fdd0 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -302,7 +302,7 @@ static inline bool page_is_buddy(struct page *page, struct page *buddy,
* satisfies the following equation:
* P = B & ~(1 << O)
*
- * Assumption: *_mem_map is contiguous at least up to MAX_ORDER
+ * Assumption: *_mem_map is contiguous at least up to MAX_PHYS_CONTIG_ORDER
*/
static inline unsigned long
__find_buddy_pfn(unsigned long page_pfn, unsigned int order)
@@ -642,11 +642,11 @@ static inline void vunmap_range_noflush(unsigned long start, unsigned long end)
/*
* Return the mem_map entry representing the 'offset' subpage within
* the maximally aligned gigantic page 'base'. Handle any discontiguity
- * in the mem_map at MAX_ORDER_NR_PAGES boundaries.
+ * in the mem_map at MAX_PHYS_CONTIG_NR_PAGES boundaries.
*/
static inline struct page *mem_map_offset(struct page *base, int offset)
{
- if (unlikely(offset >= MAX_ORDER_NR_PAGES))
+ if (unlikely(offset >= MAX_PHYS_CONTIG_NR_PAGES))
return nth_page(base, offset);
return base + offset;
}
@@ -658,7 +658,7 @@ static inline struct page *mem_map_offset(struct page *base, int offset)
static inline struct page *mem_map_next(struct page *iter,
struct page *base, int offset)
{
- if (unlikely((offset & (MAX_ORDER_NR_PAGES - 1)) == 0)) {
+ if (unlikely((offset & (MAX_PHYS_CONTIG_NR_PAGES - 1)) == 0)) {
unsigned long pfn = page_to_pfn(base) + offset;
if (!pfn_valid(pfn))
return NULL;
diff --git a/mm/memory.c b/mm/memory.c
index bd8e7e79be99..3b82945aaa3d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5660,7 +5660,7 @@ void clear_huge_page(struct page *page,
unsigned long addr = addr_hint &
~(((unsigned long)pages_per_huge_page << PAGE_SHIFT) - 1);

- if (unlikely(pages_per_huge_page > MAX_ORDER_NR_PAGES)) {
+ if (unlikely(pages_per_huge_page > MAX_PHYS_CONTIG_NR_PAGES)) {
clear_gigantic_page(page, addr, pages_per_huge_page);
return;
}
@@ -5713,7 +5713,7 @@ void copy_user_huge_page(struct page *dst, struct page *src,
.vma = vma,
};

- if (unlikely(pages_per_huge_page > MAX_ORDER_NR_PAGES)) {
+ if (unlikely(pages_per_huge_page > MAX_PHYS_CONTIG_NR_PAGES)) {
copy_user_gigantic_page(dst, src, addr, vma,
pages_per_huge_page);
return;
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 5540499007ae..8930823e5067 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -596,16 +596,16 @@ static void online_pages_range(unsigned long start_pfn, unsigned long nr_pages)
unsigned long pfn;

/*
- * Online the pages in MAX_ORDER aligned chunks. The callback might
+ * Online the pages in MAX_PHYS_CONTIG_ORDER aligned chunks. The callback might
* decide to not expose all pages to the buddy (e.g., expose them
* later). We account all pages as being online and belonging to this
* zone ("present").
* When using memmap_on_memory, the range might not be aligned to
- * MAX_ORDER_NR_PAGES - 1, but pageblock aligned. __ffs() will detect
+ * MAX_PHYS_CONTIG_NR_PAGES - 1, but pageblock aligned. __ffs() will detect
* this and the first chunk to online will be pageblock_nr_pages.
*/
for (pfn = start_pfn; pfn < end_pfn;) {
- int order = min_t(unsigned long, MAX_ORDER, __ffs(pfn));
+ int order = min_t(unsigned long, MAX_PHYS_CONTIG_ORDER, __ffs(pfn));

(*online_page_callback)(pfn_to_page(pfn), order);
pfn += (1UL << order);
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index 8d33120a81b2..801835f91c44 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -226,7 +226,7 @@ static void unset_migratetype_isolate(struct page *page, int migratetype)
*/
if (PageBuddy(page)) {
order = buddy_order(page);
- if (order >= pageblock_order && order <= MAX_ORDER) {
+ if (order >= pageblock_order && order <= MAX_PHYS_CONTIG_ORDER) {
buddy = find_buddy_page_pfn(page, page_to_pfn(page),
order, NULL);
if (buddy && !is_migrate_isolate_page(buddy)) {
diff --git a/mm/page_reporting.c b/mm/page_reporting.c
index d52a55bca6d5..b48d6ad82998 100644
--- a/mm/page_reporting.c
+++ b/mm/page_reporting.c
@@ -11,7 +11,7 @@
#include "page_reporting.h"
#include "internal.h"

-unsigned int page_reporting_order = MAX_ORDER + 1;
+unsigned int page_reporting_order = MAX_PHYS_CONTIG_ORDER + 1;
module_param(page_reporting_order, uint, 0644);
MODULE_PARM_DESC(page_reporting_order, "Set page reporting order");

@@ -244,7 +244,7 @@ page_reporting_process_zone(struct page_reporting_dev_info *prdev,
return err;

/* Process each free list starting from lowest order/mt */
- for (order = page_reporting_order; order <= MAX_ORDER; order++) {
+ for (order = page_reporting_order; order <= MAX_PHYS_CONTIG_ORDER; order++) {
for (mt = 0; mt < MIGRATE_TYPES; mt++) {
/* We do not pull pages from the isolate free list */
if (is_migrate_isolate(mt))
--
2.35.1

2022-08-11 23:46:37

by Zi Yan

Subject: [RFC PATCH v2 09/12] mm: Make MAX_ORDER of buddy allocator configurable via Kconfig SET_MAX_ORDER.

From: Zi Yan <[email protected]>

With SPARSEMEM_VMEMMAP, all struct page are virtually contiguous,
thus the kernel can manipulate arbitrarily large pages. By checking
PFN validity during buddy page merging process, all free pages in buddy
allocator's free area have their PFNs contiguous even if the system has
several not physically contiguous memory sections. With these two
conditions, it is OK to remove the restriction of
MAX_ORDER + PAGE_SHIFT < SECTION_SIZE_BITS and change MAX_ORDER freely.

Add SET_MAX_ORDER to allow MAX_ORDER adjustment when arch does not set
its own MAX_ORDER via ARCH_FORCE_MAX_ORDER. Make it depend
on SPARSEMEM_VMEMMAP, when MAX_ORDER is not limited by SECTION_SIZE_BITS.

Signed-off-by: Zi Yan <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
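Reader note (not part of the commit message): a userspace sketch of the idea
of checking PFN validity before merging buddies. The buddy PFN is found by
flipping bit "order" of the PFN; the validity check here is a hypothetical
single-range model, not the kernel's pfn_valid(), and all demo_* names are
assumptions.

```c
#include <stdbool.h>

/* Hypothetical model: one valid PFN range, [start_pfn, end_pfn). */
struct demo_range {
	unsigned long start_pfn;
	unsigned long end_pfn;
};

/* The buddy of a block differs from it only in bit "order" of the PFN. */
static unsigned long demo_find_buddy_pfn(unsigned long pfn, unsigned int order)
{
	return pfn ^ (1UL << order);
}

static bool demo_pfn_valid(const struct demo_range *r, unsigned long pfn)
{
	return pfn >= r->start_pfn && pfn < r->end_pfn;
}

/* Merge only if the whole buddy block lies inside valid memory. */
static bool demo_can_merge(const struct demo_range *r, unsigned long pfn,
			   unsigned int order)
{
	unsigned long buddy = demo_find_buddy_pfn(pfn, order);

	return demo_pfn_valid(r, buddy) &&
	       demo_pfn_valid(r, buddy + (1UL << order) - 1);
}
```

In this model a block whose buddy would fall past the end of valid memory is
simply not merged, so every block on a free list stays PFN-contiguous.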
arch/Kconfig | 4 ++++
include/linux/mmzone.h | 17 ++++++++++++++---
mm/Kconfig | 14 ++++++++++++++
3 files changed, 32 insertions(+), 3 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index f330410da63a..24baee6c3feb 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -11,6 +11,10 @@ source "arch/$(SRCARCH)/Kconfig"

menu "General architecture-dependent options"

+config ARCH_FORCE_MAX_ORDER
+ int
+ default "0"
+
config CRASH_CORE
bool

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index e93faa3d7f1d..b83b481e250b 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -24,11 +24,14 @@
#include <asm/page.h>

/* Free memory management - zoned buddy allocator. */
-#ifndef CONFIG_ARCH_FORCE_MAX_ORDER
-#define MAX_ORDER 10
-#else
+#ifdef CONFIG_SET_MAX_ORDER
+#define MAX_ORDER CONFIG_SET_MAX_ORDER
+#elif CONFIG_ARCH_FORCE_MAX_ORDER != 0
#define MAX_ORDER CONFIG_ARCH_FORCE_MAX_ORDER
+#else
+#define MAX_ORDER 10
#endif
+
#define MAX_ORDER_NR_PAGES (1 << MAX_ORDER)

/*
@@ -1379,9 +1382,17 @@ static inline bool movable_only_nodes(nodemask_t *nodes)
#define SECTION_BLOCKFLAGS_BITS \
((1UL << (PFN_SECTION_SHIFT - pageblock_order)) * NR_PAGEBLOCK_BITS)

+/*
+ * The MAX_ORDER check is not necessary when CONFIG_SET_MAX_ORDER is set, since
+ * it depends on CONFIG_SPARSEMEM_VMEMMAP, where all struct page are virtually
+ * contiguous, thus > section size pages can be allocated and manipulated
+ * without worrying about non-contiguous struct page.
+ */
+#ifndef CONFIG_SET_MAX_ORDER
#if (MAX_ORDER + PAGE_SHIFT) > SECTION_SIZE_BITS
#error Allocator MAX_ORDER exceeds SECTION_SIZE
#endif
+#endif /* CONFIG_SET_MAX_ORDER*/

static inline unsigned long pfn_to_section_nr(unsigned long pfn)
{
diff --git a/mm/Kconfig b/mm/Kconfig
index bbe31e85afee..e558f5679707 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -441,6 +441,20 @@ config SPARSEMEM_VMEMMAP
pfn_to_page and page_to_pfn operations. This is the most
efficient option when sufficient kernel resources are available.

+config SET_MAX_ORDER
+ int "Set maximum order of buddy allocator"
+ depends on SPARSEMEM_VMEMMAP && (ARCH_FORCE_MAX_ORDER = 0)
+ range 10 255
+ default "10"
+ help
+ The kernel memory allocator divides physically contiguous memory
+ blocks into "zones", where each zone is a power of two number of
+ pages. This option selects the largest power of two that the kernel
+ keeps in the memory allocator. If you need to allocate very large
+ blocks of physically contiguous memory, then you may need to
+ increase this value. A value of 10 means that the largest free memory
+ block is 2^10 pages.
+
config HAVE_MEMBLOCK_PHYS_MAP
bool

--
2.35.1

2022-08-11 23:48:26

by Zi Yan

Subject: [RFC PATCH v2 08/12] mm/page_reporting: set page_reporting_order to -1 to prevent it running

From: Zi Yan <[email protected]>

page_reporting_order was initialized to MAX_ORDER to prevent it running
before its value is overwritten. Use -1 instead to remove the
dependency on MAX_ORDER.

Signed-off-by: Zi Yan <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
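Reader note (not part of the commit message): a userspace sketch of why
(unsigned int)-1 works as a "do not run" value for a loop of the form
"for (order = page_reporting_order; order <= max; order++)", whatever the
maximum order later becomes. The function name is hypothetical and the loop
only models the one in page_reporting_process_zone().

```c
#include <limits.h>

/* Count how many orders the reporting loop would visit. */
static unsigned int demo_orders_scanned(unsigned int reporting_order,
					unsigned int max_order)
{
	unsigned int scanned = 0;

	/* Assumes max_order < UINT_MAX so the loop terminates. */
	for (unsigned int order = reporting_order; order <= max_order; order++)
		scanned++;

	return scanned;
}
```

A start value of UINT_MAX never enters the loop, whereas a start value of 11
(derived from a compile time maximum of 10) would visit ten orders if the
maximum were later raised to 20.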
mm/page_reporting.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/mm/page_reporting.c b/mm/page_reporting.c
index b48d6ad82998..001438f3dbeb 100644
--- a/mm/page_reporting.c
+++ b/mm/page_reporting.c
@@ -11,7 +11,11 @@
#include "page_reporting.h"
#include "internal.h"

-unsigned int page_reporting_order = MAX_PHYS_CONTIG_ORDER + 1;
+/*
+ * Set page_reporting_order to (unsigned int)-1 to prevent it running until the
+ * value is being overwritten
+ */
+unsigned int page_reporting_order = (unsigned int)-1;
module_param(page_reporting_order, uint, 0644);
MODULE_PARM_DESC(page_reporting_order, "Set page reporting order");

--
2.35.1

2022-08-11 23:50:50

by Zi Yan

Subject: [RFC PATCH v2 05/12] mm: prevent pageblock size being larger than section size.

From: Zi Yan <[email protected]>

Only physical pages from a section can be guaranteed to be contiguous
and so far a pageblock can only group contiguous physical pages by
design. Set pageblock_order properly to prevent pageblock going beyond
section size.

Signed-off-by: Zi Yan <[email protected]>
Cc: Wei Yang <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: "Matthew Wilcox (Oracle)" <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
include/linux/pageblock-flags.h | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h
index 358b871b07ca..2679b2b4c079 100644
--- a/include/linux/pageblock-flags.h
+++ b/include/linux/pageblock-flags.h
@@ -47,8 +47,11 @@ extern unsigned int pageblock_order;

#else /* CONFIG_HUGETLB_PAGE */

-/* If huge pages are not used, group by MAX_ORDER_NR_PAGES */
-#define pageblock_order MAX_ORDER
+/*
+ * If huge pages are not used, group by MAX_ORDER_NR_PAGES or
+ * PAGES_PER_SECTION when MAX_ORDER_NR_PAGES is larger.
+ */
+#define pageblock_order (min(PFN_SECTION_SHIFT, MAX_ORDER))

#endif /* CONFIG_HUGETLB_PAGE */

--
2.35.1

2022-08-11 23:50:58

by Zi Yan

Subject: [RFC PATCH v2 10/12] mm: convert MAX_ORDER sized static arrays to dynamic ones.

From: Zi Yan <[email protected]>

This prepares for the upcoming changes to make MAX_ORDER a boot time
parameter instead of a compile time constant. All static arrays with
MAX_ORDER size are converted to pointers and their memory is allocated
at runtime.

The free_area array in struct zone is allocated using memblock_alloc_node()
at boot time and using kzalloc() when memory is hot-added.

Signed-off-by: Zi Yan <[email protected]>
Cc: Dave Young <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Christian Koenig <[email protected]>
Cc: David Airlie <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
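Reader note (not part of the commit message): a userspace sketch of the
general pattern used in this patch, replacing a "[MAX_ORDER + 1]" array member
with a pointer that is allocated at init time and unwound on failure. It uses
calloc()/free() in place of the kernel allocators; the demo_* names and the
number of caching types are assumptions. The unwind loop uses a signed index
so that it can count down past zero safely.

```c
#include <stdlib.h>

#define DEMO_NUM_CACHING_TYPES 3  /* hypothetical */

struct demo_pool_type {
	unsigned int order;
};

struct demo_pool {
	/* Was: struct demo_pool_type orders[MAX_ORDER + 1] per caching type. */
	struct demo_pool_type *orders[DEMO_NUM_CACHING_TYPES];
};

/* Returns 0 on success, -1 if any allocation fails. */
static int demo_pool_init(struct demo_pool *pool, unsigned int max_order)
{
	int i;

	for (i = 0; i < DEMO_NUM_CACHING_TYPES; i++) {
		pool->orders[i] = calloc((size_t)max_order + 1,
					 sizeof(*pool->orders[i]));
		if (!pool->orders[i])
			goto failed;
		for (unsigned int j = 0; j <= max_order; j++)
			pool->orders[i][j].order = j;
	}
	return 0;

failed:
	/* Free only the arrays that were successfully allocated. */
	while (--i >= 0) {
		free(pool->orders[i]);
		pool->orders[i] = NULL;
	}
	return -1;
}

static void demo_pool_fini(struct demo_pool *pool)
{
	for (int i = 0; i < DEMO_NUM_CACHING_TYPES; i++) {
		free(pool->orders[i]);
		pool->orders[i] = NULL;
	}
}
```

The trade-off is that initialization can now fail, which is why callers have
to start checking a return value.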
.../admin-guide/kdump/vmcoreinfo.rst | 2 +-
drivers/gpu/drm/ttm/ttm_device.c | 7 ++-
drivers/gpu/drm/ttm/ttm_pool.c | 58 +++++++++++++++++--
include/drm/ttm/ttm_pool.h | 4 +-
include/linux/mmzone.h | 2 +-
mm/page_alloc.c | 32 ++++++++--
6 files changed, 87 insertions(+), 18 deletions(-)

diff --git a/Documentation/admin-guide/kdump/vmcoreinfo.rst b/Documentation/admin-guide/kdump/vmcoreinfo.rst
index c572b5230fe0..a775462aa7c7 100644
--- a/Documentation/admin-guide/kdump/vmcoreinfo.rst
+++ b/Documentation/admin-guide/kdump/vmcoreinfo.rst
@@ -172,7 +172,7 @@ variables.
Offset of the free_list's member. This value is used to compute the number
of free pages.

-Each zone has a free_area structure array called free_area[MAX_ORDER + 1].
+Each zone has a free_area structure array called free_area with length of MAX_ORDER + 1.
The free_list represents a linked list of free page blocks.

(list_head, next|prev)
diff --git a/drivers/gpu/drm/ttm/ttm_device.c b/drivers/gpu/drm/ttm/ttm_device.c
index e7147e304637..442a77bb5b4f 100644
--- a/drivers/gpu/drm/ttm/ttm_device.c
+++ b/drivers/gpu/drm/ttm/ttm_device.c
@@ -92,7 +92,9 @@ static int ttm_global_init(void)
>> PAGE_SHIFT;
num_dma32 = min(num_dma32, 2UL << (30 - PAGE_SHIFT));

- ttm_pool_mgr_init(num_pages);
+ ret = ttm_pool_mgr_init(num_pages);
+ if (ret)
+ goto out;
ttm_tt_mgr_init(num_pages, num_dma32);

glob->dummy_read_page = alloc_page(__GFP_ZERO | GFP_DMA32);
@@ -218,7 +220,8 @@ int ttm_device_init(struct ttm_device *bdev, struct ttm_device_funcs *funcs,
bdev->funcs = funcs;

ttm_sys_man_init(bdev);
- ttm_pool_init(&bdev->pool, dev, use_dma_alloc, use_dma32);
+ if (ttm_pool_init(&bdev->pool, dev, use_dma_alloc, use_dma32))
+ return -ENOMEM;

bdev->vma_manager = vma_manager;
INIT_DELAYED_WORK(&bdev->wq, ttm_device_delayed_workqueue);
diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
index 85d19f425af6..d76f7d476421 100644
--- a/drivers/gpu/drm/ttm/ttm_pool.c
+++ b/drivers/gpu/drm/ttm/ttm_pool.c
@@ -64,11 +64,11 @@ module_param(page_pool_size, ulong, 0644);

static atomic_long_t allocated_pages;

-static struct ttm_pool_type global_write_combined[MAX_ORDER + 1];
-static struct ttm_pool_type global_uncached[MAX_ORDER + 1];
+static struct ttm_pool_type *global_write_combined;
+static struct ttm_pool_type *global_uncached;

-static struct ttm_pool_type global_dma32_write_combined[MAX_ORDER + 1];
-static struct ttm_pool_type global_dma32_uncached[MAX_ORDER + 1];
+static struct ttm_pool_type *global_dma32_write_combined;
+static struct ttm_pool_type *global_dma32_uncached;

static spinlock_t shrinker_lock;
static struct list_head shrinker_list;
@@ -493,8 +493,10 @@ EXPORT_SYMBOL(ttm_pool_free);
* @use_dma32: true if GFP_DMA32 should be used
*
* Initialize the pool and its pool types.
+ *
+ * Returns: 0 on success, negative error code otherwise
*/
-void ttm_pool_init(struct ttm_pool *pool, struct device *dev,
+int ttm_pool_init(struct ttm_pool *pool, struct device *dev,
bool use_dma_alloc, bool use_dma32)
{
unsigned int i, j;
@@ -506,11 +508,30 @@ void ttm_pool_init(struct ttm_pool *pool, struct device *dev,
pool->use_dma32 = use_dma32;

if (use_dma_alloc) {
- for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i)
+ for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i) {
+ pool->caching[i].orders =
+ kvcalloc(MAX_ORDER + 1, sizeof(struct ttm_pool_type),
+ GFP_KERNEL);
+ if (!pool->caching[i].orders) {
+ i--;
+ goto failed;
+ }
for (j = 0; j <= MAX_ORDER; ++j)
ttm_pool_type_init(&pool->caching[i].orders[j],
pool, i, j);
+
+ }
+ return 0;
+
+failed:
+ for (; i >= 0; i--) {
+ for (j = 0; j <= MAX_ORDER; ++j)
+ ttm_pool_type_fini(&pool->caching[i].orders[j]);
+ kvfree(pool->caching[i].orders);
+ }
+ return -ENOMEM;
}
+ return 0;
}

/**
@@ -701,6 +722,31 @@ int ttm_pool_mgr_init(unsigned long num_pages)
spin_lock_init(&shrinker_lock);
INIT_LIST_HEAD(&shrinker_list);

+ if (!global_write_combined) {
+ global_write_combined = kvcalloc(MAX_ORDER + 1, sizeof(struct ttm_pool_type),
+ GFP_KERNEL);
+ if (!global_write_combined)
+ return -ENOMEM;
+ }
+ if (!global_uncached) {
+ global_uncached = kvcalloc(MAX_ORDER + 1, sizeof(struct ttm_pool_type),
+ GFP_KERNEL);
+ if (!global_uncached)
+ return -ENOMEM;
+ }
+ if (!global_dma32_write_combined) {
+ global_dma32_write_combined = kvcalloc(MAX_ORDER + 1, sizeof(struct ttm_pool_type),
+ GFP_KERNEL);
+ if (!global_dma32_write_combined)
+ return -ENOMEM;
+ }
+ if (!global_dma32_uncached) {
+ global_dma32_uncached = kvcalloc(MAX_ORDER + 1, sizeof(struct ttm_pool_type),
+ GFP_KERNEL);
+ if (!global_dma32_uncached)
+ return -ENOMEM;
+ }
+
for (i = 0; i <= MAX_ORDER; ++i) {
ttm_pool_type_init(&global_write_combined[i], NULL,
ttm_write_combined, i);
diff --git a/include/drm/ttm/ttm_pool.h b/include/drm/ttm/ttm_pool.h
index 8ce14f9d202a..f5ce60f629ae 100644
--- a/include/drm/ttm/ttm_pool.h
+++ b/include/drm/ttm/ttm_pool.h
@@ -72,7 +72,7 @@ struct ttm_pool {
bool use_dma32;

struct {
- struct ttm_pool_type orders[MAX_ORDER + 1];
+ struct ttm_pool_type *orders;
} caching[TTM_NUM_CACHING_TYPES];
};

@@ -80,7 +80,7 @@ int ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
struct ttm_operation_ctx *ctx);
void ttm_pool_free(struct ttm_pool *pool, struct ttm_tt *tt);

-void ttm_pool_init(struct ttm_pool *pool, struct device *dev,
+int ttm_pool_init(struct ttm_pool *pool, struct device *dev,
bool use_dma_alloc, bool use_dma32);
void ttm_pool_fini(struct ttm_pool *pool);

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index b83b481e250b..60d8cce2aed8 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -635,7 +635,7 @@ struct zone {
ZONE_PADDING(_pad1_)

/* free areas of different sizes */
- struct free_area free_area[MAX_ORDER + 1];
+ struct free_area *free_area;

/* zone flags, see below */
unsigned long flags;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3f3af7cd5164..941a94bb8cf0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6195,11 +6195,21 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)

for_each_populated_zone(zone) {
unsigned int order;
- unsigned long nr[MAX_ORDER + 1], flags, total = 0;
- unsigned char types[MAX_ORDER + 1];
+ unsigned long *nr, flags, total = 0;
+ unsigned char *types;

if (show_mem_node_skip(filter, zone_to_nid(zone), nodemask))
continue;
+
+ nr = kmalloc_array(MAX_ORDER + 1, sizeof(unsigned long), GFP_KERNEL);
+ if (!nr)
+ break;
+ types = kmalloc_array(MAX_ORDER + 1, sizeof(unsigned char), GFP_KERNEL);
+ if (!types) {
+ kfree(nr);
+ break;
+ }
+
show_node(zone);
printk(KERN_CONT "%s: ", zone->name);

@@ -7649,8 +7659,8 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
lruvec_init(&pgdat->__lruvec);
}

-static void __meminit zone_init_internals(struct zone *zone, enum zone_type idx, int nid,
- unsigned long remaining_pages)
+static void __init zone_init_internals(struct zone *zone, enum zone_type idx, int nid,
+ unsigned long remaining_pages, bool hotplug)
{
atomic_long_set(&zone->managed_pages, remaining_pages);
zone_set_nid(zone, nid);
@@ -7659,6 +7669,16 @@ static void __meminit zone_init_internals(struct zone *zone, enum zone_type idx,
spin_lock_init(&zone->lock);
zone_seqlock_init(zone);
zone_pcp_init(zone);
+ if (hotplug)
+ zone->free_area =
+ kcalloc_node(MAX_ORDER + 1, sizeof(struct free_area),
+ GFP_KERNEL, nid);
+ else
+ zone->free_area =
+ memblock_alloc_node(sizeof(struct free_area) * (MAX_ORDER + 1),
+ sizeof(struct free_area), nid);
+ BUG_ON(!zone->free_area);
+
}

/*
@@ -7697,7 +7717,7 @@ void __ref free_area_init_core_hotplug(struct pglist_data *pgdat)
}

for (z = 0; z < MAX_NR_ZONES; z++)
- zone_init_internals(&pgdat->node_zones[z], z, nid, 0);
+ zone_init_internals(&pgdat->node_zones[z], z, nid, 0, true);
}
#endif

@@ -7760,7 +7780,7 @@ static void __init free_area_init_core(struct pglist_data *pgdat)
* when the bootmem allocator frees pages into the buddy system.
* And all highmem pages will be managed by the buddy system.
*/
- zone_init_internals(zone, j, nid, freesize);
+ zone_init_internals(zone, j, nid, freesize, false);

if (!size)
continue;
--
2.35.1
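
The allocate-then-unwind pattern that ttm_pool_init() adopts above can be sketched in userspace C. This is only an illustration under stated assumptions, not the TTM code: struct pool, struct pool_type, NUM_CACHING_TYPES, pool_init() and pool_fini() are hypothetical stand-ins, and calloc()/free() take the place of kvcalloc()/kvfree(). The loop index is signed here so that the countdown in the error path can terminate; an unsigned index never drops below zero.

```c
#include <assert.h>
#include <stdlib.h>

#define NUM_CACHING_TYPES 3	/* hypothetical stand-in for TTM_NUM_CACHING_TYPES */

/* Hypothetical stand-in for struct ttm_pool_type. */
struct pool_type {
	int caching;
	unsigned int order;
};

/* Hypothetical stand-in for struct ttm_pool with per-type dynamic arrays. */
struct pool {
	struct {
		struct pool_type *orders;
	} caching[NUM_CACHING_TYPES];
};

/*
 * Allocate max_order + 1 entries per caching type.
 * Returns 0 on success, -1 if an allocation fails; on failure every
 * array allocated so far is released again.
 */
static int pool_init(struct pool *pool, unsigned int max_order)
{
	int i;			/* signed, so "--i >= 0" can become false */
	unsigned int j;

	assert(pool != NULL);

	for (i = 0; i < NUM_CACHING_TYPES; ++i) {
		pool->caching[i].orders =
			calloc((size_t)max_order + 1, sizeof(struct pool_type));
		if (!pool->caching[i].orders)
			goto failed;
		for (j = 0; j <= max_order; ++j) {
			pool->caching[i].orders[j].caching = i;
			pool->caching[i].orders[j].order = j;
		}
	}
	return 0;

failed:
	/* Entry i was never allocated; release entries i - 1 down to 0. */
	while (--i >= 0) {
		free(pool->caching[i].orders);
		pool->caching[i].orders = NULL;
	}
	return -1;
}

/* Release all per-type arrays. */
static void pool_fini(struct pool *pool)
{
	int i;

	assert(pool != NULL);

	for (i = 0; i < NUM_CACHING_TYPES; ++i) {
		free(pool->caching[i].orders);
		pool->caching[i].orders = NULL;
	}
}
```

A caller would check the return value of pool_init() before using the pool and pair every successful call with pool_fini(), which mirrors why the patch changes ttm_pool_init() from void to int.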

2022-08-11 23:57:47

by Zi Yan

Subject: [RFC PATCH v2 04/12] mm: adapt deferred struct page init to new MAX_ORDER.

From: Zi Yan <[email protected]>

deferred_init only initializes the first section of a zone and defers the
rest, which is later initialized one section at a time. When MAX_ORDER
grows beyond a section size, early_page_uninitialised() does not prevent
pages beyond the first section from being initialized, since it only
checks the starting pfn and assumes MAX_ORDER is smaller than a section
size. In addition, deferred_init_maxorder() uses MAX_ORDER_NR_PAGES as
the initialization unit, which can cause the initialized chunk of memory
to overlap with other initialization jobs.

For the first issue, make early_page_uninitialised() decrease the order
for non-deferred memory initialization when it is bigger than the first
section. For the second issue, when adjusting the pfn alignment in
deferred_init_maxorder(), make sure the alignment is not bigger than
a section size.

Signed-off-by: Zi Yan <[email protected]>
---
mm/internal.h | 2 +-
mm/memblock.c | 6 ++++--
mm/page_alloc.c | 26 +++++++++++++++++++-------
3 files changed, 24 insertions(+), 10 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 1433e3a6fdd0..cbe745670c6e 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -355,7 +355,7 @@ extern int __isolate_free_page(struct page *page, unsigned int order);
extern void __putback_isolated_page(struct page *page, unsigned int order,
int mt);
extern void memblock_free_pages(struct page *page, unsigned long pfn,
- unsigned int order);
+ unsigned int *order);
extern void __free_pages_core(struct page *page, unsigned int order);
extern void prep_compound_page(struct page *page, unsigned int order);
extern void post_alloc_hook(struct page *page, unsigned int order,
diff --git a/mm/memblock.c b/mm/memblock.c
index d1525463c05e..dc2ce6df8fe3 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1640,7 +1640,9 @@ void __init memblock_free_late(phys_addr_t base, phys_addr_t size)
end = PFN_DOWN(base + size);

for (; cursor < end; cursor++) {
- memblock_free_pages(pfn_to_page(cursor), cursor, 0);
+ unsigned int order = 0;
+
+ memblock_free_pages(pfn_to_page(cursor), cursor, &order);
totalram_pages_inc();
}
}
@@ -2035,7 +2037,7 @@ static void __init __free_pages_memory(unsigned long start, unsigned long end)
while (start + (1UL << order) > end)
order--;

- memblock_free_pages(pfn_to_page(start), start, order);
+ memblock_free_pages(pfn_to_page(start), start, &order);

start += (1UL << order);
}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 07ad8074950f..3f3af7cd5164 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -463,13 +463,19 @@ static inline bool deferred_pages_enabled(void)
}

/* Returns true if the struct page for the pfn is uninitialised */
-static inline bool __meminit early_page_uninitialised(unsigned long pfn)
+static inline bool __meminit early_page_uninitialised(unsigned long pfn, unsigned int *order)
{
int nid = early_pfn_to_nid(pfn);

if (node_online(nid) && pfn >= NODE_DATA(nid)->first_deferred_pfn)
return true;

+ /* clamp down order to not exceed first_deferred_pfn */
+ if (order)
+ *order = min_t(unsigned int,
+ *order,
+ ilog2(NODE_DATA(nid)->first_deferred_pfn - pfn));
+
return false;
}

@@ -515,7 +521,7 @@ static inline bool deferred_pages_enabled(void)
return false;
}

-static inline bool early_page_uninitialised(unsigned long pfn)
+static inline bool early_page_uninitialised(unsigned long pfn, unsigned int *order)
{
return false;
}
@@ -1644,7 +1650,7 @@ static void __meminit init_reserved_page(unsigned long pfn)
pg_data_t *pgdat;
int nid, zid;

- if (!early_page_uninitialised(pfn))
+ if (!early_page_uninitialised(pfn, NULL))
return;

nid = early_pfn_to_nid(pfn);
@@ -1800,11 +1806,11 @@ int __meminit early_pfn_to_nid(unsigned long pfn)
#endif /* CONFIG_NUMA */

void __init memblock_free_pages(struct page *page, unsigned long pfn,
- unsigned int order)
+ unsigned int *order)
{
- if (early_page_uninitialised(pfn))
+ if (early_page_uninitialised(pfn, order))
return;
- __free_pages_core(page, order);
+ __free_pages_core(page, *order);
}

/*
@@ -2030,7 +2036,13 @@ static unsigned long __init
deferred_init_maxorder(u64 *i, struct zone *zone, unsigned long *start_pfn,
unsigned long *end_pfn)
{
- unsigned long mo_pfn = ALIGN(*start_pfn + 1, MAX_ORDER_NR_PAGES);
+ /*
+ * deferred_init_memmap_chunk gives out jobs with max size to
+ * PAGES_PER_SECTION. Do not align mo_pfn beyond that.
+ */
+ unsigned long align = min_t(unsigned long,
+ MAX_ORDER_NR_PAGES, PAGES_PER_SECTION);
+ unsigned long mo_pfn = ALIGN(*start_pfn + 1, align);
unsigned long spfn = *start_pfn, epfn = *end_pfn;
unsigned long nr_pages = 0;
u64 j = *i;
--
2.35.1
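
The two adjustments in the patch above, clamping the order so a freed block stops at first_deferred_pfn and capping the alignment at one section, can be tried out with plain integers. The following is a userspace sketch under stated assumptions: ilog2_ul(), clamp_order(), align_up() and next_chunk_end() are hypothetical helper names, not kernel functions, and the section and MAX_ORDER sizes are passed in as ordinary arguments.

```c
#include <assert.h>

/* floor(log2(n)) for n > 0; a userspace stand-in for the kernel's ilog2(). */
static unsigned int ilog2_ul(unsigned long n)
{
	unsigned int log = 0;

	assert(n > 0);
	while (n >>= 1)
		log++;
	return log;
}

/*
 * Reduce order so that a block of 2^order pages starting at pfn does not
 * extend past first_deferred_pfn. Only meaningful for pfn below the limit,
 * which is the case the patch reaches after its early "return true".
 */
static unsigned int clamp_order(unsigned long pfn,
				unsigned long first_deferred_pfn,
				unsigned int order)
{
	unsigned int fit;

	assert(pfn < first_deferred_pfn);
	fit = ilog2_ul(first_deferred_pfn - pfn);
	return order < fit ? order : fit;
}

/* Round x up to a multiple of a, where a is a power of two. */
static unsigned long align_up(unsigned long x, unsigned long a)
{
	return (x + a - 1) & ~(a - 1);
}

/*
 * End pfn of the next initialization chunk: aligned to the smaller of the
 * MAX_ORDER block size and the section size, so a chunk never crosses into
 * a section that another job may be initializing.
 */
static unsigned long next_chunk_end(unsigned long start_pfn,
				    unsigned long max_order_nr_pages,
				    unsigned long pages_per_section)
{
	unsigned long align = max_order_nr_pages < pages_per_section ?
			      max_order_nr_pages : pages_per_section;

	return align_up(start_pfn + 1, align);
}
```

For example, with a limit of 32768 pages and a requested order of 18, clamp_order(0, 32768, 18) yields 15, and next_chunk_end(0, 1UL << 18, 1UL << 15) stops at pfn 32768 instead of 262144.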

2022-08-13 01:15:43

by Randy Dunlap

Subject: Re: [RFC PATCH v2 09/12] mm: Make MAX_ORDER of buddy allocator configurable via Kconfig SET_MAX_ORDER.

Hi--

On 8/11/22 16:16, Zi Yan wrote:

> diff --git a/mm/Kconfig b/mm/Kconfig
> index bbe31e85afee..e558f5679707 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -441,6 +441,20 @@ config SPARSEMEM_VMEMMAP
> pfn_to_page and page_to_pfn operations. This is the most
> efficient option when sufficient kernel resources are available.
>
> +config SET_MAX_ORDER
> + int "Set maximum order of buddy allocator"
> + depends on SPARSEMEM_VMEMMAP && (ARCH_FORCE_MAX_ORDER = 0)
> + range 10 255
> + default "10"
> + help
> + The kernel memory allocator divides physically contiguous memory
> + blocks into "zones", where each zone is a power of two number of
> + pages. This option selects the largest power of two that the kernel
> + keeps in the memory allocator. If you need to allocate very large
> + blocks of physically contiguous memory, then you may need to
> + increase this value. A value of 10 means that the largest free memory
> + block is 2^10 pages.

Please make sure that all lines of help text are indented with one tab + 2 spaces,
as specified in Documentation/process/coding-style.rst.

thanks.
--
~Randy
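
As a worked example of the arithmetic in the SET_MAX_ORDER help text quoted above ("a value of 10 means that the largest free memory block is 2^10 pages"), the sketch below converts between an order and a block size in bytes. It assumes 4KB base pages (a page shift of 12), which is the configuration the cover letter reports testing; order_to_bytes() and bytes_to_order() are hypothetical helper names used only for illustration.

```c
#include <assert.h>

#define PAGE_SHIFT_4K 12	/* assumption: 4KB base pages */

/* Size in bytes of a block of 2^order pages. */
static unsigned long long order_to_bytes(unsigned int order,
					 unsigned int page_shift)
{
	assert(order + page_shift < 64);
	return 1ULL << (order + page_shift);
}

/* Smallest order whose block size is at least the requested byte count. */
static unsigned int bytes_to_order(unsigned long long bytes,
				   unsigned int page_shift)
{
	unsigned int order = 0;

	while (order_to_bytes(order, page_shift) < bytes)
		order++;
	return order;
}
```

Under that assumption, order 10 corresponds to 4MB blocks, and a 1GB block needs order 18, which is well inside the 10 to 255 range the option accepts.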

2022-08-13 01:27:49

by Randy Dunlap

Subject: Re: [RFC PATCH v2 12/12] mm: make MAX_ORDER a kernel boot time parameter.

Hi--

On 8/11/22 16:16, Zi Yan wrote:
> diff --git a/mm/Kconfig b/mm/Kconfig
> index e558f5679707..acccb919d72d 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -455,6 +455,19 @@ config SET_MAX_ORDER
> increase this value. A value of 10 means that the largest free memory
> block is 2^10 pages.
>
> +config BOOT_TIME_MAX_ORDER
> + bool "Set maximum order of buddy allocator at boot time"
> + depends on SPARSEMEM_VMEMMAP && (ARCH_FORCE_MAX_ORDER != 0 || SET_MAX_ORDER != 0)
> + help
> + It enables users to set the maximum order of buddy allocator at system
> + boot time instead of a static MACRO set at compilation time. Systems with
> + a lot of memory might want to allocate large pages whereas it is much
> + less feasible and desirable for systems with less memory. This option
> + allows different systems to control the largest page they want to
> + allocate. By default, MAX_ORDER will be set to ARCH_FORCE_MAX_ORDER or
> + SET_MAX_ORDER, whichever is non-zero, when the boot time parameter is not
> + set. The maximum of MAX_ORDER is currently limited at 256.

Please make sure that all lines of help text are indented with one tab + 2 spaces,
as specified in Documentation/process/coding-style.rst.

Thanks.
--
~Randy

2022-08-13 02:48:32

by Zi Yan

Subject: Re: [RFC PATCH v2 12/12] mm: make MAX_ORDER a kernel boot time parameter.



On 12 Aug 2022, at 21:11, Randy Dunlap wrote:

> Hi--
>
> On 8/11/22 16:16, Zi Yan wrote:
>> diff --git a/mm/Kconfig b/mm/Kconfig
>> index e558f5679707..acccb919d72d 100644
>> --- a/mm/Kconfig
>> +++ b/mm/Kconfig
>> @@ -455,6 +455,19 @@ config SET_MAX_ORDER
>> increase this value. A value of 10 means that the largest free memory
>> block is 2^10 pages.
>>
>> +config BOOT_TIME_MAX_ORDER
>> + bool "Set maximum order of buddy allocator at boot time"
>> + depends on SPARSEMEM_VMEMMAP && (ARCH_FORCE_MAX_ORDER != 0 || SET_MAX_ORDER != 0)
>> + help
>> + It enables users to set the maximum order of buddy allocator at system
>> + boot time instead of a static MACRO set at compilation time. Systems with
>> + a lot of memory might want to allocate large pages whereas it is much
>> + less feasible and desirable for systems with less memory. This option
>> + allows different systems to control the largest page they want to
>> + allocate. By default, MAX_ORDER will be set to ARCH_FORCE_MAX_ORDER or
>> + SET_MAX_ORDER, whichever is non-zero, when the boot time parameter is not
>> + set. The maximum of MAX_ORDER is currently limited at 256.
>
> Please make sure that all lines of help text are indented with one tab + 2 spaces,
> as specified in Documentation/process/coding-style.rst.

Thanks. I fixed it locally.

--
Best Regards,
Yan, Zi



2022-08-13 03:00:18

by Randy Dunlap

Subject: Re: [RFC PATCH v2 09/12] mm: Make MAX_ORDER of buddy allocator configurable via Kconfig SET_MAX_ORDER.



On 8/12/22 19:37, Zi Yan wrote:
>
> On 12 Aug 2022, at 21:11, Randy Dunlap wrote:
>
>> Hi--
>>
>> On 8/11/22 16:16, Zi Yan wrote:
>>
>>> diff --git a/mm/Kconfig b/mm/Kconfig
>>> index bbe31e85afee..e558f5679707 100644
>>> --- a/mm/Kconfig
>>> +++ b/mm/Kconfig
>>> @@ -441,6 +441,20 @@ config SPARSEMEM_VMEMMAP
>>> pfn_to_page and page_to_pfn operations. This is the most
>>> efficient option when sufficient kernel resources are available.
>>>
>>> +config SET_MAX_ORDER
>>> + int "Set maximum order of buddy allocator"
>>> + depends on SPARSEMEM_VMEMMAP && (ARCH_FORCE_MAX_ORDER = 0)
>>> + range 10 255
>>> + default "10"
>>> + help
>>> + The kernel memory allocator divides physically contiguous memory
>>> + blocks into "zones", where each zone is a power of two number of
>>> + pages. This option selects the largest power of two that the kernel
>>> + keeps in the memory allocator. If you need to allocate very large
>>> + blocks of physically contiguous memory, then you may need to
>>> + increase this value. A value of 10 means that the largest free memory
>>> + block is 2^10 pages.
>>
>> Please make sure that all lines of help text are indented with one tab + 2 spaces,
>> as specified in Documentation/process/coding-style.rst.
>
> I guess you mean the wrong indentation of "depends on" here, since all
> the help text is correctly indented. Thanks. I fixed it locally.

Oops, yes. Thanks.

--
~Randy

2022-08-13 03:09:47

by Zi Yan

Subject: Re: [RFC PATCH v2 09/12] mm: Make MAX_ORDER of buddy allocator configurable via Kconfig SET_MAX_ORDER.


On 12 Aug 2022, at 21:11, Randy Dunlap wrote:

> Hi--
>
> On 8/11/22 16:16, Zi Yan wrote:
>
>> diff --git a/mm/Kconfig b/mm/Kconfig
>> index bbe31e85afee..e558f5679707 100644
>> --- a/mm/Kconfig
>> +++ b/mm/Kconfig
>> @@ -441,6 +441,20 @@ config SPARSEMEM_VMEMMAP
>> pfn_to_page and page_to_pfn operations. This is the most
>> efficient option when sufficient kernel resources are available.
>>
>> +config SET_MAX_ORDER
>> + int "Set maximum order of buddy allocator"
>> + depends on SPARSEMEM_VMEMMAP && (ARCH_FORCE_MAX_ORDER = 0)
>> + range 10 255
>> + default "10"
>> + help
>> + The kernel memory allocator divides physically contiguous memory
>> + blocks into "zones", where each zone is a power of two number of
>> + pages. This option selects the largest power of two that the kernel
>> + keeps in the memory allocator. If you need to allocate very large
>> + blocks of physically contiguous memory, then you may need to
>> + increase this value. A value of 10 means that the largest free memory
>> + block is 2^10 pages.
>
> Please make sure that all lines of help text are indented with one tab + 2 spaces,
> as specified in Documentation/process/coding-style.rst.

I guess you mean the wrong indentation of "depends on" here, since all
the help text is correctly indented. Thanks. I fixed it locally.

--
Best Regards,
Yan, Zi



2022-08-23 14:09:09

by David Hildenbrand

Subject: Re: [RFC PATCH v2 06/12] fs: proc: use pageblock_nr_pages for reschedule period in read_kcore()

On 12.08.22 01:16, Zi Yan wrote:
> From: Zi Yan <[email protected]>
>
> MAX_ORDER_NR_PAGES can be increased when it becomes a boot time parameter
> in later commits. To make sure read_kcore() reschedules its work at a
> constant period, use pageblock_nr_pages as the reschedule period instead,
> since pageblock_nr_pages is a constant and either the same or half of
> MAX_ORDER_NR_PAGES.
>
> Signed-off-by: Zi Yan <[email protected]>
> Cc: Mike Rapoport <[email protected]>
> Cc: David Hildenbrand <[email protected]>
> Cc: Oscar Salvador <[email protected]>
> Cc: Ying Chen <[email protected]>
> Cc: Feng Zhou <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> ---
> fs/proc/kcore.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
> index dff921f7ca33..7dc09d211b48 100644
> --- a/fs/proc/kcore.c
> +++ b/fs/proc/kcore.c
> @@ -491,7 +491,7 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
> }
> }
>
> - if (page_offline_frozen++ % MAX_ORDER_NR_PAGES == 0) {
> + if (page_offline_frozen++ % pageblock_nr_pages == 0) {
> page_offline_thaw();
> cond_resched();
> page_offline_freeze();

Yeah, the exact number doesn't actually matter here.

Reviewed-by: David Hildenbrand <[email protected]>

--
Thanks,

David / dhildenb
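
To see why the reschedule period matters once MAX_ORDER_NR_PAGES can grow, the modulo check from the read_kcore() hunk above can be modelled in userspace. This is a sketch under assumptions: count_yields() is a hypothetical helper, the counter is assumed to start at 0, and the kernel's thaw, cond_resched(), freeze sequence is reduced to incrementing a tally.

```c
#include <assert.h>

/*
 * Count how often a loop over nr_pages pages would yield when it yields
 * every time the running counter is a multiple of period.
 */
static unsigned long count_yields(unsigned long nr_pages, unsigned long period)
{
	unsigned long counter = 0, yields = 0, i;

	assert(period > 0);

	for (i = 0; i < nr_pages; i++) {
		if (counter++ % period == 0)
			yields++;	/* kernel: thaw, cond_resched(), freeze */
	}
	return yields;
}
```

Walking 2^18 pages with a period of 512 pages yields 512 times, while a period of 2^18 pages yields only once over the same range, so a period tied to a growing MAX_ORDER would make the loop yield less and less often.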