2023-08-04 07:55:44

by Shuai Xue

[permalink] [raw]
Subject: [PATCH v4 0/2] perf/core: Bail out early if the request AUX area is out of bound

changes since v3 (per James):
- change to return with existing value -ENOMEM
- split doc part to a separate patch since they are going to be merged through
separate trees.

changes since v2:
- remove unnecessary overflow check (per Peter)

changes since v1:
- drop out patch2 because it has been fixed on upstream (Thanks James for reminding)
- move sanity check into rb_alloc_aux (per Leo)
- add overflow check (per James)

Shuai Xue (2):
perf/core: Bail out early if the request AUX area is out of bound
perf record: Update docs regarding the maximum limitation of AUX area

kernel/events/ring_buffer.c | 3 +++
tools/perf/Documentation/perf-record.txt | 3 +++
2 files changed, 6 insertions(+)

--
2.39.3



2023-08-04 07:57:39

by Shuai Xue

[permalink] [raw]
Subject: [PATCH v4 1/2] perf/core: Bail out early if the request AUX area is out of bound

When perf-record with a large AUX area, e.g 4GB, it fails with:

#perf record -C 0 -m ,4G -e arm_spe_0// -- sleep 1
failed to mmap with 12 (Cannot allocate memory)

and it reveals a WARNING with __alloc_pages():

[ 66.595604] ------------[ cut here ]------------
[ 66.600206] WARNING: CPU: 44 PID: 17573 at mm/page_alloc.c:5568 __alloc_pages+0x1ec/0x248
[ 66.608375] Modules linked in: ip6table_filter(E) ip6_tables(E) iptable_filter(E) ebtable_nat(E) ebtables(E) aes_ce_blk(E) vfat(E) fat(E) aes_ce_cipher(E) crct10dif_ce(E) ghash_ce(E) sm4_ce_cipher(E) sm4(E) sha2_ce(E) sha256_arm64(E) sha1_ce(E) acpi_ipmi(E) sbsa_gwdt(E) sg(E) ipmi_si(E) ipmi_devintf(E) ipmi_msghandler(E) ip_tables(E) sd_mod(E) ast(E) drm_kms_helper(E) syscopyarea(E) sysfillrect(E) nvme(E) sysimgblt(E) i2c_algo_bit(E) nvme_core(E) drm_shmem_helper(E) ahci(E) t10_pi(E) libahci(E) drm(E) crc64_rocksoft(E) i40e(E) crc64(E) libata(E) i2c_core(E)
[ 66.657719] CPU: 44 PID: 17573 Comm: perf Kdump: loaded Tainted: G E 6.3.0-rc4+ #58
[ 66.666749] Hardware name: Default Default/Default, BIOS 1.2.M1.AL.P.139.00 03/22/2023
[ 66.674650] pstate: 23400009 (nzCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
[ 66.681597] pc : __alloc_pages+0x1ec/0x248
[ 66.685680] lr : __kmalloc_large_node+0xc0/0x1f8
[ 66.690285] sp : ffff800020523980
[ 66.693585] pmr_save: 000000e0
[ 66.696624] x29: ffff800020523980 x28: ffff000832975800 x27: 0000000000000000
[ 66.703746] x26: 0000000000100000 x25: 0000000000100000 x24: ffff8000083615d0
[ 66.710866] x23: 0000000000040dc0 x22: ffff000823d6d140 x21: 000000000000000b
[ 66.717987] x20: 000000000000000b x19: 0000000000000000 x18: 0000000000000030
[ 66.725108] x17: 0000000000000000 x16: ffff800008f05be8 x15: ffff000823d6d6d0
[ 66.732229] x14: 0000000000000000 x13: 343373656761705f x12: 726e202c30206574
[ 66.739350] x11: 00000000ffff7fff x10: 00000000ffff7fff x9 : ffff8000083af570
[ 66.746471] x8 : 00000000000bffe8 x7 : c0000000ffff7fff x6 : 000000000005fff4
[ 66.753592] x5 : 0000000000000000 x4 : ffff000823d6d8d8 x3 : 0000000000000000
[ 66.760713] x2 : 0000000000000000 x1 : 0000000000000001 x0 : 0000000000040dc0
[ 66.767834] Call trace:
[ 66.770267] __alloc_pages+0x1ec/0x248
[ 66.774003] __kmalloc_large_node+0xc0/0x1f8
[ 66.778259] __kmalloc_node+0x134/0x1e8
[ 66.782081] rb_alloc_aux+0xe0/0x298
[ 66.785643] perf_mmap+0x440/0x660
[ 66.789031] mmap_region+0x308/0x8a8
[ 66.792593] do_mmap+0x3c0/0x528
[ 66.795807] vm_mmap_pgoff+0xf4/0x1b8
[ 66.799456] ksys_mmap_pgoff+0x18c/0x218
[ 66.803365] __arm64_sys_mmap+0x38/0x58
[ 66.807187] invoke_syscall+0x50/0x128
[ 66.810922] el0_svc_common.constprop.0+0x58/0x188
[ 66.815698] do_el0_svc+0x34/0x50
[ 66.818999] el0_svc+0x34/0x108
[ 66.822127] el0t_64_sync_handler+0xb8/0xc0
[ 66.826296] el0t_64_sync+0x1a4/0x1a8
[ 66.829946] ---[ end trace 0000000000000000 ]---

'rb->aux_pages' allocated by kcalloc() is a pointer array which is used to
maintains AUX trace pages. The allocated page for this array is physically
contiguous (and virtually contiguous) with an order of 0..MAX_ORDER. If the
size of pointer array crosses the limitation set by MAX_ORDER, it reveals a
WARNING.

So bail out early with -ENOMEM if the request AUX area is out of bound,
e.g.:

#perf record -C 0 -m ,4G -e arm_spe_0// -- sleep 1
failed to mmap with 12 (Cannot allocate memory)

Signed-off-by: Shuai Xue <[email protected]>
---
kernel/events/ring_buffer.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index a0433f37b024..c445e927368d 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -699,6 +699,9 @@ int rb_alloc_aux(struct perf_buffer *rb, struct perf_event *event,
watermark = 0;
}

+ /* Can't allocate more than MAX_ORDER */
+ if (get_order((unsigned long)nr_pages * sizeof(void *)) > MAX_ORDER)
+ return -ENOMEM;
rb->aux_pages = kcalloc_node(nr_pages, sizeof(void *), GFP_KERNEL,
node);
if (!rb->aux_pages)
--
2.39.3


2023-08-04 08:27:41

by Shuai Xue

[permalink] [raw]
Subject: [PATCH v4 2/2] perf record: Update docs regarding the maximum limitation of AUX area

The maximum AUX area is limited by the page size of the system. Update
the documentation to reflect this.

Signed-off-by: Shuai Xue <[email protected]>
---
tools/perf/Documentation/perf-record.txt | 3 +++
1 file changed, 3 insertions(+)

diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt
index 680396c56bd1..b0ee7b63da0e 100644
--- a/tools/perf/Documentation/perf-record.txt
+++ b/tools/perf/Documentation/perf-record.txt
@@ -292,6 +292,9 @@ OPTIONS
Also, by adding a comma, the number of mmap pages for AUX
area tracing can be specified.

+ The maximum AUX area is limited by the page size of the system. For
+ example with 4K pages configured, the maximum is 2GiB.
+
-g::
Enables call-graph (stack chain/backtrace) recording for both
kernel space and user space.
--
2.39.3


2023-08-04 09:23:59

by Leo Yan

[permalink] [raw]
Subject: Re: [PATCH v4 2/2] perf record: Update docs regarding the maximum limitation of AUX area

On Fri, Aug 04, 2023 at 03:29:45PM +0800, Shuai Xue wrote:
> The maximum AUX area is limited by the page size of the system. Update
> the documentation to reflect this.
>
> Signed-off-by: Shuai Xue <[email protected]>
> ---
> tools/perf/Documentation/perf-record.txt | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt
> index 680396c56bd1..b0ee7b63da0e 100644
> --- a/tools/perf/Documentation/perf-record.txt
> +++ b/tools/perf/Documentation/perf-record.txt
> @@ -292,6 +292,9 @@ OPTIONS
> Also, by adding a comma, the number of mmap pages for AUX
> area tracing can be specified.
>
> + The maximum AUX area is limited by the page size of the system. For
> + example with 4K pages configured, the maximum is 2GiB.
> +

This statement is incorrect as it fails to give out prerequisites.

E.g., on Arm64, for 4KiB, 16KiB or 64KiB base page size, different page
size has different default values for MAX_ORDER. Furthermore, MAX_ORDER
can be set by config ARCH_FORCE_MAX_ORDER, thus we cannot arbitrarily
say the maximum allocation size is 2GiB for 4KiB page size.

Maybe we could consider to use a formula to present the avaliable
maximum buffer size:

PAGE_SIZE << MAX_ORDER
( ---------------------- ) * PAGE_SIZE
sizeof(page_pointer)

PAGE_SIZE << MAX_ORDER : the size of maximal physically
contiguous allocations, which is the maximum size can be
allocated by slab/slub.

Thanks,
Leo

> -g::
> Enables call-graph (stack chain/backtrace) recording for both
> kernel space and user space.
> --
> 2.39.3
>

2023-08-04 09:31:16

by Leo Yan

[permalink] [raw]
Subject: Re: [PATCH v4 1/2] perf/core: Bail out early if the request AUX area is out of bound

On Fri, Aug 04, 2023 at 03:29:44PM +0800, Shuai Xue wrote:
> When perf-record with a large AUX area, e.g 4GB, it fails with:
>
> #perf record -C 0 -m ,4G -e arm_spe_0// -- sleep 1
> failed to mmap with 12 (Cannot allocate memory)
>
> and it reveals a WARNING with __alloc_pages():
>
> [ 66.595604] ------------[ cut here ]------------
> [ 66.600206] WARNING: CPU: 44 PID: 17573 at mm/page_alloc.c:5568 __alloc_pages+0x1ec/0x248
> [ 66.608375] Modules linked in: ip6table_filter(E) ip6_tables(E) iptable_filter(E) ebtable_nat(E) ebtables(E) aes_ce_blk(E) vfat(E) fat(E) aes_ce_cipher(E) crct10dif_ce(E) ghash_ce(E) sm4_ce_cipher(E) sm4(E) sha2_ce(E) sha256_arm64(E) sha1_ce(E) acpi_ipmi(E) sbsa_gwdt(E) sg(E) ipmi_si(E) ipmi_devintf(E) ipmi_msghandler(E) ip_tables(E) sd_mod(E) ast(E) drm_kms_helper(E) syscopyarea(E) sysfillrect(E) nvme(E) sysimgblt(E) i2c_algo_bit(E) nvme_core(E) drm_shmem_helper(E) ahci(E) t10_pi(E) libahci(E) drm(E) crc64_rocksoft(E) i40e(E) crc64(E) libata(E) i2c_core(E)
> [ 66.657719] CPU: 44 PID: 17573 Comm: perf Kdump: loaded Tainted: G E 6.3.0-rc4+ #58
> [ 66.666749] Hardware name: Default Default/Default, BIOS 1.2.M1.AL.P.139.00 03/22/2023
> [ 66.674650] pstate: 23400009 (nzCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
> [ 66.681597] pc : __alloc_pages+0x1ec/0x248
> [ 66.685680] lr : __kmalloc_large_node+0xc0/0x1f8
> [ 66.690285] sp : ffff800020523980
> [ 66.693585] pmr_save: 000000e0
> [ 66.696624] x29: ffff800020523980 x28: ffff000832975800 x27: 0000000000000000
> [ 66.703746] x26: 0000000000100000 x25: 0000000000100000 x24: ffff8000083615d0
> [ 66.710866] x23: 0000000000040dc0 x22: ffff000823d6d140 x21: 000000000000000b
> [ 66.717987] x20: 000000000000000b x19: 0000000000000000 x18: 0000000000000030
> [ 66.725108] x17: 0000000000000000 x16: ffff800008f05be8 x15: ffff000823d6d6d0
> [ 66.732229] x14: 0000000000000000 x13: 343373656761705f x12: 726e202c30206574
> [ 66.739350] x11: 00000000ffff7fff x10: 00000000ffff7fff x9 : ffff8000083af570
> [ 66.746471] x8 : 00000000000bffe8 x7 : c0000000ffff7fff x6 : 000000000005fff4
> [ 66.753592] x5 : 0000000000000000 x4 : ffff000823d6d8d8 x3 : 0000000000000000
> [ 66.760713] x2 : 0000000000000000 x1 : 0000000000000001 x0 : 0000000000040dc0
> [ 66.767834] Call trace:
> [ 66.770267] __alloc_pages+0x1ec/0x248
> [ 66.774003] __kmalloc_large_node+0xc0/0x1f8
> [ 66.778259] __kmalloc_node+0x134/0x1e8
> [ 66.782081] rb_alloc_aux+0xe0/0x298
> [ 66.785643] perf_mmap+0x440/0x660
> [ 66.789031] mmap_region+0x308/0x8a8
> [ 66.792593] do_mmap+0x3c0/0x528
> [ 66.795807] vm_mmap_pgoff+0xf4/0x1b8
> [ 66.799456] ksys_mmap_pgoff+0x18c/0x218
> [ 66.803365] __arm64_sys_mmap+0x38/0x58
> [ 66.807187] invoke_syscall+0x50/0x128
> [ 66.810922] el0_svc_common.constprop.0+0x58/0x188
> [ 66.815698] do_el0_svc+0x34/0x50
> [ 66.818999] el0_svc+0x34/0x108
> [ 66.822127] el0t_64_sync_handler+0xb8/0xc0
> [ 66.826296] el0t_64_sync+0x1a4/0x1a8
> [ 66.829946] ---[ end trace 0000000000000000 ]---
>
> 'rb->aux_pages' allocated by kcalloc() is a pointer array which is used to
> maintains AUX trace pages. The allocated page for this array is physically
> contiguous (and virtually contiguous) with an order of 0..MAX_ORDER. If the
> size of pointer array crosses the limitation set by MAX_ORDER, it reveals a
> WARNING.
>
> So bail out early with -ENOMEM if the request AUX area is out of bound,
> e.g.:
>
> #perf record -C 0 -m ,4G -e arm_spe_0// -- sleep 1
> failed to mmap with 12 (Cannot allocate memory)
>
> Signed-off-by: Shuai Xue <[email protected]>
> ---
> kernel/events/ring_buffer.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
> index a0433f37b024..c445e927368d 100644
> --- a/kernel/events/ring_buffer.c
> +++ b/kernel/events/ring_buffer.c
> @@ -699,6 +699,9 @@ int rb_alloc_aux(struct perf_buffer *rb, struct perf_event *event,
> watermark = 0;
> }
>
> + /* Can't allocate more than MAX_ORDER */

The comment is confused. I'd like to refine it as:

/*
* kcalloc_node() is unable to allocate buffer if the size is larger
* than: PAGE_SIZE << MAX_ORDER; directly bail out in this case.
*/

To be honest, I am not sure if perf core maintainers like this kind
thing or not. Please seek their opinion before you move forward.

Thanks,
Leo

> + if (get_order((unsigned long)nr_pages * sizeof(void *)) > MAX_ORDER)
> + return -ENOMEM;
> rb->aux_pages = kcalloc_node(nr_pages, sizeof(void *), GFP_KERNEL,
> node);
> if (!rb->aux_pages)
> --
> 2.39.3
>

2023-08-04 11:59:04

by Shuai Xue

[permalink] [raw]
Subject: Re: [PATCH v4 1/2] perf/core: Bail out early if the request AUX area is out of bound



On 2023/8/4 16:59, Leo Yan wrote:
> On Fri, Aug 04, 2023 at 03:29:44PM +0800, Shuai Xue wrote:
>> When perf-record with a large AUX area, e.g 4GB, it fails with:
>>
>> #perf record -C 0 -m ,4G -e arm_spe_0// -- sleep 1
>> failed to mmap with 12 (Cannot allocate memory)
>>
>> and it reveals a WARNING with __alloc_pages():
>>
>> [ 66.595604] ------------[ cut here ]------------
>> [ 66.600206] WARNING: CPU: 44 PID: 17573 at mm/page_alloc.c:5568 __alloc_pages+0x1ec/0x248
>> [ 66.608375] Modules linked in: ip6table_filter(E) ip6_tables(E) iptable_filter(E) ebtable_nat(E) ebtables(E) aes_ce_blk(E) vfat(E) fat(E) aes_ce_cipher(E) crct10dif_ce(E) ghash_ce(E) sm4_ce_cipher(E) sm4(E) sha2_ce(E) sha256_arm64(E) sha1_ce(E) acpi_ipmi(E) sbsa_gwdt(E) sg(E) ipmi_si(E) ipmi_devintf(E) ipmi_msghandler(E) ip_tables(E) sd_mod(E) ast(E) drm_kms_helper(E) syscopyarea(E) sysfillrect(E) nvme(E) sysimgblt(E) i2c_algo_bit(E) nvme_core(E) drm_shmem_helper(E) ahci(E) t10_pi(E) libahci(E) drm(E) crc64_rocksoft(E) i40e(E) crc64(E) libata(E) i2c_core(E)
>> [ 66.657719] CPU: 44 PID: 17573 Comm: perf Kdump: loaded Tainted: G E 6.3.0-rc4+ #58
>> [ 66.666749] Hardware name: Default Default/Default, BIOS 1.2.M1.AL.P.139.00 03/22/2023
>> [ 66.674650] pstate: 23400009 (nzCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
>> [ 66.681597] pc : __alloc_pages+0x1ec/0x248
>> [ 66.685680] lr : __kmalloc_large_node+0xc0/0x1f8
>> [ 66.690285] sp : ffff800020523980
>> [ 66.693585] pmr_save: 000000e0
>> [ 66.696624] x29: ffff800020523980 x28: ffff000832975800 x27: 0000000000000000
>> [ 66.703746] x26: 0000000000100000 x25: 0000000000100000 x24: ffff8000083615d0
>> [ 66.710866] x23: 0000000000040dc0 x22: ffff000823d6d140 x21: 000000000000000b
>> [ 66.717987] x20: 000000000000000b x19: 0000000000000000 x18: 0000000000000030
>> [ 66.725108] x17: 0000000000000000 x16: ffff800008f05be8 x15: ffff000823d6d6d0
>> [ 66.732229] x14: 0000000000000000 x13: 343373656761705f x12: 726e202c30206574
>> [ 66.739350] x11: 00000000ffff7fff x10: 00000000ffff7fff x9 : ffff8000083af570
>> [ 66.746471] x8 : 00000000000bffe8 x7 : c0000000ffff7fff x6 : 000000000005fff4
>> [ 66.753592] x5 : 0000000000000000 x4 : ffff000823d6d8d8 x3 : 0000000000000000
>> [ 66.760713] x2 : 0000000000000000 x1 : 0000000000000001 x0 : 0000000000040dc0
>> [ 66.767834] Call trace:
>> [ 66.770267] __alloc_pages+0x1ec/0x248
>> [ 66.774003] __kmalloc_large_node+0xc0/0x1f8
>> [ 66.778259] __kmalloc_node+0x134/0x1e8
>> [ 66.782081] rb_alloc_aux+0xe0/0x298
>> [ 66.785643] perf_mmap+0x440/0x660
>> [ 66.789031] mmap_region+0x308/0x8a8
>> [ 66.792593] do_mmap+0x3c0/0x528
>> [ 66.795807] vm_mmap_pgoff+0xf4/0x1b8
>> [ 66.799456] ksys_mmap_pgoff+0x18c/0x218
>> [ 66.803365] __arm64_sys_mmap+0x38/0x58
>> [ 66.807187] invoke_syscall+0x50/0x128
>> [ 66.810922] el0_svc_common.constprop.0+0x58/0x188
>> [ 66.815698] do_el0_svc+0x34/0x50
>> [ 66.818999] el0_svc+0x34/0x108
>> [ 66.822127] el0t_64_sync_handler+0xb8/0xc0
>> [ 66.826296] el0t_64_sync+0x1a4/0x1a8
>> [ 66.829946] ---[ end trace 0000000000000000 ]---
>>
>> 'rb->aux_pages' allocated by kcalloc() is a pointer array which is used to
>> maintains AUX trace pages. The allocated page for this array is physically
>> contiguous (and virtually contiguous) with an order of 0..MAX_ORDER. If the
>> size of pointer array crosses the limitation set by MAX_ORDER, it reveals a
>> WARNING.
>>
>> So bail out early with -ENOMEM if the request AUX area is out of bound,
>> e.g.:
>>
>> #perf record -C 0 -m ,4G -e arm_spe_0// -- sleep 1
>> failed to mmap with 12 (Cannot allocate memory)
>>
>> Signed-off-by: Shuai Xue <[email protected]>
>> ---
>> kernel/events/ring_buffer.c | 3 +++
>> 1 file changed, 3 insertions(+)
>>
>> diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
>> index a0433f37b024..c445e927368d 100644
>> --- a/kernel/events/ring_buffer.c
>> +++ b/kernel/events/ring_buffer.c
>> @@ -699,6 +699,9 @@ int rb_alloc_aux(struct perf_buffer *rb, struct perf_event *event,
>> watermark = 0;
>> }
>>
>> + /* Can't allocate more than MAX_ORDER */
>
> The comment is confused. I'd like to refine it as:
>
> /*
> * kcalloc_node() is unable to allocate buffer if the size is larger
> * than: PAGE_SIZE << MAX_ORDER; directly bail out in this case.
> */

Hi, Leo,

Thank you for your quick feedback. The comment is simplified from Peter's reply in v2
version. Your refined comment is more detailed and it makes sense to me, I would like
to adopt it if @Peter has no other opinions.

> To be honest, I am not sure if perf core maintainers like this kind
> thing or not. Please seek their opinion before you move forward.
>

and hi, all perf core maintainers,

I have not received explicit objection from perf core maintainers @Peter or @James so
I moved forward to address their comments. It's fine to me to wait for more opinions from
perf core maintainers.

Best Regards,
Shuai





2023-08-04 12:48:16

by Shuai Xue

[permalink] [raw]
Subject: Re: [PATCH v4 2/2] perf record: Update docs regarding the maximum limitation of AUX area



On 2023/8/4 16:46, Leo Yan wrote:
> On Fri, Aug 04, 2023 at 03:29:45PM +0800, Shuai Xue wrote:
>> The maximum AUX area is limited by the page size of the system. Update
>> the documentation to reflect this.
>>
>> Signed-off-by: Shuai Xue <[email protected]>
>> ---
>> tools/perf/Documentation/perf-record.txt | 3 +++
>> 1 file changed, 3 insertions(+)
>>
>> diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt
>> index 680396c56bd1..b0ee7b63da0e 100644
>> --- a/tools/perf/Documentation/perf-record.txt
>> +++ b/tools/perf/Documentation/perf-record.txt
>> @@ -292,6 +292,9 @@ OPTIONS
>> Also, by adding a comma, the number of mmap pages for AUX
>> area tracing can be specified.
>>
>> + The maximum AUX area is limited by the page size of the system. For
>> + example with 4K pages configured, the maximum is 2GiB.
>> +
>
> This statement is incorrect as it fails to give out prerequisites.
>
> E.g., on Arm64, for 4KiB, 16KiB or 64KiB base page size, different page
> size has different default values for MAX_ORDER. Furthermore, MAX_ORDER
> can be set by config ARCH_FORCE_MAX_ORDER, thus we cannot arbitrarily
> say the maximum allocation size is 2GiB for 4KiB page size.
>

Hi, Leo,

You are right, thank you for point this out.

Maybe we could consider to use a formula to present the avaliable
> maximum buffer size:
>
> PAGE_SIZE << MAX_ORDER
> ( ---------------------- ) * PAGE_SIZE
> sizeof(page_pointer)
>
> PAGE_SIZE << MAX_ORDER : the size of maximal physically
> contiguous allocations, which is the maximum size can be
> allocated by slab/slub.

I agree that a formula to present that limitation is more accurate. But as
@James commented in last v3, "Minor nit: I wouldn't expect a Perf tool user
to know what "MAX_ORDER" is", how about to keep both:

The maximum AUX area is limited by the maximum physically contiguous memory
allocated from slab/slub. It can be calculated with following formula:


PAGE_SIZE << MAX_ORDER
( ---------------------- ) * PAGE_SIZE
sizeof(page_pointer)

For example with 4K pages and MAX_ORDER=10 configured, the maximum AUX area
is 2GiB.

Thank you.

Best Regards,
Shuai



2023-09-06 09:57:31

by Shuai Xue

[permalink] [raw]
Subject: Re: [PATCH v4 1/2] perf/core: Bail out early if the request AUX area is out of bound



On 2023/8/4 19:24, Shuai Xue wrote:
>
>
> On 2023/8/4 16:59, Leo Yan wrote:
>> On Fri, Aug 04, 2023 at 03:29:44PM +0800, Shuai Xue wrote:
>>> When perf-record with a large AUX area, e.g 4GB, it fails with:
>>>
>>> #perf record -C 0 -m ,4G -e arm_spe_0// -- sleep 1
>>> failed to mmap with 12 (Cannot allocate memory)
>>>
>>> and it reveals a WARNING with __alloc_pages():
>>>
>>> [ 66.595604] ------------[ cut here ]------------
>>> [ 66.600206] WARNING: CPU: 44 PID: 17573 at mm/page_alloc.c:5568 __alloc_pages+0x1ec/0x248
>>> [ 66.608375] Modules linked in: ip6table_filter(E) ip6_tables(E) iptable_filter(E) ebtable_nat(E) ebtables(E) aes_ce_blk(E) vfat(E) fat(E) aes_ce_cipher(E) crct10dif_ce(E) ghash_ce(E) sm4_ce_cipher(E) sm4(E) sha2_ce(E) sha256_arm64(E) sha1_ce(E) acpi_ipmi(E) sbsa_gwdt(E) sg(E) ipmi_si(E) ipmi_devintf(E) ipmi_msghandler(E) ip_tables(E) sd_mod(E) ast(E) drm_kms_helper(E) syscopyarea(E) sysfillrect(E) nvme(E) sysimgblt(E) i2c_algo_bit(E) nvme_core(E) drm_shmem_helper(E) ahci(E) t10_pi(E) libahci(E) drm(E) crc64_rocksoft(E) i40e(E) crc64(E) libata(E) i2c_core(E)
>>> [ 66.657719] CPU: 44 PID: 17573 Comm: perf Kdump: loaded Tainted: G E 6.3.0-rc4+ #58
>>> [ 66.666749] Hardware name: Default Default/Default, BIOS 1.2.M1.AL.P.139.00 03/22/2023
>>> [ 66.674650] pstate: 23400009 (nzCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
>>> [ 66.681597] pc : __alloc_pages+0x1ec/0x248
>>> [ 66.685680] lr : __kmalloc_large_node+0xc0/0x1f8
>>> [ 66.690285] sp : ffff800020523980
>>> [ 66.693585] pmr_save: 000000e0
>>> [ 66.696624] x29: ffff800020523980 x28: ffff000832975800 x27: 0000000000000000
>>> [ 66.703746] x26: 0000000000100000 x25: 0000000000100000 x24: ffff8000083615d0
>>> [ 66.710866] x23: 0000000000040dc0 x22: ffff000823d6d140 x21: 000000000000000b
>>> [ 66.717987] x20: 000000000000000b x19: 0000000000000000 x18: 0000000000000030
>>> [ 66.725108] x17: 0000000000000000 x16: ffff800008f05be8 x15: ffff000823d6d6d0
>>> [ 66.732229] x14: 0000000000000000 x13: 343373656761705f x12: 726e202c30206574
>>> [ 66.739350] x11: 00000000ffff7fff x10: 00000000ffff7fff x9 : ffff8000083af570
>>> [ 66.746471] x8 : 00000000000bffe8 x7 : c0000000ffff7fff x6 : 000000000005fff4
>>> [ 66.753592] x5 : 0000000000000000 x4 : ffff000823d6d8d8 x3 : 0000000000000000
>>> [ 66.760713] x2 : 0000000000000000 x1 : 0000000000000001 x0 : 0000000000040dc0
>>> [ 66.767834] Call trace:
>>> [ 66.770267] __alloc_pages+0x1ec/0x248
>>> [ 66.774003] __kmalloc_large_node+0xc0/0x1f8
>>> [ 66.778259] __kmalloc_node+0x134/0x1e8
>>> [ 66.782081] rb_alloc_aux+0xe0/0x298
>>> [ 66.785643] perf_mmap+0x440/0x660
>>> [ 66.789031] mmap_region+0x308/0x8a8
>>> [ 66.792593] do_mmap+0x3c0/0x528
>>> [ 66.795807] vm_mmap_pgoff+0xf4/0x1b8
>>> [ 66.799456] ksys_mmap_pgoff+0x18c/0x218
>>> [ 66.803365] __arm64_sys_mmap+0x38/0x58
>>> [ 66.807187] invoke_syscall+0x50/0x128
>>> [ 66.810922] el0_svc_common.constprop.0+0x58/0x188
>>> [ 66.815698] do_el0_svc+0x34/0x50
>>> [ 66.818999] el0_svc+0x34/0x108
>>> [ 66.822127] el0t_64_sync_handler+0xb8/0xc0
>>> [ 66.826296] el0t_64_sync+0x1a4/0x1a8
>>> [ 66.829946] ---[ end trace 0000000000000000 ]---
>>>
>>> 'rb->aux_pages' allocated by kcalloc() is a pointer array which is used to
>>> maintains AUX trace pages. The allocated page for this array is physically
>>> contiguous (and virtually contiguous) with an order of 0..MAX_ORDER. If the
>>> size of pointer array crosses the limitation set by MAX_ORDER, it reveals a
>>> WARNING.
>>>
>>> So bail out early with -ENOMEM if the request AUX area is out of bound,
>>> e.g.:
>>>
>>> #perf record -C 0 -m ,4G -e arm_spe_0// -- sleep 1
>>> failed to mmap with 12 (Cannot allocate memory)
>>>
>>> Signed-off-by: Shuai Xue <[email protected]>
>>> ---
>>> kernel/events/ring_buffer.c | 3 +++
>>> 1 file changed, 3 insertions(+)
>>>
>>> diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
>>> index a0433f37b024..c445e927368d 100644
>>> --- a/kernel/events/ring_buffer.c
>>> +++ b/kernel/events/ring_buffer.c
>>> @@ -699,6 +699,9 @@ int rb_alloc_aux(struct perf_buffer *rb, struct perf_event *event,
>>> watermark = 0;
>>> }
>>>
>>> + /* Can't allocate more than MAX_ORDER */
>>
>> The comment is confused. I'd like to refine it as:
>>
>> /*
>> * kcalloc_node() is unable to allocate buffer if the size is larger
>> * than: PAGE_SIZE << MAX_ORDER; directly bail out in this case.
>> */
>
> Hi, Leo,
>
> Thank you for your quick feedback. The comment is simplified from Peter's reply in v2
> version. Your refined comment is more detailed and it makes sense to me, I would like
> to adopt it if @Peter has no other opinions.
>
>> To be honest, I am not sure if perf core maintainers like this kind
>> thing or not. Please seek their opinion before you move forward.
>>
>
> and hi, all perf core maintainers,
>
> I have not received explicit objection from perf core maintainers @Peter or @James so
> I moved forward to address their comments. It's fine to me to wait for more opinions from
> perf core maintainers.
>
> Best Regards,
> Shuai
>

Hi, Leo, and all folks,

Any more comments? Should I move forward to send a new patch?

Thank you.

Best Regards,
Shuai

2023-09-06 10:28:50

by Shuai Xue

[permalink] [raw]
Subject: Re: [PATCH v4 1/2] perf/core: Bail out early if the request AUX area is out of bound



On 2023/9/6 15:02, Leo Yan wrote:
> Hi Shuai,
>
> On Wed, Sep 06, 2023 at 11:27:38AM +0800, Shuai Xue wrote:
>
> [...]
>
>>>>> + /* Can't allocate more than MAX_ORDER */
>>>>
>>>> The comment is confused. I'd like to refine it as:
>>>>
>>>> /*
>>>> * kcalloc_node() is unable to allocate buffer if the size is larger
>>>> * than: PAGE_SIZE << MAX_ORDER; directly bail out in this case.
>>>> */
>>>
>>> Hi, Leo,
>>>
>>> Thank you for your quick feedback. The comment is simplified from Peter's reply in v2
>>> version. Your refined comment is more detailed and it makes sense to me, I would like
>>> to adopt it if @Peter has no other opinions.
>>>
>>>> To be honest, I am not sure if perf core maintainers like this kind
>>>> thing or not. Please seek their opinion before you move forward.
>>>>
>>>
>>> and hi, all perf core maintainers,
>>>
>>> I have not received explicit objection from perf core maintainers @Peter or @James so
>>> I moved forward to address their comments. It's fine to me to wait for more opinions from
>>> perf core maintainers.
>>>
>>> Best Regards,
>>> Shuai
>>>
>>
>> Hi, Leo, and all folks,
>>
>> Any more comments? Should I move forward to send a new patch?
>
> I am afraid I cannot give a reliable suggestion.
>
> Anyway, I personally think the returned error value in this patch is
> better than the kernel oops since the kernel oops is a bit scary for
> tool's users ;) Another reason is the perf core layer should report
> error earlier rather than relying on the page buddy allocation layer
> to detect the memory allocation failure, which is easier for both
> developers and users to understand the failure.
>
> IMHO, a good practice is to respin a new patch set and send out for
> review. Good luck!
>

Hi, Leo,

Thanks for valuable comments.

I will send a new patch set.

Thank you.

Best Regards,
Shuai