2021-05-25 20:50:14

by Qian Cai

Subject: Arm64 crash while reading memory sysfs

Reverting the patchset "arm64: drop pfn_valid_within() and simplify pfn_valid()" [1] from today's linux-next fixed a crash while reading files under /sys/devices/system/memory.

[1] https://lore.kernel.org/kvmarm/[email protected]/

[ 247.669668][ T1443] kernel BUG at include/linux/mm.h:1383!
[ 247.675987][ T1443] Internal error: Oops - BUG: 0 [#1] SMP
[ 247.681472][ T1443] Modules linked in: loop processor efivarfs ip_tables x_tables ext4 mbcache jbd2 dm_mod igb i2c_algo_bit nvme mlx5_core i2c_core nvme_core firmware_class
[ 247.696894][ T1443] CPU: 15 PID: 1443 Comm: ranbug Not tainted 5.13.0-rc3-next-20210524+ #11
[ 247.705326][ T1443] Hardware name: MiTAC RAPTOR EV-883832-X3-0001/RAPTOR, BIOS 1.6 06/28/2020
[ 247.713842][ T1443] pstate: 60000005 (nZCv daif -PAN -UAO -TCO BTYPE=--)
[ 247.720536][ T1443] pc : test_pages_in_a_zone+0x23c/0x300
[ 247.725935][ T1443] lr : test_pages_in_a_zone+0x23c/0x300
[ 247.731327][ T1443] sp : ffff800023f8f670
[ 247.735327][ T1443] x29: ffff800023f8f670 x28: 000000000000a000 x27: 000000000000a000
[ 247.743156][ T1443] x26: ffffffbfffe00000 x25: ffff800011c6f738 x24: dfff800000000000
[ 247.750984][ T1443] x23: 0000000000002000 x22: ffff009f7efa29c0 x21: 0000000000000000
[ 247.758812][ T1443] x20: ffffffffffffffff x19: 0000000000008000 x18: ffff00084f9d3370
[ 247.766640][ T1443] x17: 0000000000000000 x16: 0000000000000007 x15: 0000000000000078
[ 247.774467][ T1443] x14: 0000000000000000 x13: ffff800011c6eea4 x12: ffff60136cee0574
[ 247.782295][ T1443] x11: 1fffe0136cee0573 x10: ffff60136cee0573 x9 : dfff800000000000
[ 247.790123][ T1443] x8 : ffff009b67702b9b x7 : 0000000000000001 x6 : ffff009b67702b98
[ 247.797951][ T1443] x5 : 00009fec9311fa8d x4 : ffff009b67702b98 x3 : 1fffe00109f3a529
[ 247.805778][ T1443] x2 : 0000000000000000 x1 : 0000000000000000 x0 : 0000000000000034
[ 247.813606][ T1443] Call trace:
[ 247.816738][ T1443] test_pages_in_a_zone+0x23c/0x300
[ 247.821784][ T1443] valid_zones_show+0x1e0/0x298
[ 247.826483][ T1443] dev_attr_show+0x50/0xc8
[ 247.830747][ T1443] sysfs_kf_seq_show+0x164/0x368
[ 247.835533][ T1443] kernfs_seq_show+0x130/0x198
[ 247.840143][ T1443] seq_read_iter+0x344/0xd50
[ 247.844581][ T1443] kernfs_fop_read_iter+0x32c/0x4a8
[ 247.849625][ T1443] new_sync_read+0x2bc/0x4e8
[ 247.854063][ T1443] vfs_read+0x18c/0x340
[ 247.858066][ T1443] ksys_read+0xf8/0x1e0
[ 247.862068][ T1443] __arm64_sys_read+0x74/0xa8
[ 247.866591][ T1443] invoke_syscall.constprop.0+0xdc/0x1d8
[ 247.872072][ T1443] do_el0_svc+0xe4/0x298
[ 247.876162][ T1443] el0_svc+0x20/0x30
[ 247.879906][ T1443] el0_sync_handler+0xb0/0xb8
[ 247.884429][ T1443] el0_sync+0x178/0x180
[ 247.888435][ T1443] Code: b0005ee1 912b8021 910b0021 97fc57ac (d4210000)
[ 247.895217][ T1443] ---[ end trace 4ff9f5cbe7443f54 ]---
[ 247.900522][ T1443] Kernel panic - not syncing: Oops - BUG: Fatal exception
[ 247.907501][ T1443] SMP: stopping secondary CPUs
[ 247.912122][ T1443] Kernel Offset: disabled
[ 247.916296][ T1443] CPU features: 0x00000251,20000846
[ 247.921340][ T1443] Memory Limit: none
[ 247.925100][ T1443] ---[ end Kernel panic - not syncing: Oops - BUG: Fatal exception ]---


2021-05-26 06:43:48

by Mike Rapoport

Subject: Re: Arm64 crash while reading memory sysfs

Hi,

On Tue, May 25, 2021 at 03:25:59PM +0000, Qian Cai (QUIC) wrote:
> Reverting the patchset "arm64: drop pfn_valid_within() and simplify pfn_valid()" [1] from today's linux-next fixed a crash while reading files under /sys/devices/system/memory.

Can you please send the beginning of the boot log, up to the
"Memory: xK/yK available ..."
line?

> [1] https://lore.kernel.org/kvmarm/[email protected]/
>
> [ 247.669668][ T1443] kernel BUG at include/linux/mm.h:1383!
> [ 247.675987][ T1443] Internal error: Oops - BUG: 0 [#1] SMP
> [ 247.681472][ T1443] Modules linked in: loop processor efivarfs ip_tables x_tables ext4 mbcache jbd2 dm_mod igb i2c_algo_bit nvme mlx5_core i2c_core nvme_core firmware_class
> [ 247.696894][ T1443] CPU: 15 PID: 1443 Comm: ranbug Not tainted 5.13.0-rc3-next-20210524+ #11
> [ 247.705326][ T1443] Hardware name: MiTAC RAPTOR EV-883832-X3-0001/RAPTOR, BIOS 1.6 06/28/2020
> [ 247.713842][ T1443] pstate: 60000005 (nZCv daif -PAN -UAO -TCO BTYPE=--)
> [ 247.720536][ T1443] pc : test_pages_in_a_zone+0x23c/0x300
> [ 247.725935][ T1443] lr : test_pages_in_a_zone+0x23c/0x300
> [ 247.731327][ T1443] sp : ffff800023f8f670
> [ 247.735327][ T1443] x29: ffff800023f8f670 x28: 000000000000a000 x27: 000000000000a000
> [ 247.743156][ T1443] x26: ffffffbfffe00000 x25: ffff800011c6f738 x24: dfff800000000000
> [ 247.750984][ T1443] x23: 0000000000002000 x22: ffff009f7efa29c0 x21: 0000000000000000
> [ 247.758812][ T1443] x20: ffffffffffffffff x19: 0000000000008000 x18: ffff00084f9d3370
> [ 247.766640][ T1443] x17: 0000000000000000 x16: 0000000000000007 x15: 0000000000000078
> [ 247.774467][ T1443] x14: 0000000000000000 x13: ffff800011c6eea4 x12: ffff60136cee0574
> [ 247.782295][ T1443] x11: 1fffe0136cee0573 x10: ffff60136cee0573 x9 : dfff800000000000
> [ 247.790123][ T1443] x8 : ffff009b67702b9b x7 : 0000000000000001 x6 : ffff009b67702b98
> [ 247.797951][ T1443] x5 : 00009fec9311fa8d x4 : ffff009b67702b98 x3 : 1fffe00109f3a529
> [ 247.805778][ T1443] x2 : 0000000000000000 x1 : 0000000000000000 x0 : 0000000000000034
> [ 247.813606][ T1443] Call trace:
> [ 247.816738][ T1443] test_pages_in_a_zone+0x23c/0x300
> [ 247.821784][ T1443] valid_zones_show+0x1e0/0x298
> [ 247.826483][ T1443] dev_attr_show+0x50/0xc8
> [ 247.830747][ T1443] sysfs_kf_seq_show+0x164/0x368
> [ 247.835533][ T1443] kernfs_seq_show+0x130/0x198
> [ 247.840143][ T1443] seq_read_iter+0x344/0xd50
> [ 247.844581][ T1443] kernfs_fop_read_iter+0x32c/0x4a8
> [ 247.849625][ T1443] new_sync_read+0x2bc/0x4e8
> [ 247.854063][ T1443] vfs_read+0x18c/0x340
> [ 247.858066][ T1443] ksys_read+0xf8/0x1e0
> [ 247.862068][ T1443] __arm64_sys_read+0x74/0xa8
> [ 247.866591][ T1443] invoke_syscall.constprop.0+0xdc/0x1d8
> [ 247.872072][ T1443] do_el0_svc+0xe4/0x298
> [ 247.876162][ T1443] el0_svc+0x20/0x30
> [ 247.879906][ T1443] el0_sync_handler+0xb0/0xb8
> [ 247.884429][ T1443] el0_sync+0x178/0x180
> [ 247.888435][ T1443] Code: b0005ee1 912b8021 910b0021 97fc57ac (d4210000)
> [ 247.895217][ T1443] ---[ end trace 4ff9f5cbe7443f54 ]---
> [ 247.900522][ T1443] Kernel panic - not syncing: Oops - BUG: Fatal exception
> [ 247.907501][ T1443] SMP: stopping secondary CPUs
> [ 247.912122][ T1443] Kernel Offset: disabled
> [ 247.916296][ T1443] CPU features: 0x00000251,20000846
> [ 247.921340][ T1443] Memory Limit: none
> [ 247.925100][ T1443] ---[ end Kernel panic - not syncing: Oops - BUG: Fatal exception ]---
>

--
Sincerely yours,
Mike.

2021-05-26 13:36:09

by Catalin Marinas

Subject: Re: Arm64 crash while reading memory sysfs

On Wed, May 26, 2021 at 12:09:14PM +0000, Qian Cai (QUIC) wrote:
> [ 0.000000] Early memory node ranges
> [ 0.000000] node 0: [mem 0x0000000090000000-0x0000000091ffffff]

Maybe de-selecting HOLES_IN_ZONE is not correct for arm64 in all
circumstances. In a configuration with 64K pages, MAX_ORDER is 14 and
MAX_ORDER_NR_PAGES is 8192, so a MAX_ORDER block spans a 2^29-byte address
range. However, the above range starts on a 2^28 boundary.

SECTION_SIZE_BITS is 29 in this configuration but the corresponding
mem_map[] in the first half of the first section is probably not marked
as reserved as we'd do for NOMAP.
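
The arithmetic above can be checked with a short user-space sketch. The
constants mirror the values quoted in this message for a 64K-page
configuration; the helper names are illustrative assumptions and are not
kernel APIs.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values quoted in this message for an arm64 64K-page configuration. */
#define PAGE_SHIFT         16                        /* 64K pages */
#define MAX_ORDER          14
#define MAX_ORDER_NR_PAGES (1UL << (MAX_ORDER - 1))  /* 8192 pages */
#define SECTION_SIZE_BITS  29

/* Bytes covered by one MAX_ORDER block (hypothetical helper). */
uint64_t max_order_span(void)
{
	return (uint64_t)MAX_ORDER_NR_PAGES << PAGE_SHIFT;
}

/* True if addr is a multiple of the power-of-two size 1 << bits. */
bool aligned_to(uint64_t addr, unsigned int bits)
{
	return (addr & ((UINT64_C(1) << bits) - 1)) == 0;
}

/* Start address of the section-sized block that contains addr. */
uint64_t section_start(uint64_t addr)
{
	return addr & ~((UINT64_C(1) << SECTION_SIZE_BITS) - 1);
}
```

With these values max_order_span() is 2^29 bytes, 0x90000000 is aligned to
2^28 but not to 2^29, and section_start(0x90000000) is 0x80000000, so the
first 2^28 bytes of that section lie below the first reported memory range.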

--
Catalin

2021-05-26 17:43:17

by Qian Cai

Subject: RE: Arm64 crash while reading memory sysfs



> -----Original Message-----
> From: Mike Rapoport <[email protected]>
> Sent: Wednesday, May 26, 2021 2:40 AM
> To: Qian Cai (QUIC) <[email protected]>
> Cc: Andrew Morton <[email protected]>; David Hildenbrand <[email protected]>; Catalin Marinas
> <[email protected]>; Anshuman Khandual <[email protected]>; Ard Biesheuvel <[email protected]>; Linux
> Memory Management List <[email protected]>; Will Deacon <[email protected]>; Marc Zyngier <[email protected]>; Linux Kernel
> Mailing List <[email protected]>; Linux ARM <[email protected]>
> Subject: Re: Arm64 crash while reading memory sysfs
>
> Hi,
>
> On Tue, May 25, 2021 at 03:25:59PM +0000, Qian Cai (QUIC) wrote:
> > Reverting the patchset "arm64: drop pfn_valid_within() and simplify pfn_valid()" [1] from today's linux-next fixed a crash while reading files under /sys/devices/system/memory.
>
> Can you please send the beginning of the boot log, up to the
> "Memory: xK/yK available ..."
> line?

[ 0.000000] Booting Linux on physical CPU 0x0000000000 [0x503f0002]
[ 0.000000] Linux version 5.13.0-rc3-next-20210525+ (root@admin5) (gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0, GNU ld (GNU Binutils for Ubuntu) 2.34) #27 SMP Tue May 25 19:03:24 UTC 2021
[ 0.000000] efi: EFI v2.70 by American Megatrends
[ 0.000000] efi: ACPI 2.0=0x9ff5b40000 SMBIOS 3.0=0x9ff686fd98 ESRT=0x9ff1d18298 MEMRESERVE=0x9fe6dbed98
[ 0.000000] esrt: Reserving ESRT space from 0x0000009ff1d18298 to 0x0000009ff1d182f8.
[ 0.000000] ACPI: Early table checksum verification disabled
[ 0.000000] ACPI: RSDP 0x0000009FF5B40000 000024 (v02 ALASKA)
[ 0.000000] ACPI: XSDT 0x0000009FF5B40028 000094 (v01 ALASKA A M I 01072009 AMI 00010013)
[ 0.000000] ACPI: FACP 0x0000009FF5B400C0 000114 (v06 Ampere eMAG 00000003 INTL 20190509)
[ 0.000000] ACPI: DSDT 0x0000009FF5B401D8 00765A (v05 ALASKA A M I 00000001 INTL 20190509)
[ 0.000000] ACPI: FIDT 0x0000009FF5B47838 00009C (v01 ALASKA A M I 01072009 AMI 00010013)
[ 0.000000] ACPI: DBG2 0x0000009FF5B478D8 000061 (v00 Ampere eMAG 00000000 INTL 20190509)
[ 0.000000] ACPI: GTDT 0x0000009FF5B47940 000108 (v02 Ampere eMAG 00000001 INTL 20190509)
[ 0.000000] ACPI: IORT 0x0000009FF5B47A48 000BCC (v00 Ampere eMAG 00000000 INTL 20190509)
[ 0.000000] ACPI: MCFG 0x0000009FF5B48618 0000AC (v01 Ampere eMAG 00000001 INTL 20190509)
[ 0.000000] ACPI: SSDT 0x0000009FF5B486C8 00002D (v02 Ampere eMAG 00000001 INTL 20190509)
[ 0.000000] ACPI: SPMI 0x0000009FF5B486F8 000041 (v05 ALASKA A M I 00000000 AMI. 00000000)
[ 0.000000] ACPI: APIC 0x0000009FF5B48740 000A68 (v04 Ampere eMAG 00000004 01000013)
[ 0.000000] ACPI: PCCT 0x0000009FF5B491A8 0005D0 (v01 Ampere eMAG 00000003 01000013)
[ 0.000000] ACPI: BERT 0x0000009FF5B49778 000030 (v01 Ampere eMAG 00000003 INTL 20190509)
[ 0.000000] ACPI: HEST 0x0000009FF5B497A8 000328 (v01 Ampere eMAG 00000003 INTL 20190509)
[ 0.000000] ACPI: SPCR 0x0000009FF5B49AD0 000050 (v02 A M I APTIO V 01072009 AMI. 0005000D)
[ 0.000000] ACPI: PPTT 0x0000009FF5B49B20 000CB8 (v01 Ampere eMAG 00000003 01000013)
[ 0.000000] ACPI: SPCR: console: pl011,mmio32,0x12600000,115200
[ 0.000000] NUMA: Failed to initialise from firmware
[ 0.000000] NUMA: Faking a node at [mem 0x0000000090000000-0x0000009fffffffff]
[ 0.000000] NUMA: NODE_DATA [mem 0x9ffefbabc0-0x9ffefbffff]
[ 0.000000] Zone ranges:
[ 0.000000] Normal [mem 0x0000000090000000-0x0000009fffffffff]
[ 0.000000] Movable zone start for each node
[ 0.000000] Early memory node ranges
[ 0.000000] node 0: [mem 0x0000000090000000-0x0000000091ffffff]
[ 0.000000] node 0: [mem 0x0000000092000000-0x00000000928fffff]
[ 0.000000] node 0: [mem 0x0000000092900000-0x00000000fffbffff]
[ 0.000000] node 0: [mem 0x00000000fffc0000-0x00000000ffffffff]
[ 0.000000] node 0: [mem 0x0000000880000000-0x0000000fffffffff]
[ 0.000000] node 0: [mem 0x0000008800000000-0x0000009ff5aeffff]
[ 0.000000] node 0: [mem 0x0000009ff5af0000-0x0000009ff5b2ffff]
[ 0.000000] node 0: [mem 0x0000009ff5b30000-0x0000009ff5baffff]
[ 0.000000] node 0: [mem 0x0000009ff5bb0000-0x0000009ff7deffff]
[ 0.000000] node 0: [mem 0x0000009ff7df0000-0x0000009ff7e5ffff]
[ 0.000000] node 0: [mem 0x0000009ff7e60000-0x0000009ff7ffffff]
[ 0.000000] node 0: [mem 0x0000009ff8000000-0x0000009fffffffff]
[ 0.000000] Initmem setup node 0 [mem 0x0000000090000000-0x0000009fffffffff]
[ 0.000000] kasan: KernelAddressSanitizer initialized
[ 0.000000] psci: probing for conduit method from ACPI.
[ 0.000000] psci: PSCIv1.0 detected in firmware.
[ 0.000000] psci: Using standard PSCI v0.2 function IDs
[ 0.000000] psci: MIGRATE_INFO_TYPE not supported.
[ 0.000000] psci: SMC Calling Convention v65535.65535
[ 0.000000] ACPI: SRAT not present
[ 0.000000] percpu: Embedded 10 pages/cpu s584592 r8192 d62576 u655360
[ 0.000000] pcpu-alloc: s584592 r8192 d62576 u655360 alloc=10*65536
[ 0.000000] pcpu-alloc: [0] 00 [0] 01 [0] 02 [0] 03 [0] 04 [0] 05 [0] 06 [0] 07
[ 0.000000] pcpu-alloc: [0] 08 [0] 09 [0] 10 [0] 11 [0] 12 [0] 13 [0] 14 [0] 15
[ 0.000000] pcpu-alloc: [0] 16 [0] 17 [0] 18 [0] 19 [0] 20 [0] 21 [0] 22 [0] 23
[ 0.000000] pcpu-alloc: [0] 24 [0] 25 [0] 26 [0] 27 [0] 28 [0] 29 [0] 30 [0] 31
[ 0.000000] Detected PIPT I-cache on CPU0
[ 0.000000] CPU features: detected: GIC system register CPU interface
[ 0.000000] CPU features: detected: Spectre-v2
[ 0.000000] CPU features: detected: Spectre-v4
[ 0.000000] CPU features: detected: Kernel page table isolation (KPTI)
[ 0.000000] Built 1 zonelists, mobility grouping on. Total pages: 2091012
[ 0.000000] Policy zone: Normal
[ 0.000000] Kernel command line: BOOT_IMAGE=/vmlinuz-5.13.0-rc3-next-20210525+ root=/dev/mapper/ubuntu--vg-ubuntu--lv ro cma=1024M iommu.passthrough=1
[ 0.000000] Unknown command line parameters: BOOT_IMAGE=/vmlinuz-5.13.0-rc3-next-20210525+ cma=1024M
[ 0.000000] Dentry cache hash table entries: 8388608 (order: 10, 67108864 bytes, linear)
[ 0.000000] Inode-cache hash table entries: 4194304 (order: 9, 33554432 bytes, linear)
[ 0.000000] mem auto-init: stack:off, heap alloc:on, heap free:off
[ 0.000000] Memory: 777216K/133955584K available (17920K kernel code, 118786K rwdata, 4416K rodata, 6080K init, 67276K bss, 17379072K reserved, 0K cma-reserved)

>
> > [1] https://lore.kernel.org/kvmarm/[email protected]/
> >
> > [ 247.669668][ T1443] kernel BUG at include/linux/mm.h:1383!
> > [ 247.675987][ T1443] Internal error: Oops - BUG: 0 [#1] SMP
> > [ 247.681472][ T1443] Modules linked in: loop processor efivarfs ip_tables x_tables ext4 mbcache jbd2 dm_mod igb i2c_algo_bit nvme mlx5_core i2c_core nvme_core firmware_class
> > [ 247.696894][ T1443] CPU: 15 PID: 1443 Comm: ranbug Not tainted 5.13.0-rc3-next-20210524+ #11
> > [ 247.705326][ T1443] Hardware name: MiTAC RAPTOR EV-883832-X3-0001/RAPTOR, BIOS 1.6 06/28/2020
> > [ 247.713842][ T1443] pstate: 60000005 (nZCv daif -PAN -UAO -TCO BTYPE=--)
> > [ 247.720536][ T1443] pc : test_pages_in_a_zone+0x23c/0x300
> > [ 247.725935][ T1443] lr : test_pages_in_a_zone+0x23c/0x300
> > [ 247.731327][ T1443] sp : ffff800023f8f670
> > [ 247.735327][ T1443] x29: ffff800023f8f670 x28: 000000000000a000 x27: 000000000000a000
> > [ 247.743156][ T1443] x26: ffffffbfffe00000 x25: ffff800011c6f738 x24: dfff800000000000
> > [ 247.750984][ T1443] x23: 0000000000002000 x22: ffff009f7efa29c0 x21: 0000000000000000
> > [ 247.758812][ T1443] x20: ffffffffffffffff x19: 0000000000008000 x18: ffff00084f9d3370
> > [ 247.766640][ T1443] x17: 0000000000000000 x16: 0000000000000007 x15: 0000000000000078
> > [ 247.774467][ T1443] x14: 0000000000000000 x13: ffff800011c6eea4 x12: ffff60136cee0574
> > [ 247.782295][ T1443] x11: 1fffe0136cee0573 x10: ffff60136cee0573 x9 : dfff800000000000
> > [ 247.790123][ T1443] x8 : ffff009b67702b9b x7 : 0000000000000001 x6 : ffff009b67702b98
> > [ 247.797951][ T1443] x5 : 00009fec9311fa8d x4 : ffff009b67702b98 x3 : 1fffe00109f3a529
> > [ 247.805778][ T1443] x2 : 0000000000000000 x1 : 0000000000000000 x0 : 0000000000000034
> > [ 247.813606][ T1443] Call trace:
> > [ 247.816738][ T1443] test_pages_in_a_zone+0x23c/0x300
> > [ 247.821784][ T1443] valid_zones_show+0x1e0/0x298
> > [ 247.826483][ T1443] dev_attr_show+0x50/0xc8
> > [ 247.830747][ T1443] sysfs_kf_seq_show+0x164/0x368
> > [ 247.835533][ T1443] kernfs_seq_show+0x130/0x198
> > [ 247.840143][ T1443] seq_read_iter+0x344/0xd50
> > [ 247.844581][ T1443] kernfs_fop_read_iter+0x32c/0x4a8
> > [ 247.849625][ T1443] new_sync_read+0x2bc/0x4e8
> > [ 247.854063][ T1443] vfs_read+0x18c/0x340
> > [ 247.858066][ T1443] ksys_read+0xf8/0x1e0
> > [ 247.862068][ T1443] __arm64_sys_read+0x74/0xa8
> > [ 247.866591][ T1443] invoke_syscall.constprop.0+0xdc/0x1d8
> > [ 247.872072][ T1443] do_el0_svc+0xe4/0x298
> > [ 247.876162][ T1443] el0_svc+0x20/0x30
> > [ 247.879906][ T1443] el0_sync_handler+0xb0/0xb8
> > [ 247.884429][ T1443] el0_sync+0x178/0x180
> > [ 247.888435][ T1443] Code: b0005ee1 912b8021 910b0021 97fc57ac (d4210000)
> > [ 247.895217][ T1443] ---[ end trace 4ff9f5cbe7443f54 ]---
> > [ 247.900522][ T1443] Kernel panic - not syncing: Oops - BUG: Fatal exception
> > [ 247.907501][ T1443] SMP: stopping secondary CPUs
> > [ 247.912122][ T1443] Kernel Offset: disabled
> > [ 247.916296][ T1443] CPU features: 0x00000251,20000846
> > [ 247.921340][ T1443] Memory Limit: none
> > [ 247.925100][ T1443] ---[ end Kernel panic - not syncing: Oops - BUG: Fatal exception ]---
> >
>
> --
> Sincerely yours,
> Mike.

2021-05-26 19:22:02

by Mike Rapoport

Subject: Re: Arm64 crash while reading memory sysfs

On Wed, May 26, 2021 at 12:09:14PM +0000, Qian Cai (QUIC) wrote:
> >
> > On Tue, May 25, 2021 at 03:25:59PM +0000, Qian Cai (QUIC) wrote:
> > > Reverting the patchset "arm64: drop pfn_valid_within() and simplify pfn_valid()" [1] from today's linux-next fixed a crash while
> > reading files under /sys/devices/system/memory.

Does the issue persist if you only revert the latest patch in the series?
In next-20210525 it would be commit
89fb47db72f2 ("arm64-drop-pfn_valid_within-and-simplify-pfn_valid-fix")
and commit
dfe215e9bac2 ("arm64: drop pfn_valid_within() and simplify pfn_valid()").

> > Can you please send the beginning of the boot log, up to the
> > "Memory: xK/yK available ..."
> > line?
>
> [ 0.000000] NUMA: Failed to initialise from firmware
> [ 0.000000] NUMA: Faking a node at [mem 0x0000000090000000-0x0000009fffffffff]
> [ 0.000000] NUMA: NODE_DATA [mem 0x9ffefbabc0-0x9ffefbffff]
> [ 0.000000] Zone ranges:
> [ 0.000000] Normal [mem 0x0000000090000000-0x0000009fffffffff]
> [ 0.000000] Movable zone start for each node
> [ 0.000000] Early memory node ranges
> [ 0.000000] node 0: [mem 0x0000000090000000-0x0000000091ffffff]
> [ 0.000000] node 0: [mem 0x0000000092000000-0x00000000928fffff]
> [ 0.000000] node 0: [mem 0x0000000092900000-0x00000000fffbffff]
> [ 0.000000] node 0: [mem 0x00000000fffc0000-0x00000000ffffffff]
> [ 0.000000] node 0: [mem 0x0000000880000000-0x0000000fffffffff]
> [ 0.000000] node 0: [mem 0x0000008800000000-0x0000009ff5aeffff]
> [ 0.000000] node 0: [mem 0x0000009ff5af0000-0x0000009ff5b2ffff]
> [ 0.000000] node 0: [mem 0x0000009ff5b30000-0x0000009ff5baffff]
> [ 0.000000] node 0: [mem 0x0000009ff5bb0000-0x0000009ff7deffff]
> [ 0.000000] node 0: [mem 0x0000009ff7df0000-0x0000009ff7e5ffff]
> [ 0.000000] node 0: [mem 0x0000009ff7e60000-0x0000009ff7ffffff]
> [ 0.000000] node 0: [mem 0x0000009ff8000000-0x0000009fffffffff]
> [ 0.000000] Initmem setup node 0 [mem 0x0000000090000000-0x0000009fffffffff]
> [ 0.000000] mem auto-init: stack:off, heap alloc:on, heap free:off
> [ 0.000000] Memory: 777216K/133955584K available (17920K kernel code, 118786K rwdata, 4416K rodata, 6080K init, 67276K bss, 17379072K reserved, 0K cma-reserved)

The available and reserved sizes look weird. Can you post the log with
memblock=debug and mminit_loglevel=4 added to the kernel command line?

> > > [1] https://lore.kernel.org/kvmarm/[email protected]/
> > >
> > > [ 247.669668][ T1443] kernel BUG at include/linux/mm.h:1383!
> > > [ 247.675987][ T1443] Internal error: Oops - BUG: 0 [#1] SMP
> > > [ 247.681472][ T1443] Modules linked in: loop processor efivarfs ip_tables x_tables ext4 mbcache jbd2 dm_mod igb i2c_algo_bit
> > nvme mlx5_core i2c_core nvme_core firmware_class
> > > [ 247.696894][ T1443] CPU: 15 PID: 1443 Comm: ranbug Not tainted 5.13.0-rc3-next-20210524+ #11
> > > [ 247.705326][ T1443] Hardware name: MiTAC RAPTOR EV-883832-X3-0001/RAPTOR, BIOS 1.6 06/28/2020
> > > [ 247.713842][ T1443] pstate: 60000005 (nZCv daif -PAN -UAO -TCO BTYPE=--)
> > > [ 247.720536][ T1443] pc : test_pages_in_a_zone+0x23c/0x300
> > > [ 247.725935][ T1443] lr : test_pages_in_a_zone+0x23c/0x300

Do we know what PFN triggers it? Can you please run with this patch:

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 70620d0dd923..b9d1dd0dae5f 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1443,6 +1443,12 @@ struct zone *test_pages_in_a_zone(unsigned long start_pfn,
 				i++;
 			if (i == MAX_ORDER_NR_PAGES || pfn + i >= end_pfn)
 				continue;
+
+			if (!pfn_valid(pfn))
+				pr_info("%s: pfn %lx is not valid\n", __func__, pfn);
+			else if (PagePoisoned(pfn_to_page(pfn)))
+				dump_page(pfn_to_page(pfn), "");
+
 			/* Check if we got outside of the zone */
 			if (zone && !zone_spans_pfn(zone, pfn + i))
 				return NULL;


--
Sincerely yours,
Mike.

2021-05-26 19:23:43

by Mike Rapoport

Subject: Re: Arm64 crash while reading memory sysfs

On Wed, May 26, 2021 at 02:04:26PM +0100, Catalin Marinas wrote:
> On Wed, May 26, 2021 at 12:09:14PM +0000, Qian Cai (QUIC) wrote:
> > [ 0.000000] Early memory node ranges
> > [ 0.000000] node 0: [mem 0x0000000090000000-0x0000000091ffffff]
>
> Maybe de-selecting HOLES_IN_ZONE is not correct for arm64 in all
> circumstances. In a configuration with 64K pages, MAX_ORDER is 14,
> MAX_ORDER_NR_PAGES is 8192, so a 2^29 address range. However, the above
> range starts on 2^28 boundary.
>
> SECTION_SIZE_BITS is 29 in this configuration but the corresponding
> mem_map[] in the first half of the first section is probably not marked
> as reserved as we'd do for NOMAP.

We do initialize (or at least we should) the first half of the first section in
page_alloc::init_unavailable_range(), so the range [0x80000000 - 0x90000000]
will have struct pages marked as reserved.

I think it should be fine to de-select HOLES_IN_ZONE as long as a MAX_ORDER
chunk does not exceed a section, because in that case we do have a memory map
there, and HOLES_IN_ZONE along with pfn_valid_within() protected against
access to non-existent memory map entries.

We still have an issue with memory map initialization, and probably I've
missed something in decoupling of "do we have memory there" from
pfn_valid().
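
For readers following along: in kernels of this era pfn_valid_within()
expands to pfn_valid() when CONFIG_HOLES_IN_ZONE is set and to a constant 1
otherwise. The following is only a user-space model of that behaviour; the
model_* names and the toy memory-map bounds are assumptions made for
illustration, not kernel code.

```c
#include <stdbool.h>

/* Toy memory map: PFNs in [MAP_START, MAP_END) have a struct page (assumed values). */
#define MAP_START 0x8000UL
#define MAP_END   0xa000UL

/* Stand-in for pfn_valid(): does a memory map entry exist for this PFN? */
bool model_pfn_valid(unsigned long pfn)
{
	return pfn >= MAP_START && pfn < MAP_END;
}

/*
 * Stand-in for pfn_valid_within(): with HOLES_IN_ZONE selected it defers
 * to pfn_valid(); with it de-selected every PFN is assumed to be valid.
 */
bool model_pfn_valid_within(unsigned long pfn, bool holes_in_zone)
{
	return holes_in_zone ? model_pfn_valid(pfn) : true;
}
```

In this model a PFN outside the toy map is rejected only when holes_in_zone
is true, which is the protection that goes away once the option is
de-selected.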

--
Sincerely yours,
Mike.

2021-05-27 01:39:00

by Qian Cai

Subject: Re: Arm64 crash while reading memory sysfs



On 5/26/2021 1:24 PM, Mike Rapoport wrote:
> On Wed, May 26, 2021 at 12:09:14PM +0000, Qian Cai (QUIC) wrote:
>>>
>>> On Tue, May 25, 2021 at 03:25:59PM +0000, Qian Cai (QUIC) wrote:
>>>> Reverting the patchset "arm64: drop pfn_valid_within() and simplify pfn_valid()" [1] from today's linux-next fixed a crash while
>>> reading files under /sys/devices/system/memory.
>
> Does the issue persist if you only revert the latest patch in the series?
> In next-20210525 it would be commit
> 89fb47db72f2 ("arm64-drop-pfn_valid_within-and-simplify-pfn_valid-fix")
> and commit
> dfe215e9bac2 ("arm64: drop pfn_valid_within() and simplify pfn_valid()").

Reverting those two commits alone is enough to fix the issue.

>
>>> Can you please send the beginning of the boot log, up to the
>>> "Memory: xK/yK available ..."
>>> line?
>>
>> [ 0.000000] NUMA: Failed to initialise from firmware
>> [ 0.000000] NUMA: Faking a node at [mem 0x0000000090000000-0x0000009fffffffff]
>> [ 0.000000] NUMA: NODE_DATA [mem 0x9ffefbabc0-0x9ffefbffff]
>> [ 0.000000] Zone ranges:
>> [ 0.000000] Normal [mem 0x0000000090000000-0x0000009fffffffff]
>> [ 0.000000] Movable zone start for each node
>> [ 0.000000] Early memory node ranges
>> [ 0.000000] node 0: [mem 0x0000000090000000-0x0000000091ffffff]
>> [ 0.000000] node 0: [mem 0x0000000092000000-0x00000000928fffff]
>> [ 0.000000] node 0: [mem 0x0000000092900000-0x00000000fffbffff]
>> [ 0.000000] node 0: [mem 0x00000000fffc0000-0x00000000ffffffff]
>> [ 0.000000] node 0: [mem 0x0000000880000000-0x0000000fffffffff]
>> [ 0.000000] node 0: [mem 0x0000008800000000-0x0000009ff5aeffff]
>> [ 0.000000] node 0: [mem 0x0000009ff5af0000-0x0000009ff5b2ffff]
>> [ 0.000000] node 0: [mem 0x0000009ff5b30000-0x0000009ff5baffff]
>> [ 0.000000] node 0: [mem 0x0000009ff5bb0000-0x0000009ff7deffff]
>> [ 0.000000] node 0: [mem 0x0000009ff7df0000-0x0000009ff7e5ffff]
>> [ 0.000000] node 0: [mem 0x0000009ff7e60000-0x0000009ff7ffffff]
>> [ 0.000000] node 0: [mem 0x0000009ff8000000-0x0000009fffffffff]
>> [ 0.000000] Initmem setup node 0 [mem 0x0000000090000000-0x0000009fffffffff]
>> [ 0.000000] mem auto-init: stack:off, heap alloc:on, heap free:off
>> [ 0.000000] Memory: 777216K/133955584K available (17920K kernel code, 118786K rwdata, 4416K rodata, 6080K init, 67276K bss, 17379072K reserved, 0K cma-reserved)
>
> The available and reserved sizes look weird. Can you post the log with
> memblock=debug and mminit_loglevel=4 added to the kernel command line?

http://www.lsbug.org/tmp/dmesg.txt

>
>>>> [1] https://lore.kernel.org/kvmarm/[email protected]/
>>>>
>>>> [ 247.669668][ T1443] kernel BUG at include/linux/mm.h:1383!
>>>> [ 247.675987][ T1443] Internal error: Oops - BUG: 0 [#1] SMP
>>>> [ 247.681472][ T1443] Modules linked in: loop processor efivarfs ip_tables x_tables ext4 mbcache jbd2 dm_mod igb i2c_algo_bit
>>> nvme mlx5_core i2c_core nvme_core firmware_class
>>>> [ 247.696894][ T1443] CPU: 15 PID: 1443 Comm: ranbug Not tainted 5.13.0-rc3-next-20210524+ #11
>>>> [ 247.705326][ T1443] Hardware name: MiTAC RAPTOR EV-883832-X3-0001/RAPTOR, BIOS 1.6 06/28/2020
>>>> [ 247.713842][ T1443] pstate: 60000005 (nZCv daif -PAN -UAO -TCO BTYPE=--)
>>>> [ 247.720536][ T1443] pc : test_pages_in_a_zone+0x23c/0x300
>>>> [ 247.725935][ T1443] lr : test_pages_in_a_zone+0x23c/0x300
>
> Do we know what PFN triggers it? Can you please run with this patch:

Nothing useful showed up with this patch. Yes, I double-checked that the patch was applied.

>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 70620d0dd923..b9d1dd0dae5f 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1443,6 +1443,12 @@ struct zone *test_pages_in_a_zone(unsigned long start_pfn,
> i++;
> if (i == MAX_ORDER_NR_PAGES || pfn + i >= end_pfn)
> continue;
> +
> + if (!pfn_valid(pfn))
> + pr_info("%s: pfn %lx is not valid\n", __func__, pfn);
> + else if (PagePoisoned(pfn_to_page(pfn)))
> + dump_page(pfn_to_page(pfn), "");
> +
> /* Check if we got outside of the zone */
> if (zone && !zone_spans_pfn(zone, pfn + i))
> return NULL;
>
>

2021-05-27 01:44:10

by Andrew Morton

Subject: Re: Arm64 crash while reading memory sysfs

On Wed, 26 May 2021 20:16:14 -0400 Qian Cai <[email protected]> wrote:

>
>
> On 5/26/2021 1:24 PM, Mike Rapoport wrote:
> > On Wed, May 26, 2021 at 12:09:14PM +0000, Qian Cai (QUIC) wrote:
> >>>
> >>> On Tue, May 25, 2021 at 03:25:59PM +0000, Qian Cai (QUIC) wrote:
> >>>> Reverting the patchset "arm64: drop pfn_valid_within() and simplify pfn_valid()" [1] from today's linux-next fixed a crash while
> >>> reading files under /sys/devices/system/memory.
> >
> > Does the issue persist if you only revert the latest patch in the series?
> > In next-20210525 it would be commit
> > 89fb47db72f2 ("arm64-drop-pfn_valid_within-and-simplify-pfn_valid-fix")
> > and commit
> > dfe215e9bac2 ("arm64: drop pfn_valid_within() and simplify pfn_valid()").
>
> Reverting those two commits alone is enough to fix the issue.

(cc Stephen)

Thanks, I'll drop

arm64-drop-pfn_valid_within-and-simplify-pfn_valid.patch
arm64-drop-pfn_valid_within-and-simplify-pfn_valid-fix.patch

2021-05-27 14:58:57

by Stephen Rothwell

Subject: Re: Arm64 crash while reading memory sysfs

Hi Andrew,

On Wed, 26 May 2021 17:31:41 -0700 Andrew Morton <[email protected]> wrote:
>
> On Wed, 26 May 2021 20:16:14 -0400 Qian Cai <[email protected]> wrote:
>
> >
> >
> > On 5/26/2021 1:24 PM, Mike Rapoport wrote:
> > > On Wed, May 26, 2021 at 12:09:14PM +0000, Qian Cai (QUIC) wrote:
> > >>>
> > >>> On Tue, May 25, 2021 at 03:25:59PM +0000, Qian Cai (QUIC) wrote:
> > >>>> Reverting the patchset "arm64: drop pfn_valid_within() and simplify pfn_valid()" [1] from today's linux-next fixed a crash while
> > >>> reading files under /sys/devices/system/memory.
> > >
> > > Does the issue persist if you only revert the latest patch in the series?
> > > In next-20210525 it would be commit
> > > 89fb47db72f2 ("arm64-drop-pfn_valid_within-and-simplify-pfn_valid-fix")
> > > and commit
> > > dfe215e9bac2 ("arm64: drop pfn_valid_within() and simplify pfn_valid()").
> >
> > Reverting those two commits alone is enough to fix the issue.
>
> (cc Stephen)
>
> Thanks, I'll drop
>
> arm64-drop-pfn_valid_within-and-simplify-pfn_valid.patch
> arm64-drop-pfn_valid_within-and-simplify-pfn_valid-fix.patch

Reverted from linux-next for today as well.

--
Cheers,
Stephen Rothwell



2021-05-27 15:14:52

by Mike Rapoport

Subject: Re: Arm64 crash while reading memory sysfs

On Wed, May 26, 2021 at 08:16:14PM -0400, Qian Cai wrote:
>
> On 5/26/2021 1:24 PM, Mike Rapoport wrote:
> > On Wed, May 26, 2021 at 12:09:14PM +0000, Qian Cai (QUIC) wrote:
> >>>
> >>> On Tue, May 25, 2021 at 03:25:59PM +0000, Qian Cai (QUIC) wrote:
> >>>> Reverting the patchset "arm64: drop pfn_valid_within() and simplify pfn_valid()" [1] from today's linux-next fixed a crash while
> >>> reading files under /sys/devices/system/memory.
> >
> > Does the issue persist if you only revert the latest patch in the series?
> > In next-20210525 it would be commit
> > 89fb47db72f2 ("arm64-drop-pfn_valid_within-and-simplify-pfn_valid-fix")
> > and commit
> > dfe215e9bac2 ("arm64: drop pfn_valid_within() and simplify pfn_valid()").
>
> Reverting those two commits alone is enough to fix the issue.
>
> >
> >>> Can you please send the beginning of the boot log, up to the
> >>> "Memory: xK/yK available ..."
> >>> line?
> >>
> >> [ 0.000000] NUMA: Failed to initialise from firmware
> >> [ 0.000000] NUMA: Faking a node at [mem 0x0000000090000000-0x0000009fffffffff]
> >> [ 0.000000] NUMA: NODE_DATA [mem 0x9ffefbabc0-0x9ffefbffff]
> >> [ 0.000000] Zone ranges:
> >> [ 0.000000] Normal [mem 0x0000000090000000-0x0000009fffffffff]
> >> [ 0.000000] Movable zone start for each node
> >> [ 0.000000] Early memory node ranges
> >> [ 0.000000] node 0: [mem 0x0000000090000000-0x0000000091ffffff]
> >> [ 0.000000] node 0: [mem 0x0000000092000000-0x00000000928fffff]
> >> [ 0.000000] node 0: [mem 0x0000000092900000-0x00000000fffbffff]
> >> [ 0.000000] node 0: [mem 0x00000000fffc0000-0x00000000ffffffff]
> >> [ 0.000000] node 0: [mem 0x0000000880000000-0x0000000fffffffff]
> >> [ 0.000000] node 0: [mem 0x0000008800000000-0x0000009ff5aeffff]
> >> [ 0.000000] node 0: [mem 0x0000009ff5af0000-0x0000009ff5b2ffff]
> >> [ 0.000000] node 0: [mem 0x0000009ff5b30000-0x0000009ff5baffff]
> >> [ 0.000000] node 0: [mem 0x0000009ff5bb0000-0x0000009ff7deffff]
> >> [ 0.000000] node 0: [mem 0x0000009ff7df0000-0x0000009ff7e5ffff]
> >> [ 0.000000] node 0: [mem 0x0000009ff7e60000-0x0000009ff7ffffff]
> >> [ 0.000000] node 0: [mem 0x0000009ff8000000-0x0000009fffffffff]
> >> [ 0.000000] Initmem setup node 0 [mem 0x0000000090000000-0x0000009fffffffff]
> >> [ 0.000000] mem auto-init: stack:off, heap alloc:on, heap free:off
> >> [ 0.000000] Memory: 777216K/133955584K available (17920K kernel code, 118786K rwdata, 4416K rodata, 6080K init, 67276K bss, 17379072K reserved, 0K cma-reserved)
> >
> > The available and reserved sizes look weird. Can you post the log with
> > memblock=debug and mminit_loglevel=4 added to the kernel command line?
>
> http://www.lsbug.org/tmp/dmesg.txt

It seems cut in the middle and even then it's too long to be useful.

Let's drop memblock=debug for now and add this instead:

diff --git a/mm/memblock.c b/mm/memblock.c
index afaefa8fc6ab..3f888bef1994 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -2055,6 +2055,8 @@ void __init memblock_free_all(void)
 {
 	unsigned long pages;
 
+	__memblock_dump_all();
+
 	free_unused_memmap();
 	reset_all_zones_managed_pages();

> >>>> [1] https://lore.kernel.org/kvmarm/[email protected]/
> >>>>
> >>>> [ 247.669668][ T1443] kernel BUG at include/linux/mm.h:1383!
> >>>> [ 247.675987][ T1443] Internal error: Oops - BUG: 0 [#1] SMP
> >>>> [ 247.681472][ T1443] Modules linked in: loop processor efivarfs ip_tables x_tables ext4 mbcache jbd2 dm_mod igb i2c_algo_bit nvme mlx5_core i2c_core nvme_core firmware_class
> >>>> [ 247.696894][ T1443] CPU: 15 PID: 1443 Comm: ranbug Not tainted 5.13.0-rc3-next-20210524+ #11
> >>>> [ 247.705326][ T1443] Hardware name: MiTAC RAPTOR EV-883832-X3-0001/RAPTOR, BIOS 1.6 06/28/2020
> >>>> [ 247.713842][ T1443] pstate: 60000005 (nZCv daif -PAN -UAO -TCO BTYPE=--)
> >>>> [ 247.720536][ T1443] pc : test_pages_in_a_zone+0x23c/0x300
> >>>> [ 247.725935][ T1443] lr : test_pages_in_a_zone+0x23c/0x300
> >
> > Do we know what PFN triggers it? Can you please run with this patch:
>
> Nothing useful showed up with this patch. Yes, I double-checked that the patch was applied.

Sorry, I missed that the BUG is apparently triggered for pfn + i. Can
you please try this instead:


diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 70620d0dd923..d0e42e09ad84 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1447,6 +1447,13 @@ struct zone *test_pages_in_a_zone(unsigned long start_pfn,
 			if (zone && !zone_spans_pfn(zone, pfn + i))
 				return NULL;
 			page = pfn_to_page(pfn + i);
+
+			if (!pfn_valid(pfn + i))
+				pr_info("%s: pfn %lx is not valid\n", __func__, pfn + i);
+			else if (PagePoisoned(page))
+				dump_page(page, "");
+
+
 			if (zone && page_zone(page) != zone)
 				return NULL;
 			zone = page_zone(page);

--
Sincerely yours,
Mike.

2021-05-27 18:43:46

by Mike Rapoport

Subject: Re: Arm64 crash while reading memory sysfs

On Thu, May 27, 2021 at 10:33:13AM -0400, Qian Cai wrote:
>
>
> On 5/27/2021 4:56 AM, Mike Rapoport wrote:
> > Let's drop memblock=debug for now and add this instead:
>
> [ 0.000000][ T0] Booting Linux on physical CPU 0x0000000000 [0x503f0002]
> [ 0.000000][ T0] Linux version 5.13.0-rc3-next-20210526+ (root@admin5) (gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0, GNU ld (GNU Binutils for Ubuntu) 2.34) #31 SMP Thu May 27 12:32:40 UTC 2021
> [ 0.000000][ T0] Inode-cache hash table entries: 4194304 (order: 9, 33554432 bytes, linear)
> [ 0.000000][ T0] mem auto-init: stack:off, heap alloc:on, heap free:off
> [ 0.000000][ T0] MEMBLOCK configuration:
> [ 0.000000][ T0] memory size = 0x0000001ff0000000 reserved size = 0x0000000421e33ae8
> [ 0.000000][ T0] memory.cnt = 0xc
> [ 0.000000][ T0] Memory: 777216K/133955584K available (17984K kernel code, 118722K rwdata, 4416K rodata, 6080K init, 67276K bss, 17379072K reserved, 0K cma-reserved)

I still cannot understand where most of the memory disappeared to, but it
seems to be an entirely different issue.

> > Sorry, I've missed that the BUG is apparently triggered for pfn + i. Can
> > you please try this instead:
>
> [ 259.216661][ T1417] test_pages_in_a_zone: pfn 8000 is not valid
> [ 259.226547][ T1417] page:00000000f4aa8c5c is uninitialized and poisoned
> [ 259.226560][ T1417] page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))

Can you please try Anshuman's patch "arm64/mm: Drop HAVE_ARCH_PFN_VALID":

https://lore.kernel.org/lkml/[email protected]

It seems to me that the memblock_is_memory() check in
arm64::pfn_valid() is what makes init_unavailable_range() bail out for
the parts of a section that are not actually populated, and then we hit
VM_BUG_ON_PAGE(PagePoisoned(p)) for these pages.
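
As a rough userspace illustration of this hypothesis (a toy model, not
kernel code): the sketch below assumes 64K pages and 512M sections,
merges the adjacent "Early memory node ranges" from the log into three
ranges, and compares a validity check that also requires the address to
be in memblock.memory with one that only looks at the section. All
model_* names are hypothetical, and the real init_unavailable_range()
works at pageblock granularity rather than per pfn.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed for illustration only: 64K pages and 512M sparsemem sections. */
#define PAGE_SHIFT        16
#define SECTION_SIZE_BITS 29
#define PFN_SECTION_SHIFT (SECTION_SIZE_BITS - PAGE_SHIFT)

struct range { uint64_t start, end; };	/* end is inclusive, as in the log */

/* "Early memory node ranges" from the log, with adjacent ranges merged. */
static const struct range memory[] = {
	{ 0x0000000090000000ULL, 0x00000000ffffffffULL },
	{ 0x0000000880000000ULL, 0x0000000fffffffffULL },
	{ 0x0000008800000000ULL, 0x0000009fffffffffULL },
};
#define NR_RANGES (sizeof(memory) / sizeof(memory[0]))

/* Stand-in for memblock_is_memory(): is addr inside a registered range? */
static bool model_is_memory(uint64_t addr)
{
	for (size_t i = 0; i < NR_RANGES; i++)
		if (addr >= memory[i].start && addr <= memory[i].end)
			return true;
	return false;
}

/* A section counts as present if any registered range overlaps it. */
static bool model_section_present(uint64_t pfn)
{
	uint64_t start = (pfn >> PFN_SECTION_SHIFT) << SECTION_SIZE_BITS;
	uint64_t end = start + (1ULL << SECTION_SIZE_BITS) - 1;

	for (size_t i = 0; i < NR_RANGES; i++)
		if (memory[i].start <= end && memory[i].end >= start)
			return true;
	return false;
}

/* Stricter check: section present AND address in a memory range. */
static bool model_pfn_valid_memblock(uint64_t pfn)
{
	return model_section_present(pfn) &&
	       model_is_memory(pfn << PAGE_SHIFT);
}

/* Section-granular check only. */
static bool model_pfn_valid_section(uint64_t pfn)
{
	return model_section_present(pfn);
}

/* Count pfns in [spfn, epfn) that an init loop guarded by valid() skips. */
static uint64_t model_count_skipped(uint64_t spfn, uint64_t epfn,
				    bool (*valid)(uint64_t))
{
	uint64_t skipped = 0;

	assert(spfn <= epfn);
	for (uint64_t pfn = spfn; pfn < epfn; pfn++)
		if (!valid(pfn))
			skipped++;
	return skipped;
}
```

Under these assumptions, model_count_skipped(0x8000, 0x9000,
model_pfn_valid_memblock) reports all 0x1000 pfns of the hole below
0x90000000 as skipped, while the section-only variant skips none, even
though the hole shares a section with real memory.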

--
Sincerely yours,
Mike.

2021-05-27 19:04:37

by Catalin Marinas

Subject: Re: Arm64 crash while reading memory sysfs

On Thu, May 27, 2021 at 07:22:00PM +0300, Mike Rapoport wrote:
> On Thu, May 27, 2021 at 10:33:13AM -0400, Qian Cai wrote:
> > On 5/27/2021 4:56 AM, Mike Rapoport wrote:
> > > Let's drop memblock=debug for now and add this instead:
> >
> > [ 0.000000][ T0] Booting Linux on physical CPU 0x0000000000 [0x503f0002]
> > [ 0.000000][ T0] Linux version 5.13.0-rc3-next-20210526+ (root@admin5) (gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0, GNU ld (GNU Binutils for Ubuntu) 2.34) #31 SMP Thu May 27 12:32:40 UTC 2021
> > [ 0.000000][ T0] Inode-cache hash table entries: 4194304 (order: 9, 33554432 bytes, linear)
> > [ 0.000000][ T0] mem auto-init: stack:off, heap alloc:on, heap free:off
> > [ 0.000000][ T0] MEMBLOCK configuration:
> > [ 0.000000][ T0] memory size = 0x0000001ff0000000 reserved size = 0x0000000421e33ae8
> > [ 0.000000][ T0] memory.cnt = 0xc
> > [ 0.000000][ T0] Memory: 777216K/133955584K available (17984K kernel code, 118722K rwdata, 4416K rodata, 6080K init, 67276K bss, 17379072K reserved, 0K cma-reserved)
>
> I still cannot understand where most of the memory disappeared, but it
> seems entirely different issue.
>
> > > Sorry, I've missed that the BUG is apparently triggered for pfn + i. Can
> > > you please try this instead:
> >
> > [ 259.216661][ T1417] test_pages_in_a_zone: pfn 8000 is not valid
> > [ 259.226547][ T1417] page:00000000f4aa8c5c is uninitialized and poisoned
> > [ 259.226560][ T1417] page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
>
> Can you please try Anshuman's patch "arm64/mm: Drop HAVE_ARCH_PFN_VALID":
>
> https://lore.kernel.org/lkml/[email protected]
>
> It seems to me that the check for memblock_is_memory() in
> arm64::pfn_valid() is what makes init_unavailable_range() to bail out for
> section parts that are not actually populated and then we have
> VM_BUG_ON_PAGE(PagePoisoned(p)) for these pages.

I acked Anshuman's patch, I think they all need to go in together.

--
Catalin

2021-05-27 21:17:12

by Qian Cai

Subject: Re: Arm64 crash while reading memory sysfs



On 5/27/2021 4:56 AM, Mike Rapoport wrote:
> Let's drop memblock=debug for now and add this instead:

[ 0.000000][ T0] Booting Linux on physical CPU 0x0000000000 [0x503f0002]
[ 0.000000][ T0] Linux version 5.13.0-rc3-next-20210526+ (root@admin5) (gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0, GNU ld (GNU Binutils for Ubuntu) 2.34) #31 SMP Thu May 27 12:32:40 UTC 2021
[ 0.000000][ T0] efi: EFI v2.70 by American Megatrends
[ 0.000000][ T0] efi: ACPI 2.0=0x9ff5b40000 SMBIOS 3.0=0x9ff686fd98 ESRT=0x9ff1d18298 MEMRESERVE=0x9fe6dbed98
[ 0.000000][ T0] esrt: Reserving ESRT space from 0x0000009ff1d18298 to 0x0000009ff1d182f8.
[ 0.000000][ T0] ACPI: Early table checksum verification disabled
[ 0.000000][ T0] ACPI: RSDP 0x0000009FF5B40000 000024 (v02 ALASKA)
[ 0.000000][ T0] ACPI: XSDT 0x0000009FF5B40028 000094 (v01 ALASKA A M I 01072009 AMI 00010013)
[ 0.000000][ T0] ACPI: FACP 0x0000009FF5B400C0 000114 (v06 Ampere eMAG 00000003 INTL 20190509)
[ 0.000000][ T0] ACPI: DSDT 0x0000009FF5B401D8 00765A (v05 ALASKA A M I 00000001 INTL 20190509)
[ 0.000000][ T0] ACPI: FIDT 0x0000009FF5B47838 00009C (v01 ALASKA A M I 01072009 AMI 00010013)
[ 0.000000][ T0] ACPI: DBG2 0x0000009FF5B478D8 000061 (v00 Ampere eMAG 00000000 INTL 20190509)
[ 0.000000][ T0] ACPI: GTDT 0x0000009FF5B47940 000108 (v02 Ampere eMAG 00000001 INTL 20190509)
[ 0.000000][ T0] ACPI: IORT 0x0000009FF5B47A48 000BCC (v00 Ampere eMAG 00000000 INTL 20190509)
[ 0.000000][ T0] ACPI: MCFG 0x0000009FF5B48618 0000AC (v01 Ampere eMAG 00000001 INTL 20190509)
[ 0.000000][ T0] ACPI: SSDT 0x0000009FF5B486C8 00002D (v02 Ampere eMAG 00000001 INTL 20190509)
[ 0.000000][ T0] ACPI: SPMI 0x0000009FF5B486F8 000041 (v05 ALASKA A M I 00000000 AMI. 00000000)
[ 0.000000][ T0] ACPI: APIC 0x0000009FF5B48740 000A68 (v04 Ampere eMAG 00000004 01000013)
[ 0.000000][ T0] ACPI: PCCT 0x0000009FF5B491A8 0005D0 (v01 Ampere eMAG 00000003 01000013)
[ 0.000000][ T0] ACPI: BERT 0x0000009FF5B49778 000030 (v01 Ampere eMAG 00000003 INTL 20190509)
[ 0.000000][ T0] ACPI: HEST 0x0000009FF5B497A8 000328 (v01 Ampere eMAG 00000003 INTL 20190509)
[ 0.000000][ T0] ACPI: SPCR 0x0000009FF5B49AD0 000050 (v02 A M I APTIO V 01072009 AMI. 0005000D)
[ 0.000000][ T0] ACPI: PPTT 0x0000009FF5B49B20 000CB8 (v01 Ampere eMAG 00000003 01000013)
[ 0.000000][ T0] ACPI: SPCR: console: pl011,mmio32,0x12600000,115200
[ 0.000000][ T0] earlycon: pl11 at MMIO32 0x0000000012600000 (options '115200')
[ 0.000000][ T0] printk: bootconsole [pl11] enabled
[ 0.000000][ T0] NUMA: Failed to initialise from firmware
[ 0.000000][ T0] NUMA: Faking a node at [mem 0x0000000090000000-0x0000009fffffffff]
[ 0.000000][ T0] NUMA: NODE_DATA [mem 0x9ffefbabc0-0x9ffefbffff]
[ 0.000000][ T0] Zone ranges:
[ 0.000000][ T0] Normal [mem 0x0000000090000000-0x0000009fffffffff]
[ 0.000000][ T0] Movable zone start for each node
[ 0.000000][ T0] Early memory node ranges
[ 0.000000][ T0] node 0: [mem 0x0000000090000000-0x0000000091ffffff]
[ 0.000000][ T0] node 0: [mem 0x0000000092000000-0x00000000928fffff]
[ 0.000000][ T0] node 0: [mem 0x0000000092900000-0x00000000fffbffff]
[ 0.000000][ T0] node 0: [mem 0x00000000fffc0000-0x00000000ffffffff]
[ 0.000000][ T0] node 0: [mem 0x0000000880000000-0x0000000fffffffff]
[ 0.000000][ T0] node 0: [mem 0x0000008800000000-0x0000009ff5aeffff]
[ 0.000000][ T0] node 0: [mem 0x0000009ff5af0000-0x0000009ff5b2ffff]
[ 0.000000][ T0] node 0: [mem 0x0000009ff5b30000-0x0000009ff5baffff]
[ 0.000000][ T0] node 0: [mem 0x0000009ff5bb0000-0x0000009ff7deffff]
[ 0.000000][ T0] node 0: [mem 0x0000009ff7df0000-0x0000009ff7e5ffff]
[ 0.000000][ T0] node 0: [mem 0x0000009ff7e60000-0x0000009ff7ffffff]
[ 0.000000][ T0] node 0: [mem 0x0000009ff8000000-0x0000009fffffffff]
[ 0.000000][ T0] Initmem setup node 0 [mem 0x0000000090000000-0x0000009fffffffff]
[ 0.000000][ T0] kasan: KernelAddressSanitizer initialized
[ 0.000000][ T0] psci: probing for conduit method from ACPI.
[ 0.000000][ T0] psci: PSCIv1.0 detected in firmware.
[ 0.000000][ T0] psci: Using standard PSCI v0.2 function IDs
[ 0.000000][ T0] psci: MIGRATE_INFO_TYPE not supported.
[ 0.000000][ T0] psci: SMC Calling Convention v65535.65535
[ 0.000000][ T0] ACPI: SRAT not present
[ 0.000000][ T0] percpu: Embedded 10 pages/cpu s584592 r8192 d62576 u655360
[ 0.000000][ T0] Detected PIPT I-cache on CPU0
[ 0.000000][ T0] CPU features: detected: GIC system register CPU interface
[ 0.000000][ T0] CPU features: detected: Spectre-v2
[ 0.000000][ T0] CPU features: detected: Spectre-v4
[ 0.000000][ T0] CPU features: detected: Kernel page table isolation (KPTI)
[ 0.000000][ T0] Built 1 zonelists, mobility grouping on. Total pages: 2091012
[ 0.000000][ T0] Policy zone: Normal
[ 0.000000][ T0] Kernel command line: BOOT_IMAGE=/vmlinuz-5.13.0-rc3-next-20210526+ root=/dev/mapper/ubuntu--vg-ubuntu--lv ro cma=1024M iommu.passthrough=1 earlycon mminit_loglevel=4
[ 0.000000][ T0] Unknown command line parameters: BOOT_IMAGE=/vmlinuz-5.13.0-rc3-next-20210526+ cma=1024M mminit_loglevel=4
[ 0.000000][ T0] Dentry cache hash table entries: 8388608 (order: 10, 67108864 bytes, linear)
[ 0.000000][ T0] Inode-cache hash table entries: 4194304 (order: 9, 33554432 bytes, linear)
[ 0.000000][ T0] mem auto-init: stack:off, heap alloc:on, heap free:off
[ 0.000000][ T0] MEMBLOCK configuration:
[ 0.000000][ T0] memory size = 0x0000001ff0000000 reserved size = 0x0000000421e33ae8
[ 0.000000][ T0] memory.cnt = 0xc
[ 0.000000][ T0] memory[0x0] [0x0000000090000000-0x0000000091ffffff], 0x0000000002000000 bytes on node 0 flags: 0x0
[ 0.000000][ T0] memory[0x1] [0x0000000092000000-0x00000000928fffff], 0x0000000000900000 bytes on node 0 flags: 0x4
[ 0.000000][ T0] memory[0x2] [0x0000000092900000-0x00000000fffbffff], 0x000000006d6c0000 bytes on node 0 flags: 0x0
[ 0.000000][ T0] memory[0x3] [0x00000000fffc0000-0x00000000ffffffff], 0x0000000000040000 bytes on node 0 flags: 0x4
[ 0.000000][ T0] memory[0x4] [0x0000000880000000-0x0000000fffffffff], 0x0000000780000000 bytes on node 0 flags: 0x0
[ 0.000000][ T0] memory[0x5] [0x0000008800000000-0x0000009ff5aeffff], 0x00000017f5af0000 bytes on node 0 flags: 0x0
[ 0.000000][ T0] memory[0x6] [0x0000009ff5af0000-0x0000009ff5b2ffff], 0x0000000000040000 bytes on node 0 flags: 0x4
[ 0.000000][ T0] memory[0x7] [0x0000009ff5b30000-0x0000009ff5baffff], 0x0000000000080000 bytes on node 0 flags: 0x0
[ 0.000000][ T0] memory[0x8] [0x0000009ff5bb0000-0x0000009ff7deffff], 0x0000000002240000 bytes on node 0 flags: 0x4
[ 0.000000][ T0] memory[0x9] [0x0000009ff7df0000-0x0000009ff7e5ffff], 0x0000000000070000 bytes on node 0 flags: 0x0
[ 0.000000][ T0] memory[0xa] [0x0000009ff7e60000-0x0000009ff7ffffff], 0x00000000001a0000 bytes on node 0 flags: 0x4
[ 0.000000][ T0] memory[0xb] [0x0000009ff8000000-0x0000009fffffffff], 0x0000000008000000 bytes on node 0 flags: 0x0
[ 0.000000][ T0] reserved.cnt = 0x16
[ 0.000000][ T0] reserved[0x0] [0x000000088b7c0000-0x000000088fffffff], 0x0000000004840000 bytes flags: 0x0
[ 0.000000][ T0] reserved[0x1] [0x0000009be0000000-0x0000009be07fffff], 0x0000000000800000 bytes flags: 0x0
[ 0.000000][ T0] reserved[0x2] [0x0000009be0da0000-0x0000009be819ffff], 0x0000000007400000 bytes flags: 0x0
[ 0.000000][ T0] reserved[0x3] [0x0000009be81c0000-0x0000009f6c800255], 0x0000000384640256 bytes flags: 0x0
[ 0.000000][ T0] reserved[0x4] [0x0000009f6c810000-0x0000009fe6daffff], 0x000000007a5a0000 bytes flags: 0x0
[ 0.000000][ T0] reserved[0x5] [0x0000009fe6dbed98-0x0000009fe6dbeda7], 0x0000000000000010 bytes flags: 0x0
[ 0.000000][ T0] reserved[0x6] [0x0000009fe6dc0000-0x0000009ff1d0ffff], 0x000000000af50000 bytes flags: 0x0
[ 0.000000][ T0] reserved[0x7] [0x0000009ff1d18298-0x0000009ff1d182f7], 0x0000000000000060 bytes flags: 0x0
[ 0.000000][ T0] reserved[0x8] [0x0000009ff1d1c600-0x0000009ff1d1c61f], 0x0000000000000020 bytes flags: 0x0
[ 0.000000][ T0] reserved[0x9] [0x0000009ff1d1c640-0x0000009ff1d1ce47], 0x0000000000000808 bytes flags: 0x0
[ 0.000000][ T0] reserved[0xa] [0x0000009ff1d1ce80-0x0000009ff1d1d70f], 0x0000000000000890 bytes flags: 0x0
[ 0.000000][ T0] reserved[0xb] [0x0000009ff1d1d740-0x0000009ff1d1e787], 0x0000000000001048 bytes flags: 0x0
[ 0.000000][ T0] reserved[0xc] [0x0000009ff1d1e7c0-0x0000009ff1d1f84f], 0x0000000000001090 bytes flags: 0x0
[ 0.000000][ T0] reserved[0xd] [0x0000009ff1d1f880-0x0000009ff1d1fb1f], 0x00000000000002a0 bytes flags: 0x0
[ 0.000000][ T0] reserved[0xe] [0x0000009ff1d1fb40-0x0000009ff1d1fcc7], 0x0000000000000188 bytes flags: 0x0
[ 0.000000][ T0] reserved[0xf] [0x0000009ff1d1fd00-0x0000009ff5aeffff], 0x0000000003dd0300 bytes flags: 0x0
[ 0.000000][ T0] reserved[0x10] [0x0000009ff5b30000-0x0000009ff5baffff], 0x0000000000080000 bytes flags: 0x0
[ 0.000000][ T0] reserved[0x11] [0x0000009ff7df0000-0x0000009ff7e5ffff], 0x0000000000070000 bytes flags: 0x0
[ 0.000000][ T0] reserved[0x12] [0x0000009ff8000000-0x0000009ffefa0007], 0x0000000006fa0008 bytes flags: 0x0
[ 0.000000][ T0] reserved[0x13] [0x0000009ffefa0040-0x0000009ffefa00d0], 0x0000000000000091 bytes flags: 0x0
[ 0.000000][ T0] reserved[0x14] [0x0000009ffefa0100-0x0000009ffefa0190], 0x0000000000000091 bytes flags: 0x0
[ 0.000000][ T0] reserved[0x15] [0x0000009ffefa01c0-0x0000009fffffffff], 0x000000000105fe40 bytes flags: 0x0
[ 0.000000][ T0] Memory: 777216K/133955584K available (17984K kernel code, 118722K rwdata, 4416K rodata, 6080K init, 67276K bss, 17379072K reserved, 0K cma-reserved)

> Sorry, I've missed that the BUG is apparently triggered for pfn + i. Can
> you please try this instead:

[ 259.216661][ T1417] test_pages_in_a_zone: pfn 8000 is not valid
[ 259.226547][ T1417] page:00000000f4aa8c5c is uninitialized and poisoned
[ 259.226560][ T1417] page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
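
For reference, the reported pfn (printed with %lx, so hexadecimal) can
be mapped back to a physical address by hand. A small sketch, assuming
64K pages (the percpu line above shows 10 pages per 655360-byte unit)
and using the memory[] entries from the memblock dump above; the
function names are made up for this example:

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 16	/* assumed: 64K pages */

struct region { uint64_t base, size; };

/* memory[0x0]..memory[0xb] as printed by the memblock dump. */
static const struct region memblock_memory[] = {
	{ 0x0000000090000000ULL, 0x0000000002000000ULL },
	{ 0x0000000092000000ULL, 0x0000000000900000ULL },
	{ 0x0000000092900000ULL, 0x000000006d6c0000ULL },
	{ 0x00000000fffc0000ULL, 0x0000000000040000ULL },
	{ 0x0000000880000000ULL, 0x0000000780000000ULL },
	{ 0x0000008800000000ULL, 0x00000017f5af0000ULL },
	{ 0x0000009ff5af0000ULL, 0x0000000000040000ULL },
	{ 0x0000009ff5b30000ULL, 0x0000000000080000ULL },
	{ 0x0000009ff5bb0000ULL, 0x0000000002240000ULL },
	{ 0x0000009ff7df0000ULL, 0x0000000000070000ULL },
	{ 0x0000009ff7e60000ULL, 0x00000000001a0000ULL },
	{ 0x0000009ff8000000ULL, 0x0000000008000000ULL },
};
#define NR_REGIONS (sizeof(memblock_memory) / sizeof(memblock_memory[0]))

/* Convert a page frame number to the physical address of its first byte. */
static uint64_t example_pfn_to_phys(uint64_t pfn)
{
	return pfn << PAGE_SHIFT;
}

/* Return the index of the memory[] entry containing addr, or -1 if none. */
static int example_find_region(uint64_t addr)
{
	for (size_t i = 0; i < NR_REGIONS; i++) {
		const struct region *r = &memblock_memory[i];

		if (addr >= r->base && addr - r->base < r->size)
			return (int)i;
	}
	return -1;
}

/* Sum of all memory[] sizes in bytes. */
static uint64_t example_total_bytes(void)
{
	uint64_t total = 0;

	for (size_t i = 0; i < NR_REGIONS; i++)
		total += memblock_memory[i].size;
	return total;
}
```

With these numbers, pfn 0x8000 maps to 0x80000000, which lies below the
first memory[] entry at 0x90000000, so the lookup finds no region. The
sizes add up to 0x1ff0000000 bytes, which matches both the "memory size"
line and the 133955584K total in the "Memory:" line.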

2021-05-27 23:24:23

by Qian Cai

Subject: Re: Arm64 crash while reading memory sysfs



On 5/27/2021 12:22 PM, Mike Rapoport wrote:
> On Thu, May 27, 2021 at 10:33:13AM -0400, Qian Cai wrote:
>>
>>
>> On 5/27/2021 4:56 AM, Mike Rapoport wrote:
>>> Let's drop memblock=debug for now and add this instead:
>>
>> [ 0.000000][ T0] Booting Linux on physical CPU 0x0000000000 [0x503f0002]
>> [ 0.000000][ T0] Linux version 5.13.0-rc3-next-20210526+ (root@admin5) (gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0, GNU ld (GNU Binutils for Ubuntu) 2.34) #31 SMP Thu May 27 12:32:40 UTC 2021
>> [ 0.000000][ T0] Inode-cache hash table entries: 4194304 (order: 9, 33554432 bytes, linear)
>> [ 0.000000][ T0] mem auto-init: stack:off, heap alloc:on, heap free:off
>> [ 0.000000][ T0] MEMBLOCK configuration:
>> [ 0.000000][ T0] memory size = 0x0000001ff0000000 reserved size = 0x0000000421e33ae8
>> [ 0.000000][ T0] memory.cnt = 0xc
>> [ 0.000000][ T0] Memory: 777216K/133955584K available (17984K kernel code, 118722K rwdata, 4416K rodata, 6080K init, 67276K bss, 17379072K reserved, 0K cma-reserved)
>
> I still cannot understand where most of the memory disappeared, but it
> seems entirely different issue.

Interesting, it seems that memory did come back after booting.

# cat /proc/meminfo
MemTotal: 116656448 kB
MemFree: 110464000 kB
MemAvailable: 101919872 kB
Buffers: 16320 kB
Cached: 118912 kB
SwapCached: 3136 kB
Active: 63360 kB
Inactive: 199936 kB
Active(anon): 9792 kB
Inactive(anon): 132480 kB
Active(file): 53568 kB
Inactive(file): 67456 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 8388544 kB
SwapFree: 8344704 kB
Dirty: 0 kB
Writeback: 0 kB
AnonPages: 125056 kB
Mapped: 44992 kB
Shmem: 14784 kB
KReclaimable: 92160 kB
Slab: 4943424 kB
SReclaimable: 92160 kB
SUnreclaim: 4851264 kB
KernelStack: 24832 kB
PageTables: 10240 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 66716736 kB
Committed_AS: 708096 kB
VmallocTotal: 133143461888 kB
VmallocUsed: 49600 kB
VmallocChunk: 0 kB
Percpu: 45056 kB
HardwareCorrupted: 0 kB
AnonHugePages: 0 kB
ShmemHugePages: 0 kB
ShmemPmdMapped: 0 kB
FileHugePages: 0 kB
FilePmdMapped: 0 kB
CmaTotal: 0 kB
CmaFree: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 524288 kB
Hugetlb: 0 kB

>
>>> Sorry, I've missed that the BUG is apparently triggered for pfn + i. Can
>>> you please try this instead:
>>
>> [ 259.216661][ T1417] test_pages_in_a_zone: pfn 8000 is not valid
>> [ 259.226547][ T1417] page:00000000f4aa8c5c is uninitialized and poisoned
>> [ 259.226560][ T1417] page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
>
> Can you please try Anshuman's patch "arm64/mm: Drop HAVE_ARCH_PFN_VALID":
>
> https://lore.kernel.org/lkml/[email protected]
>
> It seems to me that the check for memblock_is_memory() in
> arm64::pfn_valid() is what makes init_unavailable_range() to bail out for
> section parts that are not actually populated and then we have
> VM_BUG_ON_PAGE(PagePoisoned(p)) for these pages.

That patch fixed it.

2021-05-27 23:49:41

by David Hildenbrand

Subject: Re: Arm64 crash while reading memory sysfs

>> [ 259.216661][ T1417] test_pages_in_a_zone: pfn 8000 is not valid
>> [ 259.226547][ T1417] page:00000000f4aa8c5c is uninitialized and poisoned
>> [ 259.226560][ T1417] page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
>
> Can you please try Anshuman's patch "arm64/mm: Drop HAVE_ARCH_PFN_VALID":
>
> https://lore.kernel.org/lkml/[email protected]
>
> It seems to me that the check for memblock_is_memory() in
> arm64::pfn_valid() is what makes init_unavailable_range() to bail out for
> section parts that are not actually populated and then we have
> VM_BUG_ON_PAGE(PagePoisoned(p)) for these pages.
>

Oh, that makes sense to me.

--
Thanks,

David / dhildenb

2021-05-28 00:18:55

by Andrew Morton

Subject: Re: Arm64 crash while reading memory sysfs

On Thu, 27 May 2021 18:50:48 +0100 Catalin Marinas <[email protected]> wrote:

> > Can you please try Anshuman's patch "arm64/mm: Drop HAVE_ARCH_PFN_VALID":
> >
> > https://lore.kernel.org/lkml/[email protected]
> >
> > It seems to me that the check for memblock_is_memory() in
> > arm64::pfn_valid() is what makes init_unavailable_range() to bail out for
> > section parts that are not actually populated and then we have
> > VM_BUG_ON_PAGE(PagePoisoned(p)) for these pages.
>
> I acked Anshuman's patch, I think they all need to go in together.

That's neat. Specifically which patches are we referring to here?

2021-05-28 06:21:53

by Mike Rapoport

Subject: Re: Arm64 crash while reading memory sysfs

On Thu, May 27, 2021 at 03:56:44PM -0700, Andrew Morton wrote:
> On Thu, 27 May 2021 18:50:48 +0100 Catalin Marinas <[email protected]> wrote:
>
> > > Can you please try Anshuman's patch "arm64/mm: Drop HAVE_ARCH_PFN_VALID":
> > >
> > > https://lore.kernel.org/lkml/[email protected]
> > >
> > > It seems to me that the check for memblock_is_memory() in
> > > arm64::pfn_valid() is what makes init_unavailable_range() to bail out for
> > > section parts that are not actually populated and then we have
> > > VM_BUG_ON_PAGE(PagePoisoned(p)) for these pages.
> >
> > I acked Anshuman's patch, I think they all need to go in together.
>
> That's neat. Specifically which patches are we referring to here?

arm64: drop pfn_valid_within() and simplify pfn_valid():
https://lore.kernel.org/lkml/[email protected]

arm64/mm: Drop HAVE_ARCH_PFN_VALID:
https://lore.kernel.org/lkml/[email protected]

--
Sincerely yours,
Mike.

2021-06-08 07:07:33

by Anshuman Khandual

Subject: Re: Arm64 crash while reading memory sysfs



On 5/28/21 10:43 AM, Mike Rapoport wrote:
> On Thu, May 27, 2021 at 03:56:44PM -0700, Andrew Morton wrote:
>> On Thu, 27 May 2021 18:50:48 +0100 Catalin Marinas <[email protected]> wrote:
>>
>>>> Can you please try Anshuman's patch "arm64/mm: Drop HAVE_ARCH_PFN_VALID":
>>>>
>>>> https://lore.kernel.org/lkml/[email protected]
>>>>
>>>> It seems to me that the check for memblock_is_memory() in
>>>> arm64::pfn_valid() is what makes init_unavailable_range() to bail out for
>>>> section parts that are not actually populated and then we have
>>>> VM_BUG_ON_PAGE(PagePoisoned(p)) for these pages.
>>>
>>> I acked Anshuman's patch, I think they all need to go in together.
>>
>> That's neat. Specifically which patches are we referring to here?
>
> arm64: drop pfn_valid_within() and simplify pfn_valid():
> https://lore.kernel.org/lkml/[email protected]
>
> arm64/mm: Drop HAVE_ARCH_PFN_VALID:
> https://lore.kernel.org/lkml/[email protected]

I don't see the above patch (which drops HAVE_ARCH_PFN_VALID on arm64) in linux-next,
i.e. next-20210607. I might have missed some earlier context here, but don't we want
to fall back on the generic pfn_valid() after Mike's series?

2021-06-14 08:29:44

by Mike Rapoport

Subject: Re: Arm64 crash while reading memory sysfs

On Tue, Jun 08, 2021 at 12:36:21PM +0530, Anshuman Khandual wrote:
>
>
> On 5/28/21 10:43 AM, Mike Rapoport wrote:
> > On Thu, May 27, 2021 at 03:56:44PM -0700, Andrew Morton wrote:
> >> On Thu, 27 May 2021 18:50:48 +0100 Catalin Marinas <[email protected]> wrote:
> >>
> >>>> Can you please try Anshuman's patch "arm64/mm: Drop HAVE_ARCH_PFN_VALID":
> >>>>
> >>>> https://lore.kernel.org/lkml/[email protected]
> >>>>
> >>>> It seems to me that the check for memblock_is_memory() in
> >>>> arm64::pfn_valid() is what makes init_unavailable_range() to bail out for
> >>>> section parts that are not actually populated and then we have
> >>>> VM_BUG_ON_PAGE(PagePoisoned(p)) for these pages.
> >>>
> >>> I acked Anshuman's patch, I think they all need to go in together.
> >>
> >> That's neat. Specifically which patches are we referring to here?
> >
> > arm64: drop pfn_valid_within() and simplify pfn_valid():
> > https://lore.kernel.org/lkml/[email protected]
> >
> > arm64/mm: Drop HAVE_ARCH_PFN_VALID:
> > https://lore.kernel.org/lkml/[email protected]
>
> I dont see the above patch (which drops HAVE_ARCH_PFN_VALID on arm64) on linux-next
> i.e. next-20210607. I might have missed some earlier context here but do not we want
> to fallback on generic pfn_valid() after Mike's series ?

Andrew,

Can you please pick the two patches above?

--
Sincerely yours,
Mike.

2021-06-15 00:16:27

by Andrew Morton

Subject: Re: Arm64 crash while reading memory sysfs

On Mon, 14 Jun 2021 11:25:54 +0300 Mike Rapoport <[email protected]> wrote:

> On Tue, Jun 08, 2021 at 12:36:21PM +0530, Anshuman Khandual wrote:
> >
> >
> > On 5/28/21 10:43 AM, Mike Rapoport wrote:
> > > On Thu, May 27, 2021 at 03:56:44PM -0700, Andrew Morton wrote:
> > >> On Thu, 27 May 2021 18:50:48 +0100 Catalin Marinas <[email protected]> wrote:
> > >>
> > >>>> Can you please try Anshuman's patch "arm64/mm: Drop HAVE_ARCH_PFN_VALID":
> > >>>>
> > >>>> https://lore.kernel.org/lkml/[email protected]
> > >>>>
> > >>>> It seems to me that the check for memblock_is_memory() in
> > >>>> arm64::pfn_valid() is what makes init_unavailable_range() to bail out for
> > >>>> section parts that are not actually populated and then we have
> > >>>> VM_BUG_ON_PAGE(PagePoisoned(p)) for these pages.
> > >>>
> > >>> I acked Anshuman's patch, I think they all need to go in together.
> > >>
> > >> That's neat. Specifically which patches are we referring to here?
> > >
> > > arm64: drop pfn_valid_within() and simplify pfn_valid():
> > > https://lore.kernel.org/lkml/[email protected]
> > >
> > > arm64/mm: Drop HAVE_ARCH_PFN_VALID:
> > > https://lore.kernel.org/lkml/[email protected]
> >
> > I dont see the above patch (which drops HAVE_ARCH_PFN_VALID on arm64) on linux-next
> > i.e. next-20210607. I might have missed some earlier context here but do not we want
> > to fallback on generic pfn_valid() after Mike's series ?
>
> Andrew,
>
> Can you please pick the two patches above?

I already had

include-linux-mmzoneh-add-documentation-for-pfn_valid.patch
memblock-update-initialization-of-reserved-pages.patch
arm64-decouple-check-whether-pfn-is-in-linear-map-from-pfn_valid.patch
arm64-drop-pfn_valid_within-and-simplify-pfn_valid.patch

and I just added

arm64-mm-drop-have_arch_pfn_valid.patch

so I think we're all good now?

and I don't think any of this is needed in 5.13 or -stable, correct?

I still have question marks over

https://lkml.kernel.org/r/[email protected] and
https://lkml.kernel.org/r/[email protected]

Is this all OK now?

2021-06-15 06:08:50

by Mike Rapoport

Subject: Re: Arm64 crash while reading memory sysfs

On Mon, Jun 14, 2021 at 05:13:51PM -0700, Andrew Morton wrote:
> On Mon, 14 Jun 2021 11:25:54 +0300 Mike Rapoport <[email protected]> wrote:
>
> > On Tue, Jun 08, 2021 at 12:36:21PM +0530, Anshuman Khandual wrote:
> > >
> > >
> > > On 5/28/21 10:43 AM, Mike Rapoport wrote:
> > > > On Thu, May 27, 2021 at 03:56:44PM -0700, Andrew Morton wrote:
> > > >> On Thu, 27 May 2021 18:50:48 +0100 Catalin Marinas <[email protected]> wrote:
> > > >>
> > > >>>> Can you please try Anshuman's patch "arm64/mm: Drop HAVE_ARCH_PFN_VALID":
> > > >>>>
> > > >>>> https://lore.kernel.org/lkml/[email protected]
> > > >>>>
> > > >>>> It seems to me that the check for memblock_is_memory() in
> > > >>>> arm64::pfn_valid() is what makes init_unavailable_range() to bail out for
> > > >>>> section parts that are not actually populated and then we have
> > > >>>> VM_BUG_ON_PAGE(PagePoisoned(p)) for these pages.
> > > >>>
> > > >>> I acked Anshuman's patch, I think they all need to go in together.
> > > >>
> > > >> That's neat. Specifically which patches are we referring to here?
> > > >
> > > > arm64: drop pfn_valid_within() and simplify pfn_valid():
> > > > https://lore.kernel.org/lkml/[email protected]
> > > >
> > > > arm64/mm: Drop HAVE_ARCH_PFN_VALID:
> > > > https://lore.kernel.org/lkml/[email protected]
> > >
> > > I dont see the above patch (which drops HAVE_ARCH_PFN_VALID on arm64) on linux-next
> > > i.e. next-20210607. I might have missed some earlier context here but do not we want
> > > to fallback on generic pfn_valid() after Mike's series ?
> >
> > Andrew,
> >
> > Can you please pick the two patches above?
>
> I already had
>
> include-linux-mmzoneh-add-documentation-for-pfn_valid.patch
> memblock-update-initialization-of-reserved-pages.patch
> arm64-decouple-check-whether-pfn-is-in-linear-map-from-pfn_valid.patch
> arm64-drop-pfn_valid_within-and-simplify-pfn_valid.patch
>
> and I just added
>
> arm64-mm-drop-have_arch_pfn_valid.patch
>
> so I think we're all good now?

Yes.

> and I don't think any of this is needed in 5.13 or -stable, correct?

Right.

> I still have question marks over
>
> https://lkml.kernel.org/r/[email protected] and
> https://lkml.kernel.org/r/[email protected]
>
> Is this all OK now?

Yes, it is.

--
Sincerely yours,
Mike.