2024-05-15 17:49:15

by Ard Biesheuvel

Subject: Re: Regression in 6.1.81: Missing memory in pmem device

(cc Kees)

On Wed, 15 May 2024 at 19:32, Chaney, Ben <[email protected]> wrote:
>
> Hello,
> I encountered an issue when upgrading to 6.1.89 from 6.1.77. This upgrade caused a breakage in emulated persistent memory. Significant amounts of memory are missing from a pmem device:
>
> fdisk -l /dev/pmem*
> Disk /dev/pmem0: 355.9 GiB, 382117871616 bytes, 746323968 sectors
> Units: sectors of 1 * 512 = 512 bytes
> Sector size (logical/physical): 512 bytes / 4096 bytes
> I/O size (minimum/optimal): 4096 bytes / 4096 bytes
>
> Disk /dev/pmem1: 25.38 GiB, 27246198784 bytes, 53215232 sectors
> Units: sectors of 1 * 512 = 512 bytes
> Sector size (logical/physical): 512 bytes / 4096 bytes
> I/O size (minimum/optimal): 4096 bytes / 4096 bytes
>
> The memmap parameter that created these pmem devices is “memmap=364416M!28672M,367488M!419840M”, which should cause a much larger amount of memory to be allocated to /dev/pmem1. The amount of missing memory and the device it is missing from is randomized on each reboot. There is some amount of memory missing in almost all cases, but not 100% of the time. Notably, the memory that is missing from these devices is not reclaimed by the system for general use. This system in question has 768GB of memory split evenly across two NUMA nodes.
>
> When the error occurs, there are also the following error messages showing up in dmesg:
>
> [ 5.318317] nd_pmem namespace1.0: [mem 0x5c2042c000-0x5ff7ffffff flags 0x200] misaligned, unable to map
> [ 5.335073] nd_pmem: probe of namespace1.0 failed with error -95
>
> Bisection implicates 2dfaeac3f38e4e550d215204eedd97a061fdc118 as the patch that first caused the issue. I believe the cause of the issue is that the EFI stub is randomizing the location of the decompressed kernel without accounting for the memory map, and it is clobbering some of the memory that has been reserved for pmem.
>

Does using 'nokaslr' on the kernel command line work around this?

I think in this particular case, we could just disable physical KASLR
(but retain virtual KASLR) if memmap= appears on the kernel command
line, on the basis that emulated persistent memory is somewhat of a
niche use case, and physical KASLR is not as important as virtual
KASLR (which shouldn't be implicated in this).
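
A minimal sketch for checking the numbers in the report, assuming the memmap= entries use the size!start form with M suffixes as quoted above. The helper name parse_memmap is hypothetical, and the 2 MiB alignment check is an assumption made for illustration, not a statement of what nd_pmem actually requires.

```python
import re

MIB = 1024 * 1024
GIB = 1024 * MIB


def parse_memmap(value):
    """Parse 'size!start' entries (only the 'M' suffix) into (start, size) in bytes."""
    regions = []
    for entry in value.split(","):
        m = re.fullmatch(r"(\d+)M!(\d+)M", entry)
        if m is None:
            raise ValueError(f"unsupported memmap entry: {entry!r}")
        size, start = (int(g) * MIB for g in m.groups())
        regions.append((start, size))
    return regions


# Values copied from the report: the memmap= parameter and the fdisk byte counts.
regions = parse_memmap("364416M!28672M,367488M!419840M")
reported = [382117871616, 27246198784]

for i, ((start, size), seen) in enumerate(zip(regions, reported)):
    print(f"pmem{i}: start {start / GIB:.3f} GiB, expected {size / GIB:.3f} GiB, "
          f"reported {seen / GIB:.3f} GiB, missing {(size - seen) / GIB:.3f} GiB")

# Range from the dmesg line (the end address is inclusive).
lo, hi = 0x5C2042C000, 0x5FF7FFFFFF
r0_start, r0_size = regions[0]
inside_region0 = r0_start <= lo and hi < r0_start + r0_size
aligned_2m = lo % (2 * MIB) == 0  # 2 MiB is an assumed alignment, for illustration only
print(f"dmesg range inside first memmap region: {inside_region0}")
print(f"dmesg range start 2 MiB aligned: {aligned_2m}")
```

With the values from the report, pmem0 matches its memmap size exactly, pmem1 is 333.5 GiB short, and the range in the dmesg line lies inside the first memmap region and ends exactly where that region ends.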


2024-05-15 18:30:13

by Kees Cook

Subject: Re: Regression in 6.1.81: Missing memory in pmem device



On May 15, 2024 10:42:49 AM PDT, Ard Biesheuvel <[email protected]> wrote:
>(cc Kees)
>
>On Wed, 15 May 2024 at 19:32, Chaney, Ben <[email protected]> wrote:
>>
>> Hello,
>> I encountered an issue when upgrading to 6.1.89 from 6.1.77. This upgrade caused a breakage in emulated persistent memory. Significant amounts of memory are missing from a pmem device:
>>
>> fdisk -l /dev/pmem*
>> Disk /dev/pmem0: 355.9 GiB, 382117871616 bytes, 746323968 sectors
>> Units: sectors of 1 * 512 = 512 bytes
>> Sector size (logical/physical): 512 bytes / 4096 bytes
>> I/O size (minimum/optimal): 4096 bytes / 4096 bytes
>>
>> Disk /dev/pmem1: 25.38 GiB, 27246198784 bytes, 53215232 sectors
>> Units: sectors of 1 * 512 = 512 bytes
>> Sector size (logical/physical): 512 bytes / 4096 bytes
>> I/O size (minimum/optimal): 4096 bytes / 4096 bytes
>>
>> The memmap parameter that created these pmem devices is “memmap=364416M!28672M,367488M!419840M”, which should cause a much larger amount of memory to be allocated to /dev/pmem1. The amount of missing memory and the device it is missing from is randomized on each reboot. There is some amount of memory missing in almost all cases, but not 100% of the time. Notably, the memory that is missing from these devices is not reclaimed by the system for general use. This system in question has 768GB of memory split evenly across two NUMA nodes.
>>
>> When the error occurs, there are also the following error messages showing up in dmesg:
>>
>> [ 5.318317] nd_pmem namespace1.0: [mem 0x5c2042c000-0x5ff7ffffff flags 0x200] misaligned, unable to map
>> [ 5.335073] nd_pmem: probe of namespace1.0 failed with error -95
>>
>> Bisection implicates 2dfaeac3f38e4e550d215204eedd97a061fdc118 as the patch that first caused the issue. I believe the cause of the issue is that the EFI stub is randomizing the location of the decompressed kernel without accounting for the memory map, and it is clobbering some of the memory that has been reserved for pmem.
>>
>
>Does using 'nokaslr' on the kernel command line work around this?
>
>I think in this particular case, we could just disable physical KASLR
>(but retain virtual KASLR) if memmap= appears on the kernel command
>line, on the basis that emulated persistent memory is somewhat of a
>niche use case, and physical KASLR is not as important as virtual
>KASLR (which shouldn't be implicated in this).

Yeah, that seems reasonable to me. As long as we put a notice to dmesg that physical ASLR was disabled due to memmap's physical reservation. If this usage becomes more common, we should find a better way, though.
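
The policy being discussed can be sketched as a toy model. This is not the EFI stub's code: the function name physical_kaslr_policy, the return shape, and the message strings are all hypothetical, and real command-line parsing in the kernel is more involved than a whitespace split.

```python
def physical_kaslr_policy(cmdline):
    """Return (virtual_kaslr, physical_kaslr, note) for a kernel command line.

    Toy model of the proposal: 'nokaslr' turns everything off, while a
    'memmap=' parameter turns off only physical randomization and yields
    a note that could be logged.
    """
    params = cmdline.split()
    if "nokaslr" in params:
        return False, False, "KASLR disabled: 'nokaslr' on command line"
    if any(p.startswith("memmap=") for p in params):
        return True, False, "Physical KASLR disabled due to memmap= physical reservation"
    return True, True, ""


for line in ("ro quiet", "ro quiet nokaslr", "ro memmap=364416M!28672M,367488M!419840M"):
    virt, phys, note = physical_kaslr_policy(line)
    print(f"{line!r}: virtual={virt} physical={phys} {note}")
```

Keeping the note as a separate return value mirrors the request that the reason for disabling physical randomization be visible in the log.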

This reminds me a bit of the work Steve has been exploring:
https://lore.kernel.org/all/[email protected]/



--
Kees Cook

2024-05-16 16:38:23

by Chaney, Ben

Subject: Re: Regression in 6.1.81: Missing memory in pmem device

The 'nokaslr' flag does work around this issue, but using it has a few downsides.

First, we would like the security benefit provided by ASLR. Also, this imposes a restriction on what memmaps are possible. It would then be required to have them offset from the beginning of the memory.

I also think there are a few other features that may be impacted by this, that were not addressed by the patch. crashkernel and pstore both probably need physical kaslr disabled as well.

Thanks,
Ben


On 5/15/24, 2:30 PM, "Kees Cook" <[email protected]> wrote:

On May 15, 2024 10:42:49 AM PDT, Ard Biesheuvel <[email protected]> wrote:
>(cc Kees)
>
>On Wed, 15 May 2024 at 19:32, Chaney, Ben <[email protected]> wrote:
>>
>> Hello,
>> I encountered an issue when upgrading to 6.1.89 from 6.1.77. This upgrade caused a breakage in emulated persistent memory. Significant amounts of memory are missing from a pmem device:
>>
>> fdisk -l /dev/pmem*
>> Disk /dev/pmem0: 355.9 GiB, 382117871616 bytes, 746323968 sectors
>> Units: sectors of 1 * 512 = 512 bytes
>> Sector size (logical/physical): 512 bytes / 4096 bytes
>> I/O size (minimum/optimal): 4096 bytes / 4096 bytes
>>
>> Disk /dev/pmem1: 25.38 GiB, 27246198784 bytes, 53215232 sectors
>> Units: sectors of 1 * 512 = 512 bytes
>> Sector size (logical/physical): 512 bytes / 4096 bytes
>> I/O size (minimum/optimal): 4096 bytes / 4096 bytes
>>
>> The memmap parameter that created these pmem devices is “memmap=364416M!28672M,367488M!419840M”, which should cause a much larger amount of memory to be allocated to /dev/pmem1. The amount of missing memory and the device it is missing from is randomized on each reboot. There is some amount of memory missing in almost all cases, but not 100% of the time. Notably, the memory that is missing from these devices is not reclaimed by the system for general use. This system in question has 768GB of memory split evenly across two NUMA nodes.
>>
>> When the error occurs, there are also the following error messages showing up in dmesg:
>>
>> [ 5.318317] nd_pmem namespace1.0: [mem 0x5c2042c000-0x5ff7ffffff flags 0x200] misaligned, unable to map
>> [ 5.335073] nd_pmem: probe of namespace1.0 failed with error -95
>>
>> Bisection implicates 2dfaeac3f38e4e550d215204eedd97a061fdc118 as the patch that first caused the issue. I believe the cause of the issue is that the EFI stub is randomizing the location of the decompressed kernel without accounting for the memory map, and it is clobbering some of the memory that has been reserved for pmem.
>>
>
>Does using 'nokaslr' on the kernel command line work around this?
>
>I think in this particular case, we could just disable physical KASLR
>(but retain virtual KASLR) if memmap= appears on the kernel command
>line, on the basis that emulated persistent memory is somewhat of a
>niche use case, and physical KASLR is not as important as virtual
>KASLR (which shouldn't be implicated in this).


Yeah, that seems reasonable to me. As long as we put a notice to dmesg that physical ASLR was disabled due to memmap's physical reservation. If this usage becomes more common, we should find a better way, though.

This reminds me a bit of the work Steve has been exploring:
https://lore.kernel.org/all/[email protected]/

--
Kees Cook



2024-05-16 17:23:05

by Ard Biesheuvel

Subject: Re: Regression in 6.1.81: Missing memory in pmem device

On Thu, 16 May 2024 at 16:59, Chaney, Ben <[email protected]> wrote:
>
> The 'nokaslr' flag does work around this issue, but using it has a few downsides.
>
> First, we would like the security benefit provided by ASLR.

We wouldn't need to disable virtual KASLR, only physical KASLR.

> Also, this imposes a restriction on what memmaps are possible. It would then be required to have them offset from the beginning of the memory.
>

Relying on the KASLR code to move the kernel away from the base of RAM
is rather risky - even when KASLR is in effect, the logic will fall
back to placement at the base of memory if physical randomization is
not possible for any reason.

> I also think there are a few other features that may be impacted by this, that were not addressed by the patch. crashkernel and pstore both probably need physical kaslr disabled as well.
>

Please reply to the patch if you have any comments on it. Thanks.