2023-05-16 09:27:33

by Song Shuai

Subject: Bug report: kernel panicked when system hibernates

Description of problem:

The latest hibernation support[1] in RISC-V Linux produced a kernel panic.
The full log is posted at this link: https://termbin.com/sphl .

How reproducible:

You can reproduce it with the following steps:

1. Prepare the environment with:
- QEMU virt v8.0.0 (with OpenSBI v1.2)
- Linux v6.4-rc1

2. Start the QEMU virt machine:
```sh
$ cat ~/8_riscv/start_latest.sh
#!/bin/bash
/home/song/8_riscv/3_acpi/qemu/ooo/usr/local/bin/qemu-system-riscv64 \
-smp 2 -m 4G -nographic -machine virt \
-kernel /home/song/9_linux/linux/00_rv_test/arch/riscv/boot/Image \
-append "root=/dev/vda ro earlycon=uart8250,mmio,0x10000000
early_ioremap_debug console=ttyS0 loglevel=8 memblock=debug
no_console_suspend audit=0 3" \
-drive file=/home/song/8_riscv/fedora/stage4-disk.img,format=raw,id=hd0 \
-device virtio-blk-device,drive=hd0 \
-drive file=/home/song/8_riscv/fedora/adisk.qcow2,format=qcow2,id=hd1 \
-device virtio-blk-device,drive=hd1 \
-gdb tcp::1236 #-S
```
3. Execute hibernation:

```sh
swapon /dev/vdb2 # this is my swap disk

echo disk > /sys/power/state
```

4. You will then hit the kernel panic logged at the link above.


Other Information:

After my initial (and incomplete) digging, commit 3335068f8721
("riscv: Use PUD/P4D/PGD pages for the linear mapping")[2]
appears closely related to this panic. That commit uses the re-defined
`MIN_MEMBLOCK_ADDR` to discover the entire system memory
and changes the base of `va_pa_offset` from `kernel_map.phys_addr` to
`phys_ram_base` for the linear mapping.

If the firmware reports its own memory region (e.g. a PMP-protected
region in OpenSBI) without the "no-map" property,
this commit results in the firmware memory being directly mapped by
`create_linear_mapping_page_table()`.

We can see the mapping via ptdump :
```
---[ Linear mapping ]---
0xff60000000000000-0xff60000000200000 0x0000000080000000 2M PMD D A G . . W R V   <-- the firmware memory
0xff60000000200000-0xff60000000c00000 0x0000000080200000 10M PMD D A G . . . R V
0xff60000000c00000-0xff60000001000000 0x0000000080c00000 4M PMD D A G . . W R V
0xff60000001000000-0xff60000001600000 0x0000000081000000 6M PMD D A G . . . R V
0xff60000001600000-0xff60000040000000 0x0000000081600000 1002M PMD D A G . . W R V
0xff60000040000000-0xff60000100000000 0x00000000c0000000 3G PUD D A G . . W R V
---[ Modules/BPF mapping ]---
---[ Kernel mapping ]---
0xffffffff80000000-0xffffffff80a00000 0x0000000080200000 10M PMD D A G . X . R V
0xffffffff80a00000-0xffffffff80c00000 0x0000000080c00000 2M PMD D A G . . . R V
0xffffffff80c00000-0xffffffff80e00000 0x0000000080e00000 2M PMD D A G . . W R V
0xffffffff80e00000-0xffffffff81400000 0x0000000081000000 6M PMD D A G . . . R V
0xffffffff81400000-0xffffffff81800000 0x0000000081600000 4M PMD
```

In the hibernation process, `swsusp_save()` calls
`copy_data_pages(&copy_bm, &orig_bm)` to copy these two memory
bitmaps; the Oops (load access fault) occurred while copying the page
at PAGE_OFFSET (which maps the firmware memory).

I also ran two other tests:

Test 1:

Hibernation works fine with commit 3335068f8721 reverted, at least in
the current environment.

Test 2:

I built a simple kernel module that reads the value at the
`PAGE_OFFSET` address, and the same load access fault panic occurred.
So hibernation is not the only way to trigger this panic.
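For reference, a minimal sketch of such a test module might look like the one below. This is a reconstruction, not the exact module used for Test 2; the function names and log message are placeholders:

```c
/* Hypothetical reconstruction of the Test 2 module: after commit
 * 3335068f8721, PAGE_OFFSET (the start of the linear mapping) covers
 * the firmware memory, so this single load is enough to trigger the
 * same load access fault. */
#include <linux/init.h>
#include <linux/module.h>
#include <linux/mm.h>

static int __init page_offset_read_init(void)
{
	/* Read the first word of the linear mapping. */
	unsigned long val = *(volatile unsigned long *)PAGE_OFFSET;

	pr_info("value at PAGE_OFFSET: %lx\n", val);
	return 0;
}

static void __exit page_offset_read_exit(void) { }

module_init(page_offset_read_init);
module_exit(page_offset_read_exit);
MODULE_LICENSE("GPL");
```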

Finally, should we always leave the firmware memory with the
`MEMBLOCK_NOMAP` flag, through some effort in Linux or OpenSBI (at
least in the current environment)? Or are there any other suggestions?
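For illustration, marking the firmware region with the "no-map" property in the devicetree would look roughly like this. The address and size come from the ptdump output above (the 2M region at 0x80000000); the node name is an assumption (OpenSBI generates similarly named reserved-memory nodes):

```dts
/* Hypothetical reserved-memory node: with "no-map", the kernel marks
 * the region MEMBLOCK_NOMAP and keeps it out of the linear mapping. */
reserved-memory {
	#address-cells = <2>;
	#size-cells = <2>;
	ranges;

	mmode_resv0@80000000 {
		reg = <0x0 0x80000000 0x0 0x200000>;
		no-map;
	};
};
```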

Please correct me if I'm wrong.

[1]: https://lore.kernel.org/r/[email protected]
[2]: https://lore.kernel.org/r/[email protected]

--
Thanks,
Song


2023-05-16 10:07:14

by Sia Jee Heng

Subject: RE: Bug report: kernel panicked when system hibernates

Hi Song,

Thanks for the investigation. Indeed, the exposure of the PMP-reserved region to the kernel page table is causing the problem.
Here is a similar report: https://groups.google.com/u/0/a/groups.riscv.org/g/sw-dev/c/ITXwaKfA6z8

Thanks
Regards
Jee Heng


2023-05-16 11:37:22

by Sia Jee Heng

Subject: RE: Bug report: kernel panicked when system hibernates



> -----Original Message-----
> From: Alexandre Ghiti <[email protected]>
> Sent: Tuesday, May 16, 2023 7:15 PM
> To: JeeHeng Sia <[email protected]>
> Cc: Song Shuai <[email protected]>; [email protected]; Andrew Jones <[email protected]>; [email protected];
> [email protected]; Leyfoon Tan <[email protected]>; Mason Huo <[email protected]>; Paul Walmsley
> <[email protected]>; Conor Dooley <[email protected]>; Guo Ren <[email protected]>; linux-
> [email protected]; [email protected]
> Subject: Re: Bug report: kernel paniced when system hibernates
>
> Hi JeeHeng,
>
> On Tue, May 16, 2023 at 11:55 AM JeeHeng Sia
> <[email protected]> wrote:
> >
> > Hi Song,
> >
> > Thanks for the investigation. Indeed, the exposure of the PMP reserved region to the kernel page table is causing the problem.
> > Here is the similar report: https://groups.google.com/u/0/a/groups.riscv.org/g/sw-dev/c/ITXwaKfA6z8
>
> IMO, we should discuss the kernel related stuff on the linux riscv ML,
> I'm not subscribed to the group above and you did not answer my last
> direct emails regarding this problem either.
Hi Alex, it's strange that I haven't received a reply from you. It seems I forgot to copy the mailing list.
By the way, the reason hibernation fails with the recent page table changes is that, during the initiation of the hibernation process, the hibernation core walks the page table to check whether each page can be saved. Since the page is reserved yet still present in the page table, the kernel tries to access the PMP region, which causes the kernel panic. The check can be found in kernel/power/snapshot.c, line 1340:
if (PageReserved(page) && (!kernel_page_present(page) || pfn_is_nosave(pfn)))
If the virtual memory is mapped in the kernel page table, the hibernation core will try to access the page.

> Thanks,
>
> Alex
>

2023-05-16 11:37:54

by Alexandre Ghiti

Subject: Re: Bug report: kernel panicked when system hibernates

Hi JeeHeng,

On Tue, May 16, 2023 at 11:55 AM JeeHeng Sia
<[email protected]> wrote:
>
> Hi Song,
>
> Thanks for the investigation. Indeed, the exposure of the PMP reserved region to the kernel page table is causing the problem.
> Here is the similar report: https://groups.google.com/u/0/a/groups.riscv.org/g/sw-dev/c/ITXwaKfA6z8

IMO, we should discuss kernel-related stuff on the linux-riscv ML;
I'm not subscribed to the group above, and you did not answer my last
direct emails regarding this problem either.

Thanks,

Alex


2023-05-16 11:47:22

by Conor Dooley

Subject: Re: Bug report: kernel panicked when system hibernates

On Tue, May 16, 2023 at 09:24:07AM +0000, Song Shuai wrote:
> Description of problem:
>
> The latest hibernation support[1] of RISC-V Linux produced a kernel panic.
> The entire log has been posted at this link: https://termbin.com/sphl .

Since the reporting is a bit scattered between here and the sw-dev list:

#regzbot introduced: v6.4-rc1.. ^
#regzbot title: kernel accessing opensbi reserved memory



2023-05-16 12:00:17

by Alexandre Ghiti

Subject: Re: Bug report: kernel panicked when system hibernates

Hi Song,

On Tue, May 16, 2023 at 11:24 AM Song Shuai <[email protected]> wrote:
>
> Description of problem:
>
> The latest hibernation support[1] of RISC-V Linux produced a kernel panic.
> The entire log has been posted at this link: https://termbin.com/sphl .
>
> [...]
>
> In the hibernation process, `swsusp_save()` calls
> `copy_data_pages(&copy_bm, &orig_bm)` to copy these two memory
> bitmaps; the Oops (load access fault) occurred while copying the page
> at PAGE_OFFSET (which maps the firmware memory).

I'm not saying that the hibernation process is at fault here, but it's
weird that it is trying to access pages that are not available to the
kernel: this region is mapped in the page table so that we can use a
1GB page, but it is reserved so that it is not added to the kernel
memory pool.

>
> I also ran two other tests:
>
> Test 1:
>
> Hibernation works fine with commit 3335068f8721 reverted, at least in
> the current environment.
>
> Test 2:
>
> I built a simple kernel module that reads the value at the
> `PAGE_OFFSET` address, and the same load access fault panic occurred.
> So hibernation is not the only way to trigger this panic.
>
> Finally, should we always leave the firmware memory with the
> `MEMBLOCK_NOMAP` flag, through some effort in Linux or OpenSBI (at
> least in the current environment)? Or are there any other suggestions?
>

I actually removed this flag a few years ago, and I have to admit that
I need to check whether it is necessary: the goal of commit 3335068f8721
("riscv: Use PUD/P4D/PGD pages for the linear mapping") is to expose
the "right" start of DRAM so that we can align virtual and physical
addresses on a 1GB boundary.

So I have to check whether a nomap region is actually added to
memblock.memory.regions[]: if yes, perfect, let's add the nomap
attribute to the PMP regions; otherwise, I don't think that is a good
solution.
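Such a check could be sketched as the kernel-style fragment below. This is not an actual patch; `for_each_mem_region()` and `memblock_is_nomap()` are existing memblock helpers, but where to hook this in is exactly the open question:

```c
/* Sketch: walk memblock.memory and report nomap regions. If nomap
 * regions do appear here, marking the PMP region nomap would keep the
 * "right" start of DRAM visible for 1GB alignment while preventing the
 * kernel from mapping (and touching) the firmware memory. */
#include <linux/memblock.h>

static void __init dump_nomap_regions(void)
{
	struct memblock_region *region;

	for_each_mem_region(region) {
		if (memblock_is_nomap(region))
			pr_info("nomap region: base=%pa size=%pa\n",
				&region->base, &region->size);
	}
}
```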

And a last word: Mike Rapoport recently gave a talk [1] in which he
states that mapping the linear mapping with hugepages does not yield
better performance, so *maybe* reverting this commit is a solution too,
since it may not provide the expected benefits (even though I'd rather
keep it; another benefit of mapping the linear mapping with 1GB
hugepages is faster boot, but that needs to be measured).

[1] https://lwn.net/Articles/931406/

> Please correct me if I'm wrong.
>
> [1]: https://lore.kernel.org/r/[email protected]
> [2]: https://lore.kernel.org/r/[email protected]
>
> --
> Thanks,
> Song

Thanks for the thorough report!

Alex

2023-05-17 09:06:11

by Alexandre Ghiti

Subject: Re: Bug report: kernel panicked when system hibernates

On Tue, May 16, 2023 at 1:27 PM JeeHeng Sia
<[email protected]> wrote:
>
>
>
> > -----Original Message-----
> > From: Alexandre Ghiti <[email protected]>
> > Sent: Tuesday, May 16, 2023 7:15 PM
> > To: JeeHeng Sia <[email protected]>
> > Cc: Song Shuai <[email protected]>; [email protected]; Andrew Jones <[email protected]>; [email protected];
> > [email protected]; Leyfoon Tan <[email protected]>; Mason Huo <[email protected]>; Paul Walmsley
> > <[email protected]>; Conor Dooley <[email protected]>; Guo Ren <[email protected]>; linux-
> > [email protected]; [email protected]
> > Subject: Re: Bug report: kernel paniced when system hibernates
> >
> > Hi JeeHeng,
> >
> > On Tue, May 16, 2023 at 11:55 AM JeeHeng Sia
> > <[email protected]> wrote:
> > >
> > > Hi Song,
> > >
> > > Thanks for the investigation. Indeed, the exposure of the PMP reserved region to the kernel page table is causing the problem.
> > > Here is the similar report: https://groups.google.com/u/0/a/groups.riscv.org/g/sw-dev/c/ITXwaKfA6z8
> >
> > IMO, we should discuss the kernel related stuff on the linux riscv ML,
> > I'm not subscribed to the group above and you did not answer my last
> > direct emails regarding this problem either.
> Hi Alex, it's strange that I haven't received a reply from you.

I should stop using my personal email, as it gets blocked more and
more; I don't know why.

> Seems like I have forgot to update to the mailing list.
> By the way, the reason hibernation failed with the recent page table is because during the initiation of the hibernation process, the hibernation core accesses the page table to check if the page can be saved. Since the page is reserved and is available to the page table, the kernel tries to access the PMP region, which causes a kernel panic. The code can be found under:
> kernel/power/snapshot.c, line 1340:
> if (PageReserved(page) && (!kernel_page_present(page) || pfn_is_nosave(pfn)))

Thanks for this pointer, it really helps!

> If the virtual memory is mapped to the kernel page table, then hibernation core will try to access the page.
>
> > Thanks,
> >
> > Alex
> >
> > >
> > > Thanks
> > > Regards
> > > Jee Heng

Is your issue also related to hibernation?

> > >
> > > > -----Original Message-----
> > > > From: Song Shuai <[email protected]>
> > > > Sent: Tuesday, May 16, 2023 5:24 PM
> > > > To: [email protected]; [email protected]; Andrew Jones <[email protected]>; [email protected];
> > > > [email protected]; JeeHeng Sia <[email protected]>; Leyfoon Tan <[email protected]>; Mason Huo
> > > > <[email protected]>; Paul Walmsley <[email protected]>; Conor Dooley <[email protected]>;
> > Guo
> > > > Ren <[email protected]>
> > > > Cc: [email protected]; [email protected]
> > > > Subject: Bug report: kernel paniced when system hibernates
> > > >
> > > > Description of problem:
> > > >
> > > > The latest hibernation support[1] of RISC-V Linux produced a kernel panic.
> > > > The entire log has been posted at this link: https://termbin.com/sphl .
> > > >
> > > > How reproducible:
> > > >
> > > > You can reproduce it with the following step :
> > > >
> > > > 1. prepare the environment with
> > > > - Qemu-virt v8.0.0 (with OpenSbi v1.2)
> > > > - Linux v6.4-rc1
> > > >
> > > > 2. start the Qemu virt
> > > > ```sh
> > > > $ cat ~/8_riscv/start_latest.sh
> > > > #!/bin/bash
> > > > /home/song/8_riscv/3_acpi/qemu/ooo/usr/local/bin/qemu-system-riscv64 \
> > > > -smp 2 -m 4G -nographic -machine virt \
> > > > -kernel /home/song/9_linux/linux/00_rv_test/arch/riscv/boot/Image \
> > > > -append "root=/dev/vda ro eaylycon=uart8250,mmio,0x10000000
> > > > early_ioremap_debug console=ttyS0 loglevel=8 memblock=debug
> > > > no_console_suspend audit=0 3" \
> > > > -drive file=/home/song/8_riscv/fedora/stage4-disk.img,format=raw,id=hd0 \
> > > > -device virtio-blk-device,drive=hd0 \
> > > > -drive file=/home/song/8_riscv/fedora/adisk.qcow2,format=qcow2,id=hd1 \
> > > > -device virtio-blk-device,drive=hd1 \
> > > > -gdb tcp::1236 #-S
> > > > ```
> > > > 3. execute hibernation
> > > >
> > > > ```sh
> > > > swapon /dev/vdb2 # this is my swap disk
> > > >
> > > > echo disk > /sys/power/state
> > > > ```
> > > >
> > > > 4. Then you will encounter the kernel panic logged in the above link
> > > >
> > > >
> > > > Other Information:
> > > >
> > > > After my initial and incomplete dig-up, the commit (3335068f8721
> > > > "riscv: Use PUD/P4D/PGD pages for the linear mapping")[2]
> > > > is closely related to this panic. This commit uses re-defined
> > > > `MIN_MEMBLOCK_ADDR` to discover the entire system memory
> > > > and extends the `va_pa_offset` from `kernel_map.phys_addr` to
> > > > `phys_ram_base` for linear memory mapping.
> > > >
> > > > If the firmware delivers the firmware memory region (e.g. a PMP
> > > > protected region in OpenSBI) without the "no-map" property,
> > > > this commit results in the firmware memory being directly mapped by
> > > > `create_linear_mapping_page_table()`.
> > > >
> > > > We can see the mapping via ptdump:
> > > > ```
> > > > ---[ Linear mapping ]---
> > > > 0xff60000000000000-0xff60000000200000 0x0000000080000000 2M PMD D A G . . W R V   <-- the firmware memory
> > > > 0xff60000000200000-0xff60000000c00000 0x0000000080200000 10M PMD D A G . . . R V
> > > > 0xff60000000c00000-0xff60000001000000 0x0000000080c00000 4M PMD D A G . . W R V
> > > > 0xff60000001000000-0xff60000001600000 0x0000000081000000 6M PMD D A G . . . R V
> > > > 0xff60000001600000-0xff60000040000000 0x0000000081600000 1002M PMD D A G . . W R V
> > > > 0xff60000040000000-0xff60000100000000 0x00000000c0000000 3G PUD D A G . . W R V
> > > > ---[ Modules/BPF mapping ]---
> > > > ---[ Kernel mapping ]---
> > > > 0xffffffff80000000-0xffffffff80a00000 0x0000000080200000 10M PMD D A G . X . R V
> > > > 0xffffffff80a00000-0xffffffff80c00000 0x0000000080c00000 2M PMD D A G . . . R V
> > > > 0xffffffff80c00000-0xffffffff80e00000 0x0000000080e00000 2M PMD D A G . . W R V
> > > > 0xffffffff80e00000-0xffffffff81400000 0x0000000081000000 6M PMD D A G . . . R V
> > > > 0xffffffff81400000-0xffffffff81800000 0x0000000081600000 4M PMD
> > > > ```
> > > >
> > > > In the hibernation process, `swsusp_save()` calls
> > > > `copy_data_pages(&copy_bm, &orig_bm)` to copy the pages tracked by
> > > > these two memory bitmaps;
> > > > the Oops (load access fault) occurred while copying the page at
> > > > `PAGE_OFFSET` (which maps the firmware memory).
> > > >
> > > > I also did two other tests:
> > > > Test1:
> > > >
> > > > The hibernation works well in the kernel with the commit 3335068f8721
> > > > reverted at least in the current environment.
> > > >
> > > > Test2:
> > > >
> > > > I built a simple kernel module that reads the value at the
> > > > `PAGE_OFFSET` address, and the same load access fault panic
> > > > occurred.
> > > > So hibernation does not seem to be the only case that triggers this panic.
> > > >
> > > > Finally, should we always leave the firmware memory marked with the
> > > > `MEMBLOCK_NOMAP` flag, through some effort in Linux or OpenSBI (at least
> > > > in the current environment), or does anyone have other suggestions?
> > > >
> > > > Please correct me if I'm wrong.
> > > >
> > > > [1]: https://lore.kernel.org/r/[email protected]
> > > > [2]: https://lore.kernel.org/r/[email protected]
> > > >
> > > > --
> > > > Thanks,
> > > > Song

2023-05-17 09:26:52

by Alexandre Ghiti

Subject: Re: Bug report: kernel panicked when system hibernates

On Tue, May 16, 2023 at 1:12 PM Alexandre Ghiti <[email protected]> wrote:
>
> Hi Song,
>
> On Tue, May 16, 2023 at 11:24 AM Song Shuai <[email protected]> wrote:
> >
> > [...]
> > In the hibernation process, `swsusp_save()` calls
> > `copy_data_pages(&copy_bm, &orig_bm)` to copy the pages tracked by
> > these two memory bitmaps;
> > the Oops (load access fault) occurred while copying the page at
> > `PAGE_OFFSET` (which maps the firmware memory).
>
> I'm not saying that the hibernation process is at fault here, but
> it's weird that it is trying to access pages that are not available
> to the kernel: this region is mapped in the page table so that we can
> use a 1GB page, but it is reserved so that it is not added to the
> kernel memory pool.
>
> >
> > I also did two other tests:
> > Test1:
> >
> > The hibernation works well in the kernel with the commit 3335068f8721
> > reverted at least in the current environment.
> >
> > Test2:
> >
> > I built a simple kernel module that reads the value at the
> > `PAGE_OFFSET` address, and the same load access fault panic
> > occurred.
> > So hibernation does not seem to be the only case that triggers this panic.
> >
> > Finally, should we always leave the firmware memory marked with the
> > `MEMBLOCK_NOMAP` flag, through some effort in Linux or OpenSBI (at least
> > in the current environment), or does anyone have other suggestions?
> >
>
> I actually removed this flag a few years ago, and I have to admit that
> I need to check if that's necessary: the goal of commit 3335068f8721
> ("riscv: Use PUD/P4D/PGD pages for the linear mapping") is to expose
> the "right" start of DRAM so that we can align virtual and physical
> addresses on a 1GB boundary.
>
> So I have to check if a nomap region is actually added as a
> memblock.memory.regions[] or not: if yes, that's perfect, let's add
> the nomap attributes to the PMP regions, otherwise, I don't think that
> is a good solution.

So here is the current linear mapping without nomap in openSBI:

---[ Linear mapping ]---
0xff60000000000000-0xff60000000200000 0x0000000080000000 2M PMD D A G . . W R V
0xff60000000200000-0xff60000000e00000 0x0000000080200000 12M PMD D A G . . . R V

And below the linear mapping with nomap in openSBI:

---[ Linear mapping ]---
0xff60000000080000-0xff60000000200000 0x0000000080080000 1536K PTE D A G . . W R V
0xff60000000200000-0xff60000000e00000 0x0000000080200000 12M PMD D A G . . . R V

So adding nomap does not misalign virtual and physical addresses, it
prevents the usage of 1GB page for this area though, so that's a
solution, we just lose this 1GB page here.
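For reference, the way firmware advertises such a region is a reserved-memory node in the device tree; the fragment below is a hypothetical sketch (node name and addresses are assumptions based on the QEMU virt layout quoted earlier, not dumped from a real run) showing where the "no-map" property would sit:

```dts
/* Hypothetical reserved-memory fragment: advertising the firmware
 * PMP region with "no-map" keeps it out of the linear mapping. */
reserved-memory {
	#address-cells = <2>;
	#size-cells = <2>;
	ranges;

	mmode_resv0@80000000 {
		reg = <0x0 0x80000000 0x0 0x200000>;
		no-map;
	};
};
```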

But even though that may be the fix, I think we also need to fix that
in the kernel as it would break compatibility with certain versions of
openSBI *if* we fix openSBI...So here are a few solutions:

1. we can mark all "mmode_resv" nodes in the device tree as nomap,
before the linear mapping is established (IIUC, those nodes are added
by openSBI to advertise PMP regions)
-> This amounts to the same fix as opensbi and we lose the 1GB hugepage.
2. we can tweak pfn_is_nosave function to *not* save pfn corresponding
to PMP regions
-> We don't lose the 1GB hugepage \o/
3. we can use register_nosave_region() to not save the "mmode_resv"
regions (x86 does that
https://elixir.bootlin.com/linux/v6.4-rc1/source/arch/x86/kernel/e820.c#L753)
-> We don't lose the 1GB hugepage \o/
4. Given JeeHeng pointer to
https://elixir.bootlin.com/linux/v6.4-rc1/source/kernel/power/snapshot.c#L1340,
we can mark those pages as non-readable and make the hibernation
process not save those pages
-> Very late-in-the-day idea, not sure what it's worth, we also
lose the 1GB hugepage...

To me, the best solution is 3 as it would prepare for other similar
issues later, it is similar to x86 and it allows us to keep 1GB
hugepages.

I have been thinking, and to me nomap does not provide anything since
the kernel should not address this memory range, so if it does, we
must fix the kernel.

Let me know what you all think, I'll be preparing a PoC of 3 in the meantime!

Alex



>
> And a last word: Mike Rapoport recently gave a speech [1] where he
> states that mapping the linear mapping with hugepages does not give
> rise to better performance so *maybe* reverting this commit may be a
> solution too as it may not provide the expected benefits (even though
> I'd rather have it and another benefit of mapping the linear mapping
> with 1GB hugepages is that it is faster to boot, but that needs to be
> measured).
>
> [1] https://lwn.net/Articles/931406/
>
> > Please correct me if I'm wrong.
> >
> > [1]: https://lore.kernel.org/r/[email protected]
> > [2]: https://lore.kernel.org/r/[email protected]
> >
> > --
> > Thanks,
> > Song
>
> Thanks for the thorough report!
>
> Alex

2023-05-17 11:11:52

by Song Shuai

Subject: Re: Bug report: kernel panicked when system hibernates

Alexandre Ghiti <[email protected]> 于2023年5月17日周三 08:58写道:
>
> On Tue, May 16, 2023 at 1:12 PM Alexandre Ghiti <[email protected]> wrote:
> >
> > Hi Song,
> >
> > On Tue, May 16, 2023 at 11:24 AM Song Shuai <[email protected]> wrote:
> > >
> > > [...]
> >
> > I'm not saying that the hibernation process is at fault here, but
> > it's weird that it is trying to access pages that are not available
> > to the kernel: this region is mapped in the page table so that we can
> > use a 1GB page, but it is reserved so that it is not added to the
> > kernel memory pool.
Yes, my fault: Test2 is not a correct test case.
> >
> > >
> > > [...]
> >
> > I actually removed this flag a few years ago, and I have to admit that
> > I need to check if that's necessary: the goal of commit 3335068f8721
> > ("riscv: Use PUD/P4D/PGD pages for the linear mapping") is to expose
> > the "right" start of DRAM so that we can align virtual and physical
> > addresses on a 1GB boundary.
> >
> > So I have to check if a nomap region is actually added as a
> > memblock.memory.regions[] or not: if yes, that's perfect, let's add
> > the nomap attributes to the PMP regions, otherwise, I don't think that
> > is a good solution.
>
> So here is the current linear mapping without nomap in openSBI:
>
> ---[ Linear mapping ]---
> 0xff60000000000000-0xff60000000200000 0x0000000080000000 2M PMD D A G . . W R V
> 0xff60000000200000-0xff60000000e00000 0x0000000080200000 12M PMD D A G . . . R V
>
> And below the linear mapping with nomap in openSBI:
>
> ---[ Linear mapping ]---
> 0xff60000000080000-0xff60000000200000 0x0000000080080000 1536K PTE D A G . . W R V
> 0xff60000000200000-0xff60000000e00000 0x0000000080200000 12M PMD D A G . . . R V
>
> So adding nomap does not misalign virtual and physical addresses, it
> prevents the usage of 1GB page for this area though, so that's a
> solution, we just lose this 1GB page here.
>
> But even though that may be the fix, I think we also need to fix that
> in the kernel as it would break compatibility with certain versions of
> openSBI *if* we fix openSBI...So here are a few solutions:
>
> 1. we can mark all "mmode_resv" nodes in the device tree as nomap,
> before the linear mapping is established (IIUC, those nodes are added
> by openSBI to advertise PMP regions)
> -> This amounts to the same fix as opensbi and we lose the 1GB hugepage.
> 2. we can tweak pfn_is_nosave function to *not* save pfn corresponding
> to PMP regions
> -> We don't lose the 1GB hugepage \o/
> 3. we can use register_nosave_region() to not save the "mmode_resv"
> regions (x86 does that
> https://elixir.bootlin.com/linux/v6.4-rc1/source/arch/x86/kernel/e820.c#L753)
> -> We don't lose the 1GB hugepage \o/
> 4. Given JeeHeng pointer to
> https://elixir.bootlin.com/linux/v6.4-rc1/source/kernel/power/snapshot.c#L1340,
> we can mark those pages as non-readable and make the hibernation
> process not save those pages
> -> Very late-in-the-day idea, not sure what it's worth, we also
> lose the 1GB hugepage...
>
> To me, the best solution is 3 as it would prepare for other similar
> issues later, it is similar to x86 and it allows us to keep 1GB
> hugepages.

I agree:
register_nosave_region() is a good way, early in initialization, to
put page frames (like those of the PMP regions) into forbidden_pages_map
and mark them as non-saveable for hibernation.

Looking forward to your fix.
>
> [...]



--
Thanks,
Song

2023-05-17 11:51:37

by Conor Dooley

Subject: Re: Bug report: kernel panicked when system hibernates

Hey Alex,

On Wed, May 17, 2023 at 10:58:02AM +0200, Alexandre Ghiti wrote:
> On Tue, May 16, 2023 at 1:12 PM Alexandre Ghiti <[email protected]> wrote:

> > On Tue, May 16, 2023 at 11:24 AM Song Shuai <[email protected]> wrote:
> > I actually removed this flag a few years ago, and I have to admit that
> > I need to check if that's necessary: the goal of commit 3335068f8721
> > ("riscv: Use PUD/P4D/PGD pages for the linear mapping") is to expose
> > the "right" start of DRAM so that we can align virtual and physical
> > addresses on a 1GB boundary.
> >
> > So I have to check if a nomap region is actually added as a
> > memblock.memory.regions[] or not: if yes, that's perfect, let's add
> > the nomap attributes to the PMP regions, otherwise, I don't think that
> > is a good solution.
>
> So here is the current linear mapping without nomap in openSBI:
>
> ---[ Linear mapping ]---
> 0xff60000000000000-0xff60000000200000 0x0000000080000000 2M PMD D A G . . W R V
> 0xff60000000200000-0xff60000000e00000 0x0000000080200000 12M PMD D A G . . . R V
>
> And below the linear mapping with nomap in openSBI:
>
> ---[ Linear mapping ]---
> 0xff60000000080000-0xff60000000200000 0x0000000080080000 1536K PTE D A G . . W R V
> 0xff60000000200000-0xff60000000e00000 0x0000000080200000 12M PMD D A G . . . R V
>
> So adding nomap does not misalign virtual and physical addresses, it
> prevents the usage of 1GB page for this area though, so that's a
> solution, we just lose this 1GB page here.
>
> But even though that may be the fix, I think we also need to fix that
> in the kernel as it would break compatibility with certain versions of
> openSBI *if* we fix openSBI...So here are a few solutions:
>
> 1. we can mark all "mmode_resv" nodes in the device tree as nomap,
> before the linear mapping is established (IIUC, those nodes are added
> by openSBI to advertise PMP regions)
> -> This amounts to the same fix as opensbi and we lose the 1GB hugepage.

AFAIU, losing the 1 GB hugepage is a regression, which would make this
not an option, right?

> 2. we can tweak pfn_is_nosave function to *not* save pfn corresponding
> to PMP regions
> -> We don't lose the 1GB hugepage \o/
> 3. we can use register_nosave_region() to not save the "mmode_resv"
> regions (x86 does that
> https://elixir.bootlin.com/linux/v6.4-rc1/source/arch/x86/kernel/e820.c#L753)
> -> We don't lose the 1GB hugepage \o/
> 4. Given JeeHeng pointer to
> https://elixir.bootlin.com/linux/v6.4-rc1/source/kernel/power/snapshot.c#L1340,
> we can mark those pages as non-readable and make the hibernation
> process not save those pages
> -> Very late-in-the-day idea, not sure what it's worth, we also
> lose the 1GB hugepage...

Ditto here re: introducing another regression.

> To me, the best solution is 3 as it would prepare for other similar
> issues later, it is similar to x86 and it allows us to keep 1GB
> hugepages.
>
> I have been thinking, and to me nomap does not provide anything since
> the kernel should not address this memory range, so if it does, we
> must fix the kernel.
>
> Let me know what you all think, I'll be preparing a PoC of 3 in the meantime!

#3 would probably get my vote too. It seems like you could use it
dynamically if there was to be a future other provider of "mmode_resv"
regions, rather than doing something location-specific.

We should probably document these opensbi reserved memory nodes though
in a dt-binding or w/e if we are going to be relying on them to not
crash!

Thanks for working on this,
Conor.



2023-05-17 15:06:54

by Alexandre Ghiti

Subject: Re: Bug report: kernel panicked when system hibernates

On Wed, May 17, 2023 at 1:28 PM Conor Dooley <[email protected]> wrote:
>
> Hey Alex,
>
> On Wed, May 17, 2023 at 10:58:02AM +0200, Alexandre Ghiti wrote:
> > On Tue, May 16, 2023 at 1:12 PM Alexandre Ghiti <[email protected]> wrote:
>
> > > On Tue, May 16, 2023 at 11:24 AM Song Shuai <[email protected]> wrote:
> > [...]
> >
> > But even though that may be the fix, I think we also need to fix that
> > in the kernel as it would break compatibility with certain versions of
> > openSBI *if* we fix openSBI...So here are a few solutions:
> >
> > 1. we can mark all "mmode_resv" nodes in the device tree as nomap,
> > before the linear mapping is established (IIUC, those nodes are added
> > by openSBI to advertise PMP regions)
> > -> This amounts to the same fix as opensbi and we lose the 1GB hugepage.
>
> AFAIU, losing the 1 GB hugepage is a regression, which would make this
> not an option, right?

Not sure this is a real regression, I'd rather avoid it, but as
mentioned in my first answer, Mike Rapoport showed that it was making
no difference performance-wise...

>
> > 2. we can tweak pfn_is_nosave function to *not* save pfn corresponding
> > to PMP regions
> > -> We don't lose the 1GB hugepage \o/
> > 3. we can use register_nosave_region() to not save the "mmode_resv"
> > regions (x86 does that
> > https://elixir.bootlin.com/linux/v6.4-rc1/source/arch/x86/kernel/e820.c#L753)
> > -> We don't lose the 1GB hugepage \o/
> > 4. Given JeeHeng pointer to
> > https://elixir.bootlin.com/linux/v6.4-rc1/source/kernel/power/snapshot.c#L1340,
> > we can mark those pages as non-readable and make the hibernation
> > process not save those pages
> > -> Very late-in-the-day idea, not sure what it's worth, we also
> > lose the 1GB hugepage...
>
> Ditto here re: introducing another regression.
>
> > To me, the best solution is 3 as it would prepare for other similar
> > issues later, it is similar to x86 and it allows us to keep 1GB
> > hugepages.
> >
> > I have been thinking, and to me nomap does not provide anything since
> > the kernel should not address this memory range, so if it does, we
> > must fix the kernel.
> >
> > Let me know what you all think, I'll be preparing a PoC of 3 in the meantime!
>
> #3 would probably get my vote too. It seems like you could use it
> dynamically if there was to be a future other provider of "mmode_resv"
> regions, rather than doing something location-specific.
>
> We should probably document these opensbi reserved memory nodes though
> in a dt-binding or w/e if we are going to be relying on them to not
> crash!

Yes, you're right, let's see what Atish and Anup think!

Thanks for your quick answers Conor and Song, really appreciated!

Alex

>
> Thanks for working on this,
> Conor.
>

2023-05-18 04:24:32

by Sia Jee Heng

Subject: RE: Bug report: kernel panicked when system hibernates



> -----Original Message-----
> From: Alexandre Ghiti <[email protected]>
> Sent: Wednesday, May 17, 2023 4:33 PM
> To: JeeHeng Sia <[email protected]>
> Cc: Song Shuai <[email protected]>; [email protected]; Andrew Jones <[email protected]>; [email protected];
> [email protected]; Leyfoon Tan <[email protected]>; Mason Huo <[email protected]>; Paul Walmsley
> <[email protected]>; Conor Dooley <[email protected]>; Guo Ren <[email protected]>; linux-
> [email protected]; [email protected]
> Subject: Re: Bug report: kernel panicked when system hibernates
>
> On Tue, May 16, 2023 at 1:27 PM JeeHeng Sia
> <[email protected]> wrote:
> >
> >
> >
> > > -----Original Message-----
> > > From: Alexandre Ghiti <[email protected]>
> > > Sent: Tuesday, May 16, 2023 7:15 PM
> > > To: JeeHeng Sia <[email protected]>
> > > Cc: Song Shuai <[email protected]>; [email protected]; Andrew Jones <[email protected]>; [email protected];
> > > [email protected]; Leyfoon Tan <[email protected]>; Mason Huo <[email protected]>; Paul Walmsley
> > > <[email protected]>; Conor Dooley <[email protected]>; Guo Ren <[email protected]>; linux-
> > > [email protected]; [email protected]
> > > Subject: Re: Bug report: kernel panicked when system hibernates
> > >
> > > Hi JeeHeng,
> > >
> > > On Tue, May 16, 2023 at 11:55 AM JeeHeng Sia
> > > <[email protected]> wrote:
> > > >
> > > > Hi Song,
> > > >
> > > > Thanks for the investigation. Indeed, the exposure of the PMP reserved region to the kernel page table is causing the problem.
> > > > Here is the similar report: https://groups.google.com/u/0/a/groups.riscv.org/g/sw-dev/c/ITXwaKfA6z8
> > >
> > > IMO, we should discuss the kernel related stuff on the linux riscv ML,
> > > I'm not subscribed to the group above and you did not answer my last
> > > direct emails regarding this problem either.
> > Hi Alex, it's strange that I haven't received a reply from you.
>
> I should stop using my personal email as it gets more and more
> blocked, I don't know why.
>
> > Seems like I forgot to add the mailing list back.
> > By the way, the reason hibernation fails with the recent page table changes is that, at the start of the
> > hibernation process, the hibernation core walks the page table to check whether each page can be saved. Since the
> > page is reserved yet present in the page table, the kernel tries to access the PMP region, which causes the
> > kernel panic. The code can be found in kernel/power/snapshot.c, line #1340:
> > if (PageReserved(page) && (!kernel_page_present(page) || pfn_is_nosave(pfn)))
>
> Thanks for this pointer, it really helps!
np.
>
> > If the virtual memory is mapped in the kernel page table, then the hibernation core will try to access the page.
> >
> > > Thanks,
> > >
> > > Alex
> > >
> > > >
> > > > Thanks
> > > > Regards
> > > > Jee Heng
>
> Is your issue also related to hibernation?
My issue is related to memory testing/debugging. I am glad to see more and more people interested in the RISC-V Hibernation solution.
>
> > > >
> > > > > -----Original Message-----
> > > > > From: Song Shuai <[email protected]>
> > > > > Sent: Tuesday, May 16, 2023 5:24 PM
> > > > > To: [email protected]; [email protected]; Andrew Jones <[email protected]>; [email protected];
> > > > > [email protected]; JeeHeng Sia <[email protected]>; Leyfoon Tan <[email protected]>; Mason
> Huo
> > > > > <[email protected]>; Paul Walmsley <[email protected]>; Conor Dooley <[email protected]>;
> > > Guo
> > > > > Ren <[email protected]>
> > > > > Cc: [email protected]; [email protected]
> > > > > Subject: Bug report: kernel panicked when system hibernates
> > > > >
> > > > > Description of problem:
> > > > >
> > > > > The latest hibernation support[1] of RISC-V Linux produced a kernel panic.
> > > > > The entire log has been posted at this link: https://termbin.com/sphl .
> > > > >
> > > > > How reproducible:
> > > > >
> > > > > You can reproduce it with the following step :
> > > > >
> > > > > 1. prepare the environment with
> > > > > - Qemu-virt v8.0.0 (with OpenSbi v1.2)
> > > > > - Linux v6.4-rc1
> > > > >
> > > > > 2. start the Qemu virt
> > > > > ```sh
> > > > > $ cat ~/8_riscv/start_latest.sh
> > > > > #!/bin/bash
> > > > > /home/song/8_riscv/3_acpi/qemu/ooo/usr/local/bin/qemu-system-riscv64 \
> > > > > -smp 2 -m 4G -nographic -machine virt \
> > > > > -kernel /home/song/9_linux/linux/00_rv_test/arch/riscv/boot/Image \
> > > > > -append "root=/dev/vda ro eaylycon=uart8250,mmio,0x10000000
> > > > > early_ioremap_debug console=ttyS0 loglevel=8 memblock=debug
> > > > > no_console_suspend audit=0 3" \
> > > > > -drive file=/home/song/8_riscv/fedora/stage4-disk.img,format=raw,id=hd0 \
> > > > > -device virtio-blk-device,drive=hd0 \
> > > > > -drive file=/home/song/8_riscv/fedora/adisk.qcow2,format=qcow2,id=hd1 \
> > > > > -device virtio-blk-device,drive=hd1 \
> > > > > -gdb tcp::1236 #-S
> > > > > ```
> > > > > 3. execute hibernation
> > > > >
> > > > > ```sh
> > > > > swapon /dev/vdb2 # this is my swap disk
> > > > >
> > > > > echo disk > /sys/power/state
> > > > > ```
> > > > >
> > > > > 4. Then you will encounter the kernel panic logged in the above link
> > > > >
> > > > >
> > > > > Other Information:
> > > > >
> > > > > After my initial (and incomplete) investigation, commit 3335068f8721
> > > > > ("riscv: Use PUD/P4D/PGD pages for the linear mapping")[2]
> > > > > is closely related to this panic. This commit uses the re-defined
> > > > > `MIN_MEMBLOCK_ADDR` to discover the entire system memory
> > > > > and changes the base of `va_pa_offset` from `kernel_map.phys_addr` to
> > > > > `phys_ram_base` for the linear memory mapping.
> > > > >
> > > > > If the firmware delivers its reserved memory region (e.g. a
> > > > > PMP-protected region in OpenSBI) without the "no-map" property,
> > > > > this commit results in the firmware memory being directly mapped by
> > > > > `create_linear_mapping_page_table()`.
> > > > >
> > > > > We can see the mapping via ptdump :
> > > > > ```
> > > > > ---[ Linear mapping ]---
> > > > > 0xff60000000000000-0xff60000000200000 0x0000000080000000 2M PMD D A G . . W R V ------------- the firmware memory
> > > > > 0xff60000000200000-0xff60000000c00000 0x0000000080200000 10M PMD D A G . . . R V
> > > > > 0xff60000000c00000-0xff60000001000000 0x0000000080c00000 4M PMD D A G . . W R V
> > > > > 0xff60000001000000-0xff60000001600000 0x0000000081000000 6M PMD D A G . . . R V
> > > > > 0xff60000001600000-0xff60000040000000 0x0000000081600000 1002M PMD D A G . . W R V
> > > > > 0xff60000040000000-0xff60000100000000 0x00000000c0000000 3G PUD D A G . . W R V
> > > > > ---[ Modules/BPF mapping ]---
> > > > > ---[ Kernel mapping ]---
> > > > > 0xffffffff80000000-0xffffffff80a00000 0x0000000080200000 10M PMD D A G . X . R V
> > > > > 0xffffffff80a00000-0xffffffff80c00000 0x0000000080c00000 2M PMD D A G . . . R V
> > > > > 0xffffffff80c00000-0xffffffff80e00000 0x0000000080e00000 2M PMD D A G . . W R V
> > > > > 0xffffffff80e00000-0xffffffff81400000 0x0000000081000000 6M PMD D A G . . . R V
> > > > > 0xffffffff81400000-0xffffffff81800000 0x0000000081600000 4M PMD
> > > > > ```
> > > > >
> > > > > In the hibernation process, `swsusp_save()` calls
> > > > > `copy_data_pages(&copy_bm, &orig_bm)` to copy the pages tracked by
> > > > > these two memory bitmaps; the Oops (load access fault) occurred
> > > > > while copying the page at `PAGE_OFFSET` (which maps the firmware
> > > > > memory).
> > > > >
> > > > > I also did two other tests:
> > > > > Test1:
> > > > >
> > > > > The hibernation works well in the kernel with the commit 3335068f8721
> > > > > reverted at least in the current environment.
> > > > >
> > > > > Test2:
> > > > >
> > > > > I built a simple kernel module that reads the value at the
> > > > > `PAGE_OFFSET` address, and the same load-access-fault panic
> > > > > occurred, so hibernation is not the only way to trigger this
> > > > > panic.
> > > > >
> > > > > Finally, should we always mark the firmware memory with the
> > > > > `MEMBLOCK_NOMAP` flag, through changes in Linux or OpenSBI (at
> > > > > least in the current environment), or are there other suggestions?
> > > > >
> > > > > Please correct me if I'm wrong.
> > > > >
> > > > > [1]: https://lore.kernel.org/r/[email protected]
> > > > > [2]: https://lore.kernel.org/r/[email protected]
> > > > >
> > > > > --
> > > > > Thanks,
> > > > > Song

2023-05-18 07:16:20

by Anup Patel

[permalink] [raw]
Subject: Re: Bug report: kernel panicked when system hibernates

On Wed, May 17, 2023 at 8:26 PM Alexandre Ghiti <[email protected]> wrote:
>
> On Wed, May 17, 2023 at 1:28 PM Conor Dooley <[email protected]> wrote:
> >
> > Hey Alex,
> >
> > On Wed, May 17, 2023 at 10:58:02AM +0200, Alexandre Ghiti wrote:
> > > On Tue, May 16, 2023 at 1:12 PM Alexandre Ghiti <[email protected]> wrote:
> >
> > > > On Tue, May 16, 2023 at 11:24 AM Song Shuai <[email protected]> wrote:
> > > > I actually removed this flag a few years ago, and I have to admit that
> > > > I need to check if that's necessary: the goal of commit 3335068f8721
> > > > ("riscv: Use PUD/P4D/PGD pages for the linear mapping") is to expose
> > > > the "right" start of DRAM so that we can align virtual and physical
> > > > addresses on a 1GB boundary.
> > > >
> > > > So I have to check if a nomap region is actually added as a
> > > > memblock.memory.regions[] or not: if yes, that's perfect, let's add
> > > > the nomap attributes to the PMP regions, otherwise, I don't think that
> > > > is a good solution.
> > >
> > > So here is the current linear mapping without nomap in openSBI:
> > >
> > > ---[ Linear mapping ]---
> > > 0xff60000000000000-0xff60000000200000 0x0000000080000000 2M PMD D A G . . W R V
> > > 0xff60000000200000-0xff60000000e00000 0x0000000080200000 12M PMD D A G . . . R V
> > >
> > > And below the linear mapping with nomap in openSBI:
> > >
> > > ---[ Linear mapping ]---
> > > 0xff60000000080000-0xff60000000200000 0x0000000080080000 1536K PTE D A G . . W R V
> > > 0xff60000000200000-0xff60000000e00000 0x0000000080200000 12M PMD D A G . . . R V
> > >
> > > So adding nomap does not misalign virtual and physical addresses, it
> > > prevents the usage of 1GB page for this area though, so that's a
> > > solution, we just lose this 1GB page here.
> > >
> > > But even though that may be the fix, I think we also need to fix that
> > > in the kernel as it would break compatibility with certain versions of
> > > openSBI *if* we fix openSBI...So here are a few solutions:
> > >
> > > 1. we can mark all "mmode_resv" nodes in the device tree as nomap,
> > > before the linear mapping is established (IIUC, those nodes are added
> > > by openSBI to advertise PMP regions)
> > > -> This amounts to the same fix as opensbi and we lose the 1GB hugepage.
> >
> > AFAIU, losing the 1 GB hugepage is a regression, which would make this
> > not an option, right?
>
> Not sure this is a real regression, I'd rather avoid it, but as
> mentioned in my first answer, Mike Rapoport showed that it was making
> no difference performance-wise...
>
> >
> > > 2. we can tweak pfn_is_nosave function to *not* save pfn corresponding
> > > to PMP regions
> > > -> We don't lose the 1GB hugepage \o/
> > > 3. we can use register_nosave_region() to not save the "mmode_resv"
> > > regions (x86 does that
> > > https://elixir.bootlin.com/linux/v6.4-rc1/source/arch/x86/kernel/e820.c#L753)
> > > -> We don't lose the 1GB hugepage \o/
> > > 4. Given JeeHeng pointer to
> > > https://elixir.bootlin.com/linux/v6.4-rc1/source/kernel/power/snapshot.c#L1340,
> > > we can mark those pages as non-readable and make the hibernation
> > > process not save those pages
> > > -> Very late-in-the-day idea, not sure what it's worth, we also
> > > lose the 1GB hugepage...
> >
> > Ditto here re: introducing another regression.
> >
> > > To me, the best solution is 3 as it would prepare for other similar
> > > issues later, it is similar to x86 and it allows us to keep 1GB
> > > hugepages.
> > >
> > > I have been thinking, and to me nomap does not provide anything since
> > > the kernel should not address this memory range, so if it does, we
> > > must fix the kernel.
> > >
> > > Let me know what you all think, I'll be preparing a PoC of 3 in the meantime!
> >
> > #3 would probably get my vote too. It seems like you could use it
> > dynamically if there was to be a future other provider of "mmode_resv"
> > regions, rather than doing something location-specific.
> >
> > We should probably document these opensbi reserved memory nodes though
> > in a dt-binding or w/e if we are going to be relying on them to not
> > crash!

Depending on a particular node name is fragile. If we really need
information from DT then I suggest adding "no-save-restore" DT
property in reserved memory nodes.

>
> Yes, you're right, let's see what Atish and Anup think!

I think we have two possible approaches:

1) Update OpenSBI to set "no-map" DT property for firmware
reserved regions. We were doing this previously but removed
it later for performance reasons mentioned by Alex. It is also
worth mentioning that ARM Trusted Firmware also sets "no-map"
DT property for firmware reserved regions.

2) Add a new "no-save-restore" DT property in the reserved
memory DT bindings. The hibernate support of Linux arch/riscv
will use this DT property to exclude memory regions from
save-restore. The EFI implementations in EDK2 and U-Boot
should do the following:
   a) Treat all memory having the "no-map" DT property as EFI
   reserved memory
   b) Treat all memory having neither the "no-map" nor the
   "no-save-restore" DT property as EfiBootServicesData
   c) Treat all memory not having the "no-map" DT property but
   having the "no-save-restore" DT property as EfiRuntimeServicesData
(Refer,
https://devicetree-specification.readthedocs.io/en/latest/chapter3-devicenodes.html#reserved-memory-and-uefi)

Personally, I am leaning towards approach#1 since approach#2
will require changing DeviceTree specification as well.
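
For concreteness, the two approaches could look like this in a generated
device tree (an illustrative sketch: the node names and addresses follow the
OpenSBI "mmode_resv" convention seen in this thread, while the
"no-save-restore" property is hypothetical and not in the DeviceTree
specification today):

```dts
reserved-memory {
	#address-cells = <2>;
	#size-cells = <2>;
	ranges;

	/* Approach 1: firmware marks its PMP-protected region "no-map",
	 * so the kernel never puts it in the linear mapping. */
	mmode_resv0@80000000 {
		reg = <0x0 0x80000000 0x0 0x200000>;
		no-map;
	};

	/* Approach 2 (hypothetical property, illustrative address): the
	 * region stays mapped but is excluded from hibernation
	 * save-restore. */
	mmode_resv1@80200000 {
		reg = <0x0 0x80200000 0x0 0x80000>;
		no-save-restore;
	};
};
```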

Regards,
Anup

>
> Thanks for your quick answers Conor and Song, really appreciated!
>
> Alex
>
> >
> > Thanks for working on this,
> > Conor.
> >

2023-05-18 08:17:10

by Conor Dooley

[permalink] [raw]
Subject: Re: Bug report: kernel panicked when system hibernates

Hey Alex, Anup,

On Thu, May 18, 2023 at 12:23:59PM +0530, Anup Patel wrote:
> On Wed, May 17, 2023 at 8:26 PM Alexandre Ghiti <[email protected]> wrote:
> > On Wed, May 17, 2023 at 1:28 PM Conor Dooley <[email protected]> wrote:
> > > On Wed, May 17, 2023 at 10:58:02AM +0200, Alexandre Ghiti wrote:
> > > > On Tue, May 16, 2023 at 1:12 PM Alexandre Ghiti <[email protected]> wrote:
> > >
> > > > > On Tue, May 16, 2023 at 11:24 AM Song Shuai <[email protected]> wrote:
> > > > > I actually removed this flag a few years ago, and I have to admit that
> > > > > I need to check if that's necessary: the goal of commit 3335068f8721
> > > > > ("riscv: Use PUD/P4D/PGD pages for the linear mapping") is to expose
> > > > > the "right" start of DRAM so that we can align virtual and physical
> > > > > addresses on a 1GB boundary.
> > > > >
> > > > > So I have to check if a nomap region is actually added as a
> > > > > memblock.memory.regions[] or not: if yes, that's perfect, let's add
> > > > > the nomap attributes to the PMP regions, otherwise, I don't think that
> > > > > is a good solution.
> > > >
> > > > So here is the current linear mapping without nomap in openSBI:
> > > >
> > > > ---[ Linear mapping ]---
> > > > 0xff60000000000000-0xff60000000200000 0x0000000080000000 2M
> > > > PMD D A G . . W R V
> > > > 0xff60000000200000-0xff60000000e00000 0x0000000080200000 12M
> > > > PMD D A G . . . R V
> > > >
> > > > And below the linear mapping with nomap in openSBI:
> > > >
> > > > ---[ Linear mapping ]---
> > > > 0xff60000000080000-0xff60000000200000 0x0000000080080000 1536K
> > > > PTE D A G . . W R V
> > > > 0xff60000000200000-0xff60000000e00000 0x0000000080200000 12M
> > > > PMD D A G . . . R V
> > > >
> > > > So adding nomap does not misalign virtual and physical addresses, it
> > > > prevents the usage of 1GB page for this area though, so that's a
> > > > solution, we just lose this 1GB page here.
> > > >
> > > > But even though that may be the fix, I think we also need to fix that
> > > > in the kernel as it would break compatibility with certain versions of
> > > > openSBI *if* we fix openSBI...So here are a few solutions:
> > > >
> > > > 1. we can mark all "mmode_resv" nodes in the device tree as nomap,
> > > > before the linear mapping is established (IIUC, those nodes are added
> > > > by openSBI to advertise PMP regions)
> > > > -> This amounts to the same fix as opensbi and we lose the 1GB hugepage.
> > >
> > > AFAIU, losing the 1 GB hugepage is a regression, which would make this
> > > not an option, right?
> >
> > Not sure this is a real regression, I'd rather avoid it, but as
> > mentioned in my first answer, Mike Rapoport showed that it was making
> > no difference performance-wise...

My point was that if someone has hugepages enabled & we handle this in a
way that causes the first hugepage to be unusable, is that not a
regression? Whether hugepages provide a performance benefit is not
really related to that question AFAICT.

Were you suggesting reverting hugepage support entirely in your original
message? If we entirely remove hugepage support, then I guess the first
hugepage cannot be lost..

> > > > 2. we can tweak pfn_is_nosave function to *not* save pfn corresponding
> > > > to PMP regions
> > > > -> We don't lose the 1GB hugepage \o/
> > > > 3. we can use register_nosave_region() to not save the "mmode_resv"
> > > > regions (x86 does that
> > > > https://elixir.bootlin.com/linux/v6.4-rc1/source/arch/x86/kernel/e820.c#L753)
> > > > -> We don't lose the 1GB hugepage \o/
> > > > 4. Given JeeHeng pointer to
> > > > https://elixir.bootlin.com/linux/v6.4-rc1/source/kernel/power/snapshot.c#L1340,
> > > > we can mark those pages as non-readable and make the hibernation
> > > > process not save those pages
> > > > -> Very late-in-the-day idea, not sure what it's worth, we also
> > > > lose the 1GB hugepage...
> > >
> > > Ditto here re: introducing another regression.
> > >
> > > > To me, the best solution is 3 as it would prepare for other similar
> > > > issues later, it is similar to x86 and it allows us to keep 1GB
> > > > hugepages.
> > > >
> > > > I have been thinking, and to me nomap does not provide anything since
> > > > the kernel should not address this memory range, so if it does, we
> > > > must fix the kernel.
> > > >
> > > > Let me know what you all think, I'll be preparing a PoC of 3 in the meantime!
> > >
> > > #3 would probably get my vote too. It seems like you could use it
> > > dynamically if there was to be a future other provider of "mmode_resv"
> > > regions, rather than doing something location-specific.
> > >
> > > We should probably document these opensbi reserved memory nodes though
> > > in a dt-binding or w/e if we are going to be relying on them to not
> > > crash!
>
> Depending on a particular node name is fragile. If we really need
> information from DT then I suggest adding "no-save-restore" DT
> property in reserved memory nodes.

We can add whatever properties we like, but where does that leave us for
the systems in the wild where their reserved memory nodes do not contain
a "no-save-restore" property or "no-map"?

Ideally, yes, we do not depend on the node name and instead use explicit
properties - but I think we may be "forced" to fall back to checking the
node-name to cover the opensbi versions that do not contain one.
LMK if I have missed something there!

> > Yes, you're right, let's see what Atish and Anup think!
>
> I think we have two possible approaches:
>
> 1) Update OpenSBI to set "no-map" DT property for firmware
> reserved regions. We were doing this previously but removed
> it later for performance reasons mentioned by Alex. It is also
> worth mentioning that ARM Trusted Firmware also sets "no-map"
> DT property for firmware reserved regions.
>
> 2) Add a new "no-save-restore" DT property in the reserved
> memory DT bindings. The hibernate support of Linux arch/riscv
> will use this DT property to exclude memory regions from
> save-restore. The EFI implementation of EDK2 and U-Boot
> should do the following:
> 1) Treat all memory having "no-map" DT property as EFI
> reserved memory
> 2) Treat all memory not having "no-map" DT property and
> not having "no-save-restore" DT property as EfiBootServicesData
> 3) Treat all memory not having "no-map" DT property and
> having "no-save-restore" DT property as EfiRuntimeServiceData
> (Refer,
> https://devicetree-specification.readthedocs.io/en/latest/chapter3-devicenodes.html#reserved-memory-and-uefi)
>
> Personally, I am leaning towards approach#1 since approach#2
> will require changing DeviceTree specification as well.

#1 is by far the simpler option of the two, if the consensus is that the
loss of the first hugepage is not a problem (or if it is a problem that
realistically is unavoidable).

There's something else I think I might be missing here, given the
scattered nature of the reporting. This is not a problem for a system
that does not implement hibernation, which was only added in v6.4-rc1?

That would make it not a regression after all. I think I misunderstood
the report on sw-dev to mean that this was a problem generally after
v6.4-rc1, which would have been one. Could someone please confirm that?

If it only affects hibernation, and is not a regression, should we make
ARCH_HIBERNATION_POSSIBLE def_bool n in Kconfig until we have agreed on
a solution for the problem?

Thanks,
Conor.



2023-05-18 09:06:15

by Alexandre Ghiti

[permalink] [raw]
Subject: Re: Bug report: kernel panicked when system hibernates

Hi Conor, Anup!

On Thu, May 18, 2023 at 10:00 AM Conor Dooley
<[email protected]> wrote:
>
> Hey Alex, Anup,
>
> On Thu, May 18, 2023 at 12:23:59PM +0530, Anup Patel wrote:
> > On Wed, May 17, 2023 at 8:26 PM Alexandre Ghiti <[email protected]> wrote:
> > > On Wed, May 17, 2023 at 1:28 PM Conor Dooley <[email protected]> wrote:
> > > > On Wed, May 17, 2023 at 10:58:02AM +0200, Alexandre Ghiti wrote:
> > > > > On Tue, May 16, 2023 at 1:12 PM Alexandre Ghiti <[email protected]> wrote:
> > > >
> > > > > > On Tue, May 16, 2023 at 11:24 AM Song Shuai <[email protected]> wrote:
> > > > > > I actually removed this flag a few years ago, and I have to admit that
> > > > > > I need to check if that's necessary: the goal of commit 3335068f8721
> > > > > > ("riscv: Use PUD/P4D/PGD pages for the linear mapping") is to expose
> > > > > > the "right" start of DRAM so that we can align virtual and physical
> > > > > > addresses on a 1GB boundary.
> > > > > >
> > > > > > So I have to check if a nomap region is actually added as a
> > > > > > memblock.memory.regions[] or not: if yes, that's perfect, let's add
> > > > > > the nomap attributes to the PMP regions, otherwise, I don't think that
> > > > > > is a good solution.
> > > > >
> > > > > So here is the current linear mapping without nomap in openSBI:
> > > > >
> > > > > ---[ Linear mapping ]---
> > > > > 0xff60000000000000-0xff60000000200000 0x0000000080000000 2M
> > > > > PMD D A G . . W R V
> > > > > 0xff60000000200000-0xff60000000e00000 0x0000000080200000 12M
> > > > > PMD D A G . . . R V
> > > > >
> > > > > And below the linear mapping with nomap in openSBI:
> > > > >
> > > > > ---[ Linear mapping ]---
> > > > > 0xff60000000080000-0xff60000000200000 0x0000000080080000 1536K
> > > > > PTE D A G . . W R V
> > > > > 0xff60000000200000-0xff60000000e00000 0x0000000080200000 12M
> > > > > PMD D A G . . . R V
> > > > >
> > > > > So adding nomap does not misalign virtual and physical addresses, it
> > > > > prevents the usage of 1GB page for this area though, so that's a
> > > > > solution, we just lose this 1GB page here.
> > > > >
> > > > > But even though that may be the fix, I think we also need to fix that
> > > > > in the kernel as it would break compatibility with certain versions of
> > > > > openSBI *if* we fix openSBI...So here are a few solutions:
> > > > >
> > > > > 1. we can mark all "mmode_resv" nodes in the device tree as nomap,
> > > > > before the linear mapping is established (IIUC, those nodes are added
> > > > > by openSBI to advertise PMP regions)
> > > > > -> This amounts to the same fix as opensbi and we lose the 1GB hugepage.
> > > >
> > > > AFAIU, losing the 1 GB hugepage is a regression, which would make this
> > > > not an option, right?
> > >
> > > Not sure this is a real regression, I'd rather avoid it, but as
> > > mentioned in my first answer, Mike Rapoport showed that it was making
> > > no difference performance-wise...
>
> My point was that if someone has hugepages enabled & we handle this in a
> way that causes the first hugepage to be unusable, is that not a
> regression? Whether hugepages provide a performance benefit is not
> really related to that question AFAICT.

Not being able to map certain regions of the linear mapping with a 1GB
hugepage will happen, for example the kernel mapping is protected in
the linear mapping so that it can't be written: so we can only map
this region with 2MB hugepages. A firmware could mark a region as
"no-map" and there again we would not be able to use a 1GB hugepage. I
don't see that as a regression as the intention is not to *always* use
1GB hugepages, but rather to use them when possible. Does that make
sense?

>
> Were you suggesting reverting hugepage support entirely in your original
> message? If we entirely remove hugepage support, then I guess the first
> hugepage cannot be lost..

Given Mike Rapoport's recent talk, I think that's an option, yes.

>
> > > > > 2. we can tweak pfn_is_nosave function to *not* save pfn corresponding
> > > > > to PMP regions
> > > > > -> We don't lose the 1GB hugepage \o/
> > > > > 3. we can use register_nosave_region() to not save the "mmode_resv"
> > > > > regions (x86 does that
> > > > > https://elixir.bootlin.com/linux/v6.4-rc1/source/arch/x86/kernel/e820.c#L753)
> > > > > -> We don't lose the 1GB hugepage \o/
> > > > > 4. Given JeeHeng pointer to
> > > > > https://elixir.bootlin.com/linux/v6.4-rc1/source/kernel/power/snapshot.c#L1340,
> > > > > we can mark those pages as non-readable and make the hibernation
> > > > > process not save those pages
> > > > > -> Very late-in-the-day idea, not sure what it's worth, we also
> > > > > lose the 1GB hugepage...
> > > >
> > > > Ditto here re: introducing another regression.
> > > >
> > > > > To me, the best solution is 3 as it would prepare for other similar
> > > > > issues later, it is similar to x86 and it allows us to keep 1GB
> > > > > hugepages.
> > > > >
> > > > > I have been thinking, and to me nomap does not provide anything since
> > > > > the kernel should not address this memory range, so if it does, we
> > > > > must fix the kernel.
> > > > >
> > > > > Let me know what you all think, I'll be preparing a PoC of 3 in the meantime!
> > > >
> > > > #3 would probably get my vote too. It seems like you could use it
> > > > dynamically if there was to be a future other provider of "mmode_resv"
> > > > regions, rather than doing something location-specific.
> > > >
> > > > We should probably document these opensbi reserved memory nodes though
> > > > in a dt-binding or w/e if we are going to be relying on them to not
> > > > crash!
> >
> > Depending on a particular node name is fragile. If we really need
> > information from DT then I suggest adding "no-save-restore" DT
> > property in reserved memory nodes.
>
> We can add whatever properties we like, but where does that leave us for
> the systems in the wild where their reserved memory nodes do not contain
> a "no-save-restore" property or "no-map"?
>
> Ideally, yes, we do not depend on the node name and instead use explicit
> properties - but I think we may be "forced" to fall back to checking the
> node-name to cover the opensbi versions that do not contain one.
> LMK if I have missed something there!

Yes I agree with you, we can implement Anup's solution #1, but we need
to fix the kernel anyway since if we don't, that would make the kernel
hibernation support depend on a certain version of openSBI.

>
> > > Yes, you're right, let's see what Atish and Anup think!
> >
> > I think we have two possible approaches:
> >
> > 1) Update OpenSBI to set "no-map" DT property for firmware
> > reserved regions. We were doing this previously but removed
> > it later for performance reasons mentioned by Alex. It is also
> > worth mentioning that ARM Trusted Firmware also sets "no-map"
> > DT property for firmware reserved regions.
> >
> > 2) Add a new "no-save-restore" DT property in the reserved
> > memory DT bindings. The hibernate support of Linux arch/riscv
> > will use this DT property to exclude memory regions from
> > save-restore. The EFI implementation of EDK2 and U-Boot
> > should do the following:
> > 1) Treat all memory having "no-map" DT property as EFI
> > reserved memory
> > 2) Treat all memory not having "no-map" DT property and
> > not having "no-save-restore" DT property as EfiBootServicesData
> > 3) Treat all memory not having "no-map" DT property and
> > having "no-save-restore" DT property as EfiRuntimeServiceData
> > (Refer,
> > https://devicetree-specification.readthedocs.io/en/latest/chapter3-devicenodes.html#reserved-memory-and-uefi)
> >
> > Personally, I am leaning towards approach#1 since approach#2
> > will require changing DeviceTree specification as well.
>
> #1 is by far the simpler option of the two, if the consensus is that the
> loss of the first hugepage is not a problem (or if it is a problem that
> realistically is unavoidable).

The "no-map" property does not provide much security anyway: the
kernel should not touch a page that is reserved (this is where I may
be wrong), so the real fix to this issue is to make the hibernation
process not save those pages.

>
> There's something else I think I might be missing here, given the
> scattered nature of the reporting. This is not a problem for a system
> that does not implement hibernation, which was only added in v6.4-rc1?
>
> That would make it not a regression after all. I think I misunderstood
> the report on sw-dev to mean that this was a problem generally after
> v6.4-rc1, which would have been one. Could someone please confirm that?

The problem has only been present since v6.4-rc1, but it's not a
regression: both patches simply landed at the same time and together
gave rise to this issue.

>
> If it only affects hibernation, and is not a regression, should we make
> ARCH_HIBERNATION_POSSIBLE def_bool n in Kconfig until we have agreed on
> a solution for the problem?
>
> Thanks,
> Conor.

2023-05-18 10:45:47

by Conor Dooley

[permalink] [raw]
Subject: Re: Bug report: kernel panicked when system hibernates

On Thu, May 18, 2023 at 10:41:19AM +0200, Alexandre Ghiti wrote:
> On Thu, May 18, 2023 at 10:00 AM Conor Dooley
> <[email protected]> wrote:
> > On Thu, May 18, 2023 at 12:23:59PM +0530, Anup Patel wrote:
> > > On Wed, May 17, 2023 at 8:26 PM Alexandre Ghiti <[email protected]> wrote:
> > > > On Wed, May 17, 2023 at 1:28 PM Conor Dooley <[email protected]> wrote:
> > > > > On Wed, May 17, 2023 at 10:58:02AM +0200, Alexandre Ghiti wrote:
> > > > > > On Tue, May 16, 2023 at 1:12 PM Alexandre Ghiti <[email protected]> wrote:
> > > > >
> > > > > > > On Tue, May 16, 2023 at 11:24 AM Song Shuai <[email protected]> wrote:
> > > > > > > I actually removed this flag a few years ago, and I have to admit that
> > > > > > > I need to check if that's necessary: the goal of commit 3335068f8721
> > > > > > > ("riscv: Use PUD/P4D/PGD pages for the linear mapping") is to expose
> > > > > > > the "right" start of DRAM so that we can align virtual and physical
> > > > > > > addresses on a 1GB boundary.
> > > > > > >
> > > > > > > So I have to check if a nomap region is actually added as a
> > > > > > > memblock.memory.regions[] or not: if yes, that's perfect, let's add
> > > > > > > the nomap attributes to the PMP regions, otherwise, I don't think that
> > > > > > > is a good solution.
> > > > > >
> > > > > > So here is the current linear mapping without nomap in openSBI:
> > > > > >
> > > > > > ---[ Linear mapping ]---
> > > > > > 0xff60000000000000-0xff60000000200000 0x0000000080000000 2M
> > > > > > PMD D A G . . W R V
> > > > > > 0xff60000000200000-0xff60000000e00000 0x0000000080200000 12M
> > > > > > PMD D A G . . . R V
> > > > > >
> > > > > > And below the linear mapping with nomap in openSBI:
> > > > > >
> > > > > > ---[ Linear mapping ]---
> > > > > > 0xff60000000080000-0xff60000000200000 0x0000000080080000 1536K
> > > > > > PTE D A G . . W R V
> > > > > > 0xff60000000200000-0xff60000000e00000 0x0000000080200000 12M
> > > > > > PMD D A G . . . R V
> > > > > >
> > > > > > So adding nomap does not misalign virtual and physical addresses, it
> > > > > > prevents the usage of 1GB page for this area though, so that's a
> > > > > > solution, we just lose this 1GB page here.
> > > > > >
> > > > > > But even though that may be the fix, I think we also need to fix that
> > > > > > in the kernel as it would break compatibility with certain versions of
> > > > > > openSBI *if* we fix openSBI...So here are a few solutions:
> > > > > >
> > > > > > 1. we can mark all "mmode_resv" nodes in the device tree as nomap,
> > > > > > before the linear mapping is established (IIUC, those nodes are added
> > > > > > by openSBI to advertise PMP regions)
> > > > > > -> This amounts to the same fix as opensbi and we lose the 1GB hugepage.
> > > > >
> > > > > AFAIU, losing the 1 GB hugepage is a regression, which would make this
> > > > > not an option, right?
> > > >
> > > > Not sure this is a real regression, I'd rather avoid it, but as
> > > > mentioned in my first answer, Mike Rapoport showed that it was making
> > > > no difference performance-wise...
> >
> > My point was that if someone has hugepages enabled & we handle this in a
> > way that causes the first hugepage to be unusable, is that not a
> > regression? Whether hugepages provide a performance benefit is not
> > really related to that question AFAICT.
>
> Not being able to map certain regions of the linear mapping with a 1GB
> hugepage will happen, for example the kernel mapping is protected in
> the linear mapping so that it can't be written: so we can only map
> this region with 2MB hugepages. A firmware could mark a region as
> "no-map" and there again we would not be able to use a 1GB hugepage. I
> don't see that as a regression as the intention is not to *always* use
> 1GB hugepages, but rather to use them when possible. Does that make
> sense?

Ah, so it was as I expected - I don't/didn't properly understand
hugepages. Thanks.

> > Were you suggesting reverting hugepage support entirely in your original
> > message? If we entirely remove hugepage support, then I guess the first
> > hugepage cannot be lost..
>
> Given Mike Rapoport's recent talk, I think that's an option, yes.
>
> >
> > > > > > 2. we can tweak pfn_is_nosave function to *not* save pfn corresponding
> > > > > > to PMP regions
> > > > > > -> We don't lose the 1GB hugepage \o/
> > > > > > 3. we can use register_nosave_region() to not save the "mmode_resv"
> > > > > > regions (x86 does that
> > > > > > https://elixir.bootlin.com/linux/v6.4-rc1/source/arch/x86/kernel/e820.c#L753)
> > > > > > -> We don't lose the 1GB hugepage \o/
> > > > > > 4. Given JeeHeng pointer to
> > > > > > https://elixir.bootlin.com/linux/v6.4-rc1/source/kernel/power/snapshot.c#L1340,
> > > > > > we can mark those pages as non-readable and make the hibernation
> > > > > > process not save those pages
> > > > > > -> Very late-in-the-day idea, not sure what it's worth, we also
> > > > > > lose the 1GB hugepage...
> > > > >
> > > > > Ditto here re: introducing another regression.
> > > > >
> > > > > > To me, the best solution is 3 as it would prepare for other similar
> > > > > > issues later, it is similar to x86 and it allows us to keep 1GB
> > > > > > hugepages.
> > > > > >
> > > > > > I have been thinking, and to me nomap does not provide anything since
> > > > > > the kernel should not address this memory range, so if it does, we
> > > > > > must fix the kernel.
> > > > > >
> > > > > > Let me know what you all think, I'll be preparing a PoC of 3 in the meantime!
> > > > >
> > > > > #3 would probably get my vote too. It seems like you could use it
> > > > > dynamically if there was to be a future other provider of "mmode_resv"
> > > > > regions, rather than doing something location-specific.
> > > > >
> > > > > We should probably document these opensbi reserved memory nodes though
> > > > > in a dt-binding or w/e if we are going to be relying on them to not
> > > > > crash!
> > >
> > > Depending on a particular node name is fragile. If we really need
> > > information from DT then I suggest adding "no-save-restore" DT
> > > property in reserved memory nodes.
> >
> > We can add whatever properties we like, but where does that leave us for
> > the systems in the wild where their reserved memory nodes do not contain
> > a "no-save-restore" property or "no-map"?
> >
> > Ideally, yes, we do not depend on the node name and instead use explicit
> > properties - but I think we may be "forced" to fall back to checking the
> > node-name to cover the opensbi versions that do not contain one.
> > LMK if I have missed something there!
>
> Yes I agree with you, we can implement Anup's solution #1, but we need
> to fix the kernel anyway since if we don't, that would make the kernel
> hibernation support depend on a certain version of openSBI.

It's not unreasonable to have things depend on versions of the SBI
implementation, but it is if they're not things that can be probed using
the standard interfaces!

> > > > Yes, you're right, let's see what Atish and Anup think!
> > >
> > > I think we have two possible approaches:
> > >
> > > 1) Update OpenSBI to set "no-map" DT property for firmware
> > > reserved regions. We were doing this previously but removed
> > > it later for performance reasons mentioned by Alex. It is also
> > > worth mentioning that ARM Trusted Firmware also sets "no-map"
> > > DT property for firmware reserved regions.
> > >
> > > 2) Add a new "no-save-restore" DT property in the reserved
> > > memory DT bindings. The hibernate support of Linux arch/riscv
> > > will use this DT property to exclude memory regions from
> > > save-restore. The EFI implementation of EDK2 and U-Boot
> > > should do the following:
> > > 1) Treat all memory having "no-map" DT property as EFI
> > > reserved memory
> > > 2) Treat all memory not having "no-map" DT property and
> > > not having "no-save-restore" DT property as EfiBootServicesData
> > > 3) Treat all memory not having "no-map" DT property and
> > > having "no-save-restore" DT property as EfiRuntimeServiceData
> > > (Refer,
> > > https://devicetree-specification.readthedocs.io/en/latest/chapter3-devicenodes.html#reserved-memory-and-uefi)
> > >
> > > Personally, I am leaning towards approach#1 since approach#2
> > > will require changing DeviceTree specification as well.
> >
> > #1 is by far the simpler option of the two, if the consensus is that the
> > loss of the first hugepage is not a problem (or if it is a problem that
> > realistically is unavoidable).
>
> The "no-map" property does not provide much security anyway: the
> kernel should not touch a page that is reserved (this is where I may
> be wrong), so the real fix to this issue is to make the hibernation
> process not save those pages.

Right, the kernel clearly needs to be able to handle the regions. I, at
least, was commenting on re-using no-map versus creating new properties
for this situation.
I was advocating for re-using the property & changing the kernel so as
not to touch the regions during hibernation.

> > There's something else I think I might be missing here, given the
> > scattered nature of the reporting. This is not a problem for a system
> > that does not implement hibernation, which was only added in v6.4-rc1?
> >
> > That would make it not a regression after all. I think I misunderstood
> > the report on sw-dev to mean that this was a problem generally after
> > v6.4-rc1, which would have been one. Could someone please confirm that?
>
> The problem is only present since v6.4-rc1, that's not a regression,
> it's just that both patches landed at the same time and gave rise to
> this issue.

Sick. Glad to be wrong here!

#regzbot resolve: not a regression, feature introduced this cycle

> > If it only affects hibernation, and is not a regression, should we make
> > ARCH_HIBERNATION_POSSIBLE def_bool n in Kconfig until we have agreed on
> > a solution for the problem?

Any thoughts on this one?
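
For reference, the change floated in the question above would be a small tweak along these lines (a sketch against arch/riscv/Kconfig as of v6.4-rc1, not a tested patch):

```
config ARCH_HIBERNATION_POSSIBLE
	def_bool n
```

i.e. flipping the symbol's default until the nosave handling is agreed, so configurations cannot enable hibernation on a kernel that still snapshots PMP-protected memory.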



2023-05-18 12:36:04

by Alexandre Ghiti

Subject: Re: Bug report: kernel paniced when system hibernates

On Thu, May 18, 2023 at 12:35 PM Conor Dooley
<[email protected]> wrote:
>
> On Thu, May 18, 2023 at 10:41:19AM +0200, Alexandre Ghiti wrote:
> > On Thu, May 18, 2023 at 10:00 AM Conor Dooley
> > <[email protected]> wrote:
> > > On Thu, May 18, 2023 at 12:23:59PM +0530, Anup Patel wrote:
> > > > On Wed, May 17, 2023 at 8:26 PM Alexandre Ghiti <[email protected]> wrote:
> > > > > On Wed, May 17, 2023 at 1:28 PM Conor Dooley <[email protected]> wrote:
> > > > > > On Wed, May 17, 2023 at 10:58:02AM +0200, Alexandre Ghiti wrote:
> > > > > > > On Tue, May 16, 2023 at 1:12 PM Alexandre Ghiti <[email protected]> wrote:
> > > > > >
> > > > > > > > On Tue, May 16, 2023 at 11:24 AM Song Shuai <[email protected]> wrote:
> > > > > > > > I actually removed this flag a few years ago, and I have to admit that
> > > > > > > > I need to check if that's necessary: the goal of commit 3335068f8721
> > > > > > > > ("riscv: Use PUD/P4D/PGD pages for the linear mapping") is to expose
> > > > > > > > the "right" start of DRAM so that we can align virtual and physical
> > > > > > > > addresses on a 1GB boundary.
> > > > > > > >
> > > > > > > > So I have to check if a nomap region is actually added as a
> > > > > > > > memblock.memory.regions[] or not: if yes, that's perfect, let's add
> > > > > > > > the nomap attributes to the PMP regions, otherwise, I don't think that
> > > > > > > > is a good solution.
> > > > > > >
> > > > > > > So here is the current linear mapping without nomap in openSBI:
> > > > > > >
> > > > > > > ---[ Linear mapping ]---
> > > > > > > 0xff60000000000000-0xff60000000200000 0x0000000080000000 2M
> > > > > > > PMD D A G . . W R V
> > > > > > > 0xff60000000200000-0xff60000000e00000 0x0000000080200000 12M
> > > > > > > PMD D A G . . . R V
> > > > > > >
> > > > > > > And below the linear mapping with nomap in openSBI:
> > > > > > >
> > > > > > > ---[ Linear mapping ]---
> > > > > > > 0xff60000000080000-0xff60000000200000 0x0000000080080000 1536K
> > > > > > > PTE D A G . . W R V
> > > > > > > 0xff60000000200000-0xff60000000e00000 0x0000000080200000 12M
> > > > > > > PMD D A G . . . R V
> > > > > > >
> > > > > > > So adding nomap does not misalign virtual and physical addresses, it
> > > > > > > prevents the usage of 1GB page for this area though, so that's a
> > > > > > > solution, we just lose this 1GB page here.
> > > > > > >
> > > > > > > But even though that may be the fix, I think we also need to fix that
> > > > > > > in the kernel as it would break compatibility with certain versions of
> > > > > > > openSBI *if* we fix openSBI...So here are a few solutions:
> > > > > > >
> > > > > > > 1. we can mark all "mmode_resv" nodes in the device tree as nomap,
> > > > > > > before the linear mapping is established (IIUC, those nodes are added
> > > > > > > by openSBI to advertise PMP regions)
> > > > > > > -> This amounts to the same fix as opensbi and we lose the 1GB hugepage.
> > > > > >
> > > > > > AFAIU, losing the 1 GB hugepage is a regression, which would make this
> > > > > > not an option, right?
> > > > >
> > > > > Not sure this is a real regression, I'd rather avoid it, but as
> > > > > mentioned in my first answer, Mike Rapoport showed that it was making
> > > > > no difference performance-wise...
> > >
> > > My point was that if someone has hugepages enabled & we handle this in a
> > > way that causes the first hugepage to be unusable, is that not a
> > > regression? Whether hugepages provide a performance benefit is not
> > > really related to that question AFAICT.
> >
> > Not being able to map certain regions of the linear mapping with a 1GB
> > hugepage will happen, for example the kernel mapping is protected in
> > the linear mapping so that it can't be written: so we can only map
> > this region with 2MB hugepages. A firmware could mark a region as
> > "no-map" and there again we would not be able to use a 1GB hugepage. I
> > don't see that as a regression as the intention is not to *always* use
> > 1GB hugepages, but rather to use them when possible. Does that make
> > sense?
>
> Ah, so it was as I expected - I don't/didn't properly understand
> hugepages. Thanks.
>
> > > Were you suggesting reverting hugepage support entirely in your original
> > > message? If we entirely remove hugepage support, then I guess the first
> > > hugepage cannot be lost..
> >
> > Given Mike Rapoport's recent talk, I think that's an option, yes.
> >
> > >
> > > > > > > 2. we can tweak pfn_is_nosave function to *not* save pfn corresponding
> > > > > > > to PMP regions
> > > > > > > -> We don't lose the 1GB hugepage \o/
> > > > > > > 3. we can use register_nosave_region() to not save the "mmode_resv"
> > > > > > > regions (x86 does that
> > > > > > > https://elixir.bootlin.com/linux/v6.4-rc1/source/arch/x86/kernel/e820.c#L753)
> > > > > > > -> We don't lose the 1GB hugepage \o/
> > > > > > > 4. Given JeeHeng pointer to
> > > > > > > https://elixir.bootlin.com/linux/v6.4-rc1/source/kernel/power/snapshot.c#L1340,
> > > > > > > we can mark those pages as non-readable and make the hibernation
> > > > > > > process not save those pages
> > > > > > > -> Very late-in-the-day idea, not sure what it's worth, we also
> > > > > > > lose the 1GB hugepage...
> > > > > >
> > > > > > Ditto here re: introducing another regression.
> > > > > >
> > > > > > > To me, the best solution is 3 as it would prepare for other similar
> > > > > > > issues later, it is similar to x86 and it allows us to keep 1GB
> > > > > > > hugepages.
> > > > > > >
> > > > > > > I have been thinking, and to me nomap does not provide anything since
> > > > > > > the kernel should not address this memory range, so if it does, we
> > > > > > > must fix the kernel.
> > > > > > >
> > > > > > > Let me know what you all think, I'll be preparing a PoC of 3 in the meantime!
> > > > > >
> > > > > > #3 would probably get my vote too. It seems like you could use it
> > > > > > dynamically if there was to be a future other provider of "mmode_resv"
> > > > > > regions, rather than doing something location-specific.
> > > > > >
> > > > > > We should probably document these opensbi reserved memory nodes though
> > > > > > in a dt-binding or w/e if we are going to be relying on them to not
> > > > > > crash!
> > > >
> > > > Depending on a particular node name is fragile. If we really need
> > > > information from DT then I suggest adding "no-save-restore" DT
> > > > property in reserved memory nodes.
> > >
> > > We can add whatever properties we like, but where does that leave us for
> > > the systems in the wild where their reserved memory nodes do not contain
> > > a "no-save-restore" property or "no-map"?
> > >
> > > Ideally, yes, we do not depend on the node name and instead use explicit
> > > properties - but I think we may be "forced" to fall back to checking the
> > > node-name to cover the opensbi versions that do not contain one.
> > > LMK if I have missed something there!
> >
> > Yes I agree with you, we can implement Anup's solution #1, but we need
> > to fix the kernel anyway since if we don't, that would make the kernel
> > hibernation support depend on a certain version of openSBI.
>
> It's not unreasonable to have things depend on versions of the SBI
> implementation, but it is if they're not things that can be probed using
> the standard interfaces!
>
> > > > > Yes, you're right, let's see what Atish and Anup think!
> > > >
> > > > I think we have two possible approaches:
> > > >
> > > > 1) Update OpenSBI to set "no-map" DT property for firmware
> > > > reserved regions. We were doing this previously but removed
> > > > it later for performance reasons mentioned by Alex. It is also
> > > > worth mentioning that ARM Trusted Firmware also sets "no-map"
> > > > DT property for firmware reserved regions.
> > > >
> > > > 2) Add a new "no-save-restore" DT property in the reserved
> > > > memory DT bindings. The hibernate support of Linux arch/riscv
> > > > will use this DT property to exclude memory regions from
> > > > save-restore. The EFI implementation of EDK2 and U-Boot
> > > > should do the following:
> > > > 1) Treat all memory having "no-map" DT property as EFI
> > > > reserved memory
> > > > 2) Treat all memory not having "no-map" DT property and
> > > > not having "no-save-restore" DT property as EfiBootServicesData
> > > > 3) Treat all memory not having "no-map" DT property and
> > > > having "no-save-restore" DT property as EfiRuntimeServiceData
> > > > (Refer,
> > > > https://devicetree-specification.readthedocs.io/en/latest/chapter3-devicenodes.html#reserved-memory-and-uefi)
> > > >
> > > > Personally, I am leaning towards approach#1 since approach#2
> > > > will require changing DeviceTree specification as well.
> > >
> > > #1 is by far the simpler option of the two, if the consensus is that the
> > > loss of the first hugepage is not a problem (or if it is a problem that
> > > realistically is unavoidable).
> >
> > The "no-map" property does not provide much security anyway: the
> > kernel should not touch a page that is reserved (this is where I may
> > be wrong), so the real fix to this issue is to make the hibernation
> > process not save those pages.
>
> Right, the kernel clearly needs to be able to handle the regions. I, at
> least, was commenting on re-using no-map versus creating new properties
> for this situation.
> I was advocating for re-using the property & changing the kernel so as
> not to touch the regions during hibernation.
>
> > > There's something else I think I might be missing here, given the
> > > scattered nature of the reporting. This is not a problem for a system
> > > that does not implement hibernation, which was only added in v6.4-rc1?
> > >
> > > That would make it not a regression after all. I think I misunderstood
> > > the report on sw-dev to mean that this was a problem generally after
> > > v6.4-rc1, which would have been one. Could someone please confirm that?
> >
> > The problem is only present since v6.4-rc1, that's not a regression,
> > it's just that both patches landed at the same time and gave rise to
> > this issue.
>
> Sick. Glad to be wrong here!
>
> #regzbot resolve: not a regression, feature introduced this cycle
>
> > > If it only affects hibernation, and is not a regression, should we make
> > > ARCH_HIBERNATION_POSSIBLE def_bool n in Kconfig until we have agreed on
> > > a solution for the problem?
>
> Any thoughts on this one?

Ahah, I don't know, I tried to dodge the question :) But if we don't
fix this issue, hibernation won't work so is it enough?

2023-05-18 12:38:54

by Alexandre Ghiti

Subject: Re: Bug report: kernel paniced when system hibernates

On 5/18/23 08:53, Anup Patel wrote:
> On Wed, May 17, 2023 at 8:26 PM Alexandre Ghiti <[email protected]> wrote:
>> On Wed, May 17, 2023 at 1:28 PM Conor Dooley <[email protected]> wrote:
>>> Hey Alex,
>>>
>>> On Wed, May 17, 2023 at 10:58:02AM +0200, Alexandre Ghiti wrote:
>>>> On Tue, May 16, 2023 at 1:12 PM Alexandre Ghiti <[email protected]> wrote:
>>>>> On Tue, May 16, 2023 at 11:24 AM Song Shuai <[email protected]> wrote:
>>>>> I actually removed this flag a few years ago, and I have to admit that
>>>>> I need to check if that's necessary: the goal of commit 3335068f8721
>>>>> ("riscv: Use PUD/P4D/PGD pages for the linear mapping") is to expose
>>>>> the "right" start of DRAM so that we can align virtual and physical
>>>>> addresses on a 1GB boundary.
>>>>>
>>>>> So I have to check if a nomap region is actually added as a
>>>>> memblock.memory.regions[] or not: if yes, that's perfect, let's add
>>>>> the nomap attributes to the PMP regions, otherwise, I don't think that
>>>>> is a good solution.
>>>> So here is the current linear mapping without nomap in openSBI:
>>>>
>>>> ---[ Linear mapping ]---
>>>> 0xff60000000000000-0xff60000000200000 0x0000000080000000 2M
>>>> PMD D A G . . W R V
>>>> 0xff60000000200000-0xff60000000e00000 0x0000000080200000 12M
>>>> PMD D A G . . . R V
>>>>
>>>> And below the linear mapping with nomap in openSBI:
>>>>
>>>> ---[ Linear mapping ]---
>>>> 0xff60000000080000-0xff60000000200000 0x0000000080080000 1536K
>>>> PTE D A G . . W R V
>>>> 0xff60000000200000-0xff60000000e00000 0x0000000080200000 12M
>>>> PMD D A G . . . R V
>>>>
>>>> So adding nomap does not misalign virtual and physical addresses, it
>>>> prevents the usage of 1GB page for this area though, so that's a
>>>> solution, we just lose this 1GB page here.
>>>>
>>>> But even though that may be the fix, I think we also need to fix that
>>>> in the kernel as it would break compatibility with certain versions of
>>>> openSBI *if* we fix openSBI...So here are a few solutions:
>>>>
>>>> 1. we can mark all "mmode_resv" nodes in the device tree as nomap,
>>>> before the linear mapping is established (IIUC, those nodes are added
>>>> by openSBI to advertise PMP regions)
>>>> -> This amounts to the same fix as opensbi and we lose the 1GB hugepage.
>>> AFAIU, losing the 1 GB hugepage is a regression, which would make this
>>> not an option, right?
>> Not sure this is a real regression, I'd rather avoid it, but as
>> mentioned in my first answer, Mike Rapoport showed that it was making
>> no difference performance-wise...
>>
>>>> 2. we can tweak pfn_is_nosave function to *not* save pfn corresponding
>>>> to PMP regions
>>>> -> We don't lose the 1GB hugepage \o/
>>>> 3. we can use register_nosave_region() to not save the "mmode_resv"
>>>> regions (x86 does that
>>>> https://elixir.bootlin.com/linux/v6.4-rc1/source/arch/x86/kernel/e820.c#L753)
>>>> -> We don't lose the 1GB hugepage \o/
>>>> 4. Given JeeHeng pointer to
>>>> https://elixir.bootlin.com/linux/v6.4-rc1/source/kernel/power/snapshot.c#L1340,
>>>> we can mark those pages as non-readable and make the hibernation
>>>> process not save those pages
>>>> -> Very late-in-the-day idea, not sure what it's worth, we also
>>>> lose the 1GB hugepage...
>>> Ditto here re: introducing another regression.
>>>
>>>> To me, the best solution is 3 as it would prepare for other similar
>>>> issues later, it is similar to x86 and it allows us to keep 1GB
>>>> hugepages.
>>>>
>>>> I have been thinking, and to me nomap does not provide anything since
>>>> the kernel should not address this memory range, so if it does, we
>>>> must fix the kernel.
>>>>
>>>> Let me know what you all think, I'll be preparing a PoC of 3 in the meantime!
>>> #3 would probably get my vote too. It seems like you could use it
>>> dynamically if there was to be a future other provider of "mmode_resv"
>>> regions, rather than doing something location-specific.
>>>
>>> We should probably document these opensbi reserved memory nodes though
>>> in a dt-binding or w/e if we are going to be relying on them to not
>>> crash!
> Depending on a particular node name is fragile. If we really need
> information from DT then I suggest adding "no-save-restore" DT
> property in reserved memory nodes.


I understand your point, the node name is the only thing I found that
would work with current opensbi: any other idea what we could use instead?


>> Yes, you're right, let's see what Atish and Anup think!
> I think we have two possible approaches:
>
> 1) Update OpenSBI to set "no-map" DT property for firmware
> reserved regions. We were doing this previously but removed
> it later for performance reasons mentioned by Alex. It is also
> worth mentioning that ARM Trusted Firmware also sets "no-map"
> DT property for firmware reserved regions.
>
> 2) Add a new "no-save-restore" DT property in the reserved
> memory DT bindings. The hibernate support of Linux arch/riscv
> will use this DT property to exclude memory regions from
> save-restore. The EFI implementation of EDK2 and U-Boot
> should do the following:
> 1) Treat all memory having "no-map" DT property as EFI
> reserved memory
> 2) Treat all memory not having "no-map" DT property and
> not having "no-save-restore" DT property as EfiBootServicesData
> 3) Treat all memory not having "no-map" DT property and
> having "no-save-restore" DT property as EfiRuntimeServiceData
> (Refer,
> https://devicetree-specification.readthedocs.io/en/latest/chapter3-devicenodes.html#reserved-memory-and-uefi)
>
> Personally, I am leaning towards approach#1 since approach#2
> will require changing DeviceTree specification as well.


If needed, indeed #1 is the simplest, but I insist, to me it is not
needed (and we don't have it in the current opensbi), if you have
another opinion, I'm open to discuss it!

Thanks for your quick answer Anup,

Alex


>
> Regards,
> Anup
>
>> Thanks for your quick answers Conor and Song, really appreciated!
>>
>> Alex
>>
>>> Thanks for working on this,
>>> Conor.
>>>
> _______________________________________________
> linux-riscv mailing list
> [email protected]
> http://lists.infradead.org/mailman/listinfo/linux-riscv

2023-05-18 12:42:02

by Sia Jee Heng

Subject: RE: Bug report: kernel paniced when system hibernates



> -----Original Message-----
> From: Conor Dooley <[email protected]>
> Sent: Thursday, May 18, 2023 6:35 PM
> To: Alexandre Ghiti <[email protected]>
> Cc: Anup Patel <[email protected]>; Song Shuai <[email protected]>; [email protected]; Andrew Jones
> <[email protected]>; [email protected]; JeeHeng Sia <[email protected]>; Leyfoon Tan
> <[email protected]>; Mason Huo <[email protected]>; Paul Walmsley <[email protected]>; Guo Ren
> <[email protected]>; [email protected]; [email protected]; [email protected]
> Subject: Re: Bug report: kernel paniced when system hibernates
>
> On Thu, May 18, 2023 at 10:41:19AM +0200, Alexandre Ghiti wrote:
> > On Thu, May 18, 2023 at 10:00 AM Conor Dooley
> > <[email protected]> wrote:
> > > On Thu, May 18, 2023 at 12:23:59PM +0530, Anup Patel wrote:
> > > > On Wed, May 17, 2023 at 8:26 PM Alexandre Ghiti <[email protected]> wrote:
> > > > > On Wed, May 17, 2023 at 1:28 PM Conor Dooley <[email protected]> wrote:
> > > > > > On Wed, May 17, 2023 at 10:58:02AM +0200, Alexandre Ghiti wrote:
> > > > > > > On Tue, May 16, 2023 at 1:12 PM Alexandre Ghiti <[email protected]> wrote:
> > > > > >
> > > > > > > > On Tue, May 16, 2023 at 11:24 AM Song Shuai <[email protected]> wrote:
> > > > > > > > I actually removed this flag a few years ago, and I have to admit that
> > > > > > > > I need to check if that's necessary: the goal of commit 3335068f8721
> > > > > > > > ("riscv: Use PUD/P4D/PGD pages for the linear mapping") is to expose
> > > > > > > > the "right" start of DRAM so that we can align virtual and physical
> > > > > > > > addresses on a 1GB boundary.
> > > > > > > >
> > > > > > > > So I have to check if a nomap region is actually added as a
> > > > > > > > memblock.memory.regions[] or not: if yes, that's perfect, let's add
> > > > > > > > the nomap attributes to the PMP regions, otherwise, I don't think that
> > > > > > > > is a good solution.
> > > > > > >
> > > > > > > So here is the current linear mapping without nomap in openSBI:
> > > > > > >
> > > > > > > ---[ Linear mapping ]---
> > > > > > > 0xff60000000000000-0xff60000000200000 0x0000000080000000 2M
> > > > > > > PMD D A G . . W R V
> > > > > > > 0xff60000000200000-0xff60000000e00000 0x0000000080200000 12M
> > > > > > > PMD D A G . . . R V
> > > > > > >
> > > > > > > And below the linear mapping with nomap in openSBI:
> > > > > > >
> > > > > > > ---[ Linear mapping ]---
> > > > > > > 0xff60000000080000-0xff60000000200000 0x0000000080080000 1536K
> > > > > > > PTE D A G . . W R V
> > > > > > > 0xff60000000200000-0xff60000000e00000 0x0000000080200000 12M
> > > > > > > PMD D A G . . . R V
> > > > > > >
> > > > > > > So adding nomap does not misalign virtual and physical addresses, it
> > > > > > > prevents the usage of 1GB page for this area though, so that's a
> > > > > > > solution, we just lose this 1GB page here.
> > > > > > >
> > > > > > > But even though that may be the fix, I think we also need to fix that
> > > > > > > in the kernel as it would break compatibility with certain versions of
> > > > > > > openSBI *if* we fix openSBI...So here are a few solutions:
> > > > > > >
> > > > > > > 1. we can mark all "mmode_resv" nodes in the device tree as nomap,
> > > > > > > before the linear mapping is established (IIUC, those nodes are added
> > > > > > > by openSBI to advertise PMP regions)
> > > > > > > -> This amounts to the same fix as opensbi and we lose the 1GB hugepage.
> > > > > >
> > > > > > AFAIU, losing the 1 GB hugepage is a regression, which would make this
> > > > > > not an option, right?
> > > > >
> > > > > Not sure this is a real regression, I'd rather avoid it, but as
> > > > > mentioned in my first answer, Mike Rapoport showed that it was making
> > > > > no difference performance-wise...
> > >
> > > My point was that if someone has hugepages enabled & we handle this in a
> > > way that causes the first hugepage to be unusable, is that not a
> > > regression? Whether hugepages provide a performance benefit is not
> > > really related to that question AFAICT.
> >
> > Not being able to map certain regions of the linear mapping with a 1GB
> > hugepage will happen, for example the kernel mapping is protected in
> > the linear mapping so that it can't be written: so we can only map
> > this region with 2MB hugepages. A firmware could mark a region as
> > "no-map" and there again we would not be able to use a 1GB hugepage. I
> > don't see that as a regression as the intention is not to *always* use
> > 1GB hugepages, but rather to use them when possible. Does that make
> > sense?
>
> Ah, so it was as I expected - I don't/didn't properly understand
> hugepages. Thanks.
>
> > > Were you suggesting reverting hugepage support entirely in your original
> > > message? If we entirely remove hugepage support, then I guess the first
> > > hugepage cannot be lost..
> >
> > Given Mike Rapoport's recent talk, I think that's an option, yes.
> >
> > >
> > > > > > > 2. we can tweak pfn_is_nosave function to *not* save pfn corresponding
> > > > > > > to PMP regions
> > > > > > > -> We don't lose the 1GB hugepage \o/
> > > > > > > 3. we can use register_nosave_region() to not save the "mmode_resv"
> > > > > > > regions (x86 does that
> > > > > > > https://elixir.bootlin.com/linux/v6.4-rc1/source/arch/x86/kernel/e820.c#L753)
> > > > > > > -> We don't lose the 1GB hugepage \o/
> > > > > > > 4. Given JeeHeng pointer to
> > > > > > > https://elixir.bootlin.com/linux/v6.4-rc1/source/kernel/power/snapshot.c#L1340,
> > > > > > > we can mark those pages as non-readable and make the hibernation
> > > > > > > process not save those pages
> > > > > > > -> Very late-in-the-day idea, not sure what it's worth, we also
> > > > > > > lose the 1GB hugepage...
> > > > > >
> > > > > > Ditto here re: introducing another regression.
> > > > > >
> > > > > > > To me, the best solution is 3 as it would prepare for other similar
> > > > > > > issues later, it is similar to x86 and it allows us to keep 1GB
> > > > > > > hugepages.
> > > > > > >
> > > > > > > I have been thinking, and to me nomap does not provide anything since
> > > > > > > the kernel should not address this memory range, so if it does, we
> > > > > > > must fix the kernel.
> > > > > > >
> > > > > > > Let me know what you all think, I'll be preparing a PoC of 3 in the meantime!
> > > > > >
> > > > > > #3 would probably get my vote too. It seems like you could use it
> > > > > > dynamically if there was to be a future other provider of "mmode_resv"
> > > > > > regions, rather than doing something location-specific.
> > > > > >
> > > > > > We should probably document these opensbi reserved memory nodes though
> > > > > > in a dt-binding or w/e if we are going to be relying on them to not
> > > > > > crash!
> > > >
> > > > Depending on a particular node name is fragile. If we really need
> > > > information from DT then I suggest adding "no-save-restore" DT
> > > > property in reserved memory nodes.
> > >
> > > We can add whatever properties we like, but where does that leave us for
> > > the systems in the wild where their reserved memory nodes do not contain
> > > a "no-save-restore" property or "no-map"?
> > >
> > > Ideally, yes, we do not depend on the node name and instead use explicit
> > > properties - but I think we may be "forced" to fall back to checking the
> > > node-name to cover the opensbi versions that do not contain one.
> > > LMK if I have missed something there!
> >
> > Yes I agree with you, we can implement Anup's solution #1, but we need
> > to fix the kernel anyway since if we don't, that would make the kernel
> > hibernation support depend on a certain version of openSBI.
>
> It's not unreasonable to have things depend on versions of the SBI
> implementation, but it is if they're not things that can be probed using
> the standard interfaces!
>
> > > > > Yes, you're right, let's see what Atish and Anup think!
> > > >
> > > > I think we have two possible approaches:
> > > >
> > > > 1) Update OpenSBI to set "no-map" DT property for firmware
> > > > reserved regions. We were doing this previously but removed
> > > > it later for performance reasons mentioned by Alex. It is also
> > > > worth mentioning that ARM Trusted Firmware also sets "no-map"
> > > > DT property for firmware reserved regions.
> > > >
> > > > 2) Add a new "no-save-restore" DT property in the reserved
> > > > memory DT bindings. The hibernate support of Linux arch/riscv
> > > > will use this DT property to exclude memory regions from
> > > > save-restore. The EFI implementation of EDK2 and U-Boot
> > > > should do the following:
> > > > 1) Treat all memory having "no-map" DT property as EFI
> > > > reserved memory
> > > > 2) Treat all memory not having "no-map" DT property and
> > > > not having "no-save-restore" DT property as EfiBootServicesData
> > > > 3) Treat all memory not having "no-map" DT property and
> > > > having "no-save-restore" DT property as EfiRuntimeServiceData
> > > > (Refer,
> > > > https://devicetree-specification.readthedocs.io/en/latest/chapter3-devicenodes.html#reserved-memory-and-uefi)
> > > >
> > > > Personally, I am leaning towards approach#1 since approach#2
> > > > will require changing DeviceTree specification as well.
> > >
> > > #1 is by far the simpler option of the two, if the consensus is that the
> > > loss of the first hugepage is not a problem (or if it is a problem that
> > > realistically is unavoidable).
> >
> > The "no-map" property does not provide much security anyway: the
> > kernel should not touch a page that is reserved (this is where I may
> > be wrong), so the real fix to this issue is to make the hibernation
> > process not save those pages.
>
> Right, the kernel clearly needs to be able to handle the regions. I, at
> least, was commenting on re-using no-map versus creating new properties
> for this situation.
> I was advocating for re-using the property & changing the kernel so as
> not to touch the regions during hibernation.
>
> > > There's something else I think I might be missing here, given the
> > > scattered nature of the reporting. This is not a problem for a system
> > > that does not implement hibernation, which was only added in v6.4-rc1?
> > >
> > > That would make it not a regression after all. I think I misunderstood
> > > the report on sw-dev to mean that this was a problem generally after
> > > v6.4-rc1, which would have been one. Could someone please confirm that?
> >
> > The problem is only present since v6.4-rc1, that's not a regression,
> > it's just that both patches landed at the same time and gave rise to
> > this issue.
>
> Sick. Glad to be wrong here!
>
> #regzbot resolve: not a regression, feature introduced this cycle
>
> > > If it only affects hibernation, and is not a regression, should we make
> > > ARCH_HIBERNATION_POSSIBLE def_bool n in Kconfig until we have agreed on
> > > a solution for the problem?
>
> Any thoughts on this one?

I concur that the conversation was somewhat fragmented. Allow me to reorganize and summarize it as follows:

The initial report, from the sw-dev mailing list, revealed that the Linux RISC-V memory layout changed in v6.4-rc1. Based on the conversation, the physical memory region from 0x80000000 to 0x80200000 (the PMP-protected region) was reserved and not mapped in the kernel page table in kernel versions prior to v6.4-rc1. However, with PMP protection still in place and OpenSBI having no current plans to remove the PMP flag, this region is now mapped in the kernel page table as of v6.4-rc1.

The second report, which came through the mailing list, indicated that the kernel crashes during hibernation. The author managed to isolate the suspicious patches and traced the root cause to the memory region protected by PMP. The discussion also suggested that the PMP region needs to be mapped in the page table to accommodate the 1GB huge page.

In essence, both reports concern the PMP protected memory region that's now exposed to the page table in kernel v6.4-rc1.

It is evident that we're dealing with two distinct but interdependent issues:
1. Kernel crashes during the hibernation process.
2. PMP protected region exposure to the kernel page table.

As for the first problem, we have a possible solution or workaround on hand: prevent the hibernation core from saving the PMP-protected memory region. This still needs some tidying up and debugging.

In relation to the second problem, potential solutions have been suggested but require further discussion. These solutions include:
a. Retain the exposure of the PMP protected region to the kernel page table, as the kernel doesn't utilize this region.
b. Revise the OpenSBI Device Tree (DT) and update the documentation for the firmware reserved region.
c. Implement the no-map option, which could potentially result in losing the 1GB huge page support.
d. Abandon the 1GB huge page support entirely.

In response to your query, I propose we proceed with the PMP workaround patch, while concurrently continuing discussions on solutions a, b, c, and d.
@Alex, feel free to reach out to me.

2023-05-18 14:14:08

by Anup Patel

[permalink] [raw]
Subject: Re: Bug report: kernel paniced when system hibernates

On Thu, May 18, 2023 at 5:39 PM Alexandre Ghiti <[email protected]> wrote:
>
> On 5/18/23 08:53, Anup Patel wrote:
> > On Wed, May 17, 2023 at 8:26 PM Alexandre Ghiti <[email protected]> wrote:
> >> On Wed, May 17, 2023 at 1:28 PM Conor Dooley <[email protected]> wrote:
> >>> Hey Alex,
> >>>
> >>> On Wed, May 17, 2023 at 10:58:02AM +0200, Alexandre Ghiti wrote:
> >>>> On Tue, May 16, 2023 at 1:12 PM Alexandre Ghiti <[email protected]> wrote:
> >>>>> On Tue, May 16, 2023 at 11:24 AM Song Shuai <[email protected]> wrote:
> >>>>> I actually removed this flag a few years ago, and I have to admit that
> >>>>> I need to check if that's necessary: the goal of commit 3335068f8721
> >>>>> ("riscv: Use PUD/P4D/PGD pages for the linear mapping") is to expose
> >>>>> the "right" start of DRAM so that we can align virtual and physical
> >>>>> addresses on a 1GB boundary.
> >>>>>
> >>>>> So I have to check if a nomap region is actually added as a
> >>>>> memblock.memory.regions[] or not: if yes, that's perfect, let's add
> >>>>> the nomap attributes to the PMP regions, otherwise, I don't think that
> >>>>> is a good solution.
> >>>> So here is the current linear mapping without nomap in openSBI:
> >>>>
> >>>> ---[ Linear mapping ]---
> >>>> 0xff60000000000000-0xff60000000200000 0x0000000080000000 2M
> >>>> PMD D A G . . W R V
> >>>> 0xff60000000200000-0xff60000000e00000 0x0000000080200000 12M
> >>>> PMD D A G . . . R V
> >>>>
> >>>> And below the linear mapping with nomap in openSBI:
> >>>>
> >>>> ---[ Linear mapping ]---
> >>>> 0xff60000000080000-0xff60000000200000 0x0000000080080000 1536K
> >>>> PTE D A G . . W R V
> >>>> 0xff60000000200000-0xff60000000e00000 0x0000000080200000 12M
> >>>> PMD D A G . . . R V
> >>>>
> >>>> So adding nomap does not misalign virtual and physical addresses, it
> >>>> prevents the usage of 1GB page for this area though, so that's a
> >>>> solution, we just lose this 1GB page here.
> >>>>
> >>>> But even though that may be the fix, I think we also need to fix that
> >>>> in the kernel as it would break compatibility with certain versions of
> >>>> openSBI *if* we fix openSBI...So here are a few solutions:
> >>>>
> >>>> 1. we can mark all "mmode_resv" nodes in the device tree as nomap,
> >>>> before the linear mapping is established (IIUC, those nodes are added
> >>>> by openSBI to advertise PMP regions)
> >>>> -> This amounts to the same fix as opensbi and we lose the 1GB hugepage.
> >>> AFAIU, losing the 1 GB hugepage is a regression, which would make this
> >>> not an option, right?
> >> Not sure this is a real regression, I'd rather avoid it, but as
> >> mentioned in my first answer, Mike Rapoport showed that it was making
> >> no difference performance-wise...
> >>
> >>>> 2. we can tweak pfn_is_nosave function to *not* save pfn corresponding
> >>>> to PMP regions
> >>>> -> We don't lose the 1GB hugepage \o/
> >>>> 3. we can use register_nosave_region() to not save the "mmode_resv"
> >>>> regions (x86 does that
> >>>> https://elixir.bootlin.com/linux/v6.4-rc1/source/arch/x86/kernel/e820.c#L753)
> >>>> -> We don't lose the 1GB hugepage \o/
> >>>> 4. Given JeeHeng pointer to
> >>>> https://elixir.bootlin.com/linux/v6.4-rc1/source/kernel/power/snapshot.c#L1340,
> >>>> we can mark those pages as non-readable and make the hibernation
> >>>> process not save those pages
> >>>> -> Very late-in-the-day idea, not sure what it's worth, we also
> >>>> lose the 1GB hugepage...
> >>> Ditto here re: introducing another regression.
> >>>
> >>>> To me, the best solution is 3 as it would prepare for other similar
> >>>> issues later, it is similar to x86 and it allows us to keep 1GB
> >>>> hugepages.
> >>>>
> >>>> I have been thinking, and to me nomap does not provide anything since
> >>>> the kernel should not address this memory range, so if it does, we
> >>>> must fix the kernel.
> >>>>
> >>>> Let me know what you all think, I'll be preparing a PoC of 3 in the meantime!
> >>> #3 would probably get my vote too. It seems like you could use it
> >>> dynamically if there was to be a future other provider of "mmode_resv"
> >>> regions, rather than doing something location-specific.
> >>>
> >>> We should probably document these opensbi reserved memory nodes though
> >>> in a dt-binding or w/e if we are going to be relying on them to not
> >>> crash!
> > Depending on a particular node name is fragile. If we really need
> > information from DT then I suggest adding "no-save-restore" DT
> > property in reserved memory nodes.
>
>
> I understand your point, the node name is the only thing I found that
> would work with current opensbi: any other idea what we could use instead?
>
>
> >> Yes, you're right, let's see what Atish and Anup think!
> > I think we have two possible approaches:
> >
> > 1) Update OpenSBI to set "no-map" DT property for firmware
> > reserved regions. We were doing this previously but removed
> > it later for performance reasons mentioned by Alex. It is also
> > worth mentioning that ARM Trusted Firmware also sets "no-map"
> > DT property for firmware reserved regions.
> >
> > 2) Add a new "no-save-restore" DT property in the reserved
> > memory DT bindings. The hibernate support of Linux arch/riscv
> > will use this DT property to exclude memory regions from
> > save-restore. The EFI implementation of EDK2 and U-Boot
> > should do the following:
> > 1) Treat all memory having "no-map" DT property as EFI
> > reserved memory
> > 2) Treat all memory not having "no-map" DT property and
> > not having "no-save-restore" DT property as EfiBootServicesData
> > 3) Treat all memory not having "no-map" DT property and
> > having "no-save-restore" DT property as EfiRuntimeServiceData
> > (Refer,
> > https://devicetree-specification.readthedocs.io/en/latest/chapter3-devicenodes.html#reserved-memory-and-uefi)
> >
> > Personally, I am leaning towards approach#1 since approach#2
> > will require changing DeviceTree specification as well.
>
>
> If needed, indeed #1 is the simplest, but I insist, to me it is not
> needed (and we don't have it in the current opensbi), if you have
> another opinion, I'm open to discuss it!

I agree with you, backward compatibility with older firmware
is important.

Let's go with your proposed change to treat reserved DT nodes
with "mmode_resv*" name as M-mode firmware memory (it could
be any M-mode firmware). We will certainly need to document it
somewhere as an expectation of Linux RISC-V kernel.

@Sunil How about treating "mmode_resv*" as
EfiRuntimeServiceData in EDK2 ? Other reserved memory
nodes can follow the device tree specification.

Regards,
Anup

>
> Thanks for your quick answer Anup,
>
> Alex
>
>
> >
> > Regards,
> > Anup
> >
> >> Thanks for your quick answers Conor and Song, really appreciated!
> >>
> >> Alex
> >>
> >>> Thanks for working on this,
> >>> Conor.
> >>>
> > _______________________________________________
> > linux-riscv mailing list
> > [email protected]
> > http://lists.infradead.org/mailman/listinfo/linux-riscv

2023-05-24 13:57:11

by Conor Dooley

[permalink] [raw]
Subject: Re: Bug report: kernel paniced when system hibernates

Alex, Anup,

On Thu, May 18, 2023 at 07:34:16PM +0530, Anup Patel wrote:
> On Thu, May 18, 2023 at 5:39 PM Alexandre Ghiti <[email protected]> wrote:
> > On 5/18/23 08:53, Anup Patel wrote:
> > > On Wed, May 17, 2023 at 8:26 PM Alexandre Ghiti <[email protected]> wrote:
> > >> On Wed, May 17, 2023 at 1:28 PM Conor Dooley <[email protected]> wrote:
> > >>> On Wed, May 17, 2023 at 10:58:02AM +0200, Alexandre Ghiti wrote:
> > >>>> On Tue, May 16, 2023 at 1:12 PM Alexandre Ghiti <[email protected]> wrote:
> > >>>>> On Tue, May 16, 2023 at 11:24 AM Song Shuai <[email protected]> wrote:
> > >>>>> I actually removed this flag a few years ago, and I have to admit that
> > >>>>> I need to check if that's necessary: the goal of commit 3335068f8721
> > >>>>> ("riscv: Use PUD/P4D/PGD pages for the linear mapping") is to expose
> > >>>>> the "right" start of DRAM so that we can align virtual and physical
> > >>>>> addresses on a 1GB boundary.
> > >>>>>
> > >>>>> So I have to check if a nomap region is actually added as a
> > >>>>> memblock.memory.regions[] or not: if yes, that's perfect, let's add
> > >>>>> the nomap attributes to the PMP regions, otherwise, I don't think that
> > >>>>> is a good solution.
> > >>>> So here is the current linear mapping without nomap in openSBI:
> > >>>>
> > >>>> ---[ Linear mapping ]---
> > >>>> 0xff60000000000000-0xff60000000200000 0x0000000080000000 2M
> > >>>> PMD D A G . . W R V
> > >>>> 0xff60000000200000-0xff60000000e00000 0x0000000080200000 12M
> > >>>> PMD D A G . . . R V
> > >>>>
> > >>>> And below the linear mapping with nomap in openSBI:
> > >>>>
> > >>>> ---[ Linear mapping ]---
> > >>>> 0xff60000000080000-0xff60000000200000 0x0000000080080000 1536K
> > >>>> PTE D A G . . W R V
> > >>>> 0xff60000000200000-0xff60000000e00000 0x0000000080200000 12M
> > >>>> PMD D A G . . . R V
> > >>>>
> > >>>> So adding nomap does not misalign virtual and physical addresses, it
> > >>>> prevents the usage of 1GB page for this area though, so that's a
> > >>>> solution, we just lose this 1GB page here.
> > >>>>
> > >>>> But even though that may be the fix, I think we also need to fix that
> > >>>> in the kernel as it would break compatibility with certain versions of
> > >>>> openSBI *if* we fix openSBI...So here are a few solutions:
> > >>>>
> > >>>> 1. we can mark all "mmode_resv" nodes in the device tree as nomap,
> > >>>> before the linear mapping is established (IIUC, those nodes are added
> > >>>> by openSBI to advertise PMP regions)
> > >>>> -> This amounts to the same fix as opensbi and we lose the 1GB hugepage.
> > >>> AFAIU, losing the 1 GB hugepage is a regression, which would make this
> > >>> not an option, right?
> > >> Not sure this is a real regression, I'd rather avoid it, but as
> > >> mentioned in my first answer, Mike Rapoport showed that it was making
> > >> no difference performance-wise...
> > >>
> > >>>> 2. we can tweak pfn_is_nosave function to *not* save pfn corresponding
> > >>>> to PMP regions
> > >>>> -> We don't lose the 1GB hugepage \o/
> > >>>> 3. we can use register_nosave_region() to not save the "mmode_resv"
> > >>>> regions (x86 does that
> > >>>> https://elixir.bootlin.com/linux/v6.4-rc1/source/arch/x86/kernel/e820.c#L753)
> > >>>> -> We don't lose the 1GB hugepage \o/
> > >>>> 4. Given JeeHeng pointer to
> > >>>> https://elixir.bootlin.com/linux/v6.4-rc1/source/kernel/power/snapshot.c#L1340,
> > >>>> we can mark those pages as non-readable and make the hibernation
> > >>>> process not save those pages
> > >>>> -> Very late-in-the-day idea, not sure what it's worth, we also
> > >>>> lose the 1GB hugepage...
> > >>> Ditto here re: introducing another regression.
> > >>>
> > >>>> To me, the best solution is 3 as it would prepare for other similar
> > >>>> issues later, it is similar to x86 and it allows us to keep 1GB
> > >>>> hugepages.
> > >>>>
> > >>>> I have been thinking, and to me nomap does not provide anything since
> > >>>> the kernel should not address this memory range, so if it does, we
> > >>>> must fix the kernel.
> > >>>>
> > >>>> Let me know what you all think, I'll be preparing a PoC of 3 in the meantime!
> > >>> #3 would probably get my vote too. It seems like you could use it
> > >>> dynamically if there was to be a future other provider of "mmode_resv"
> > >>> regions, rather than doing something location-specific.
> > >>>
> > >>> We should probably document these opensbi reserved memory nodes though
> > >>> in a dt-binding or w/e if we are going to be relying on them to not
> > >>> crash!
> > > Depending on a particular node name is fragile. If we really need
> > > information from DT then I suggest adding "no-save-restore" DT
> > > property in reserved memory nodes.
> >
> >
> > I understand your point, the node name is the only thing I found that
> > would work with current opensbi: any other idea what we could use instead?
> >
> >
> > >> Yes, you're right, let's see what Atish and Anup think!
> > > I think we have two possible approaches:
> > >
> > > 1) Update OpenSBI to set "no-map" DT property for firmware
> > > reserved regions. We were doing this previously but removed
> > > it later for performance reasons mentioned by Alex. It is also
> > > worth mentioning that ARM Trusted Firmware also sets "no-map"
> > > DT property for firmware reserved regions.
> > >
> > > 2) Add a new "no-save-restore" DT property in the reserved
> > > memory DT bindings. The hibernate support of Linux arch/riscv
> > > will use this DT property to exclude memory regions from
> > > save-restore. The EFI implementation of EDK2 and U-Boot
> > > should do the following:
> > > 1) Treat all memory having "no-map" DT property as EFI
> > > reserved memory
> > > 2) Treat all memory not having "no-map" DT property and
> > > not having "no-save-restore" DT property as EfiBootServicesData
> > > 3) Treat all memory not having "no-map" DT property and
> > > having "no-save-restore" DT property as EfiRuntimeServiceData
> > > (Refer,
> > > https://devicetree-specification.readthedocs.io/en/latest/chapter3-devicenodes.html#reserved-memory-and-uefi)
> > >
> > > Personally, I am leaning towards approach#1 since approach#2
> > > will require changing DeviceTree specification as well.
> >
> >
> > If needed, indeed #1 is the simplest, but I insist, to me it is not
> > needed (and we don't have it in the current opensbi), if you have
> > another opinion, I'm open to discuss it!
>
> I agree with you, backward compatibility with older firmwares
> is important.
>
> Let's go with your proposed change to treat reserved DT nodes
> with "mmode_resv*" name as M-mode firmware memory (it could
> be any M-mode firmware). We will certainly need to document it
> somewhere as an expectation of Linux RISC-V kernel.

Actually, you two both probably know the answer to this, but was there a
release done of OpenSBI where the reserved memory region was not
specified to be no-map?

>
> @Sunil How about treating "mmode_resv*" as
> EfiRuntimeServiceData in EDK2 ? Other reserved memory
> nodes can follow the device tree specification.



2023-05-24 14:13:05

by Alexandre Ghiti

[permalink] [raw]
Subject: Re: Bug report: kernel paniced when system hibernates

Hi Conor,

On Wed, May 24, 2023 at 3:49 PM Conor Dooley <[email protected]> wrote:
>
> Alex, Anup,
>
> On Thu, May 18, 2023 at 07:34:16PM +0530, Anup Patel wrote:
> > On Thu, May 18, 2023 at 5:39 PM Alexandre Ghiti <[email protected]> wrote:
> > > On 5/18/23 08:53, Anup Patel wrote:
> > > > On Wed, May 17, 2023 at 8:26 PM Alexandre Ghiti <[email protected]> wrote:
> > > >> On Wed, May 17, 2023 at 1:28 PM Conor Dooley <[email protected]> wrote:
> > > >>> On Wed, May 17, 2023 at 10:58:02AM +0200, Alexandre Ghiti wrote:
> > > >>>> On Tue, May 16, 2023 at 1:12 PM Alexandre Ghiti <[email protected]> wrote:
> > > >>>>> On Tue, May 16, 2023 at 11:24 AM Song Shuai <[email protected]> wrote:
> > > >>>>> I actually removed this flag a few years ago, and I have to admit that
> > > >>>>> I need to check if that's necessary: the goal of commit 3335068f8721
> > > >>>>> ("riscv: Use PUD/P4D/PGD pages for the linear mapping") is to expose
> > > >>>>> the "right" start of DRAM so that we can align virtual and physical
> > > >>>>> addresses on a 1GB boundary.
> > > >>>>>
> > > >>>>> So I have to check if a nomap region is actually added as a
> > > >>>>> memblock.memory.regions[] or not: if yes, that's perfect, let's add
> > > >>>>> the nomap attributes to the PMP regions, otherwise, I don't think that
> > > >>>>> is a good solution.
> > > >>>> So here is the current linear mapping without nomap in openSBI:
> > > >>>>
> > > >>>> ---[ Linear mapping ]---
> > > >>>> 0xff60000000000000-0xff60000000200000 0x0000000080000000 2M
> > > >>>> PMD D A G . . W R V
> > > >>>> 0xff60000000200000-0xff60000000e00000 0x0000000080200000 12M
> > > >>>> PMD D A G . . . R V
> > > >>>>
> > > >>>> And below the linear mapping with nomap in openSBI:
> > > >>>>
> > > >>>> ---[ Linear mapping ]---
> > > >>>> 0xff60000000080000-0xff60000000200000 0x0000000080080000 1536K
> > > >>>> PTE D A G . . W R V
> > > >>>> 0xff60000000200000-0xff60000000e00000 0x0000000080200000 12M
> > > >>>> PMD D A G . . . R V
> > > >>>>
> > > >>>> So adding nomap does not misalign virtual and physical addresses, it
> > > >>>> prevents the usage of 1GB page for this area though, so that's a
> > > >>>> solution, we just lose this 1GB page here.
> > > >>>>
> > > >>>> But even though that may be the fix, I think we also need to fix that
> > > >>>> in the kernel as it would break compatibility with certain versions of
> > > >>>> openSBI *if* we fix openSBI...So here are a few solutions:
> > > >>>>
> > > >>>> 1. we can mark all "mmode_resv" nodes in the device tree as nomap,
> > > >>>> before the linear mapping is established (IIUC, those nodes are added
> > > >>>> by openSBI to advertise PMP regions)
> > > >>>> -> This amounts to the same fix as opensbi and we lose the 1GB hugepage.
> > > >>> AFAIU, losing the 1 GB hugepage is a regression, which would make this
> > > >>> not an option, right?
> > > >> Not sure this is a real regression, I'd rather avoid it, but as
> > > >> mentioned in my first answer, Mike Rapoport showed that it was making
> > > >> no difference performance-wise...
> > > >>
> > > >>>> 2. we can tweak pfn_is_nosave function to *not* save pfn corresponding
> > > >>>> to PMP regions
> > > >>>> -> We don't lose the 1GB hugepage \o/
> > > >>>> 3. we can use register_nosave_region() to not save the "mmode_resv"
> > > >>>> regions (x86 does that
> > > >>>> https://elixir.bootlin.com/linux/v6.4-rc1/source/arch/x86/kernel/e820.c#L753)
> > > >>>> -> We don't lose the 1GB hugepage \o/
> > > >>>> 4. Given JeeHeng pointer to
> > > >>>> https://elixir.bootlin.com/linux/v6.4-rc1/source/kernel/power/snapshot.c#L1340,
> > > >>>> we can mark those pages as non-readable and make the hibernation
> > > >>>> process not save those pages
> > > >>>> -> Very late-in-the-day idea, not sure what it's worth, we also
> > > >>>> lose the 1GB hugepage...
> > > >>> Ditto here re: introducing another regression.
> > > >>>
> > > >>>> To me, the best solution is 3 as it would prepare for other similar
> > > >>>> issues later, it is similar to x86 and it allows us to keep 1GB
> > > >>>> hugepages.
> > > >>>>
> > > >>>> I have been thinking, and to me nomap does not provide anything since
> > > >>>> the kernel should not address this memory range, so if it does, we
> > > >>>> must fix the kernel.
> > > >>>>
> > > >>>> Let me know what you all think, I'll be preparing a PoC of 3 in the meantime!
> > > >>> #3 would probably get my vote too. It seems like you could use it
> > > >>> dynamically if there was to be a future other provider of "mmode_resv"
> > > >>> regions, rather than doing something location-specific.
> > > >>>
> > > >>> We should probably document these opensbi reserved memory nodes though
> > > >>> in a dt-binding or w/e if we are going to be relying on them to not
> > > >>> crash!
> > > > Depending on a particular node name is fragile. If we really need
> > > > information from DT then I suggest adding "no-save-restore" DT
> > > > property in reserved memory nodes.
> > >
> > >
> > > I understand your point, the node name is the only thing I found that
> > > would work with current opensbi: any other idea what we could use instead?
> > >
> > >
> > > >> Yes, you're right, let's see what Atish and Anup think!
> > > > I think we have two possible approaches:
> > > >
> > > > 1) Update OpenSBI to set "no-map" DT property for firmware
> > > > reserved regions. We were doing this previously but removed
> > > > it later for performance reasons mentioned by Alex. It is also
> > > > worth mentioning that ARM Trusted Firmware also sets "no-map"
> > > > DT property for firmware reserved regions.
> > > >
> > > > 2) Add a new "no-save-restore" DT property in the reserved
> > > > memory DT bindings. The hibernate support of Linux arch/riscv
> > > > will use this DT property to exclude memory regions from
> > > > save-restore. The EFI implementation of EDK2 and U-Boot
> > > > should do the following:
> > > > 1) Treat all memory having "no-map" DT property as EFI
> > > > reserved memory
> > > > 2) Treat all memory not having "no-map" DT property and
> > > > not having "no-save-restore" DT property as EfiBootServicesData
> > > > 3) Treat all memory not having "no-map" DT property and
> > > > having "no-save-restore" DT property as EfiRuntimeServiceData
> > > > (Refer,
> > > > https://devicetree-specification.readthedocs.io/en/latest/chapter3-devicenodes.html#reserved-memory-and-uefi)
> > > >
> > > > Personally, I am leaning towards approach#1 since approach#2
> > > > will require changing DeviceTree specification as well.
> > >
> > >
> > > If needed, indeed #1 is the simplest, but I insist, to me it is not
> > > needed (and we don't have it in the current opensbi), if you have
> > > another opinion, I'm open to discuss it!
> >
> > I agree with you, backward compatibility with older firmwares
> > is important.
> >
> > Let's go with your proposed change to treat reserved DT nodes
> > with "mmode_resv*" name as M-mode firmware memory (it could
> > be any M-mode firmware). We will certainly need to document it
> > somewhere as an expectation of Linux RISC-V kernel.
>
> Actually, you two both probably know the answer to this, but was there a
> release done of OpenSBI where the reserved memory region was not
> specified to be no-map?

The reserved memory regions are *currently* not specified to be no-map,
and that has been the case since v0.8: the culprit is commit 6966ad0abe70
("platform/lib: Allow the OS to map the regions that are protected by
PMP").

So to make sure we are on the same page:

- from 0.1 to 0.7 => reserved memory regions were marked as no-map
- starting from 0.8 => reserved memory regions are *not* marked as
no-map anymore
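
For reference, this is roughly the shape of the reserved-memory node involved; the node contents below are a sketch based on this thread, not verbatim OpenSBI output:

```dts
reserved-memory {
        #address-cells = <2>;
        #size-cells = <2>;
        ranges;

        mmode_resv0@80000000 {
                reg = <0x0 0x80000000 0x0 0x200000>;
                /* OpenSBI 0.1 - 0.7 additionally emitted:
                 *     no-map;
                 * which kept the region out of the linear mapping.
                 * Since 0.8 the property is absent, so the region is
                 * mapped (and eligible for large linear-mapping pages). */
        };
};
```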

Hope that's clear!

Alex

>
> >
> > @Sunil How about treating "mmode_resv*" as
> > EfiRuntimeServiceData in EDK2 ? Other reserved memory
> > nodes can follow the device tree specification.
>

2023-05-24 16:11:50

by Conor Dooley

[permalink] [raw]
Subject: Re: Bug report: kernel paniced when system hibernates

On Wed, May 24, 2023 at 03:57:20PM +0200, Alexandre Ghiti wrote:
> The reserved memory regions are *currently* not specified to be no-map
> and that since v0.8: the culprit is commit 6966ad0abe70
> ("platform/lib: Allow the OS to map the regions that are protected by
> PMP").
>
> So to make sure we are on the same page:
>
> - from 0.1 to 0.7 => reserved memory regions were marked as no-map
> - starting from 0.8 => reserved memory regions are *not* marked as
> no-map anymore
>
> Hope that's clear!

Yes it was helpful - thanks.



2023-05-25 00:05:23

by Atish Patra

[permalink] [raw]
Subject: Re: Bug report: kernel paniced when system hibernates

On Thu, May 18, 2023 at 7:04 AM Anup Patel <[email protected]> wrote:
>
> On Thu, May 18, 2023 at 5:39 PM Alexandre Ghiti <[email protected]> wrote:
> >
> > On 5/18/23 08:53, Anup Patel wrote:
> > > On Wed, May 17, 2023 at 8:26 PM Alexandre Ghiti <[email protected]> wrote:
> > >> On Wed, May 17, 2023 at 1:28 PM Conor Dooley <[email protected]> wrote:
> > >>> Hey Alex,
> > >>>
> > >>> On Wed, May 17, 2023 at 10:58:02AM +0200, Alexandre Ghiti wrote:
> > >>>> On Tue, May 16, 2023 at 1:12 PM Alexandre Ghiti <[email protected]> wrote:
> > >>>>> On Tue, May 16, 2023 at 11:24 AM Song Shuai <[email protected]> wrote:
> > >>>>> I actually removed this flag a few years ago, and I have to admit that
> > >>>>> I need to check if that's necessary: the goal of commit 3335068f8721
> > >>>>> ("riscv: Use PUD/P4D/PGD pages for the linear mapping") is to expose
> > >>>>> the "right" start of DRAM so that we can align virtual and physical
> > >>>>> addresses on a 1GB boundary.
> > >>>>>
> > >>>>> So I have to check if a nomap region is actually added as a
> > >>>>> memblock.memory.regions[] or not: if yes, that's perfect, let's add
> > >>>>> the nomap attributes to the PMP regions, otherwise, I don't think that
> > >>>>> is a good solution.
> > >>>> So here is the current linear mapping without nomap in openSBI:
> > >>>>
> > >>>> ---[ Linear mapping ]---
> > >>>> 0xff60000000000000-0xff60000000200000 0x0000000080000000 2M
> > >>>> PMD D A G . . W R V
> > >>>> 0xff60000000200000-0xff60000000e00000 0x0000000080200000 12M
> > >>>> PMD D A G . . . R V
> > >>>>
> > >>>> And below the linear mapping with nomap in openSBI:
> > >>>>
> > >>>> ---[ Linear mapping ]---
> > >>>> 0xff60000000080000-0xff60000000200000 0x0000000080080000 1536K
> > >>>> PTE D A G . . W R V
> > >>>> 0xff60000000200000-0xff60000000e00000 0x0000000080200000 12M
> > >>>> PMD D A G . . . R V
> > >>>>
> > >>>> So adding nomap does not misalign virtual and physical addresses, it
> > >>>> prevents the usage of 1GB page for this area though, so that's a
> > >>>> solution, we just lose this 1GB page here.
> > >>>>
> > >>>> But even though that may be the fix, I think we also need to fix that
> > >>>> in the kernel as it would break compatibility with certain versions of
> > >>>> openSBI *if* we fix openSBI...So here are a few solutions:
> > >>>>
> > >>>> 1. we can mark all "mmode_resv" nodes in the device tree as nomap,
> > >>>> before the linear mapping is established (IIUC, those nodes are added
> > >>>> by openSBI to advertise PMP regions)
> > >>>> -> This amounts to the same fix as opensbi and we lose the 1GB hugepage.
> > >>> AFAIU, losing the 1 GB hugepage is a regression, which would make this
> > >>> not an option, right?
> > >> Not sure this is a real regression, I'd rather avoid it, but as
> > >> mentioned in my first answer, Mike Rapoport showed that it was making
> > >> no difference performance-wise...
> > >>
> > >>>> 2. we can tweak pfn_is_nosave function to *not* save pfn corresponding
> > >>>> to PMP regions
> > >>>> -> We don't lose the 1GB hugepage \o/
> > >>>> 3. we can use register_nosave_region() to not save the "mmode_resv"
> > >>>> regions (x86 does that
> > >>>> https://elixir.bootlin.com/linux/v6.4-rc1/source/arch/x86/kernel/e820.c#L753)
> > >>>> -> We don't lose the 1GB hugepage \o/
> > >>>> 4. Given JeeHeng pointer to
> > >>>> https://elixir.bootlin.com/linux/v6.4-rc1/source/kernel/power/snapshot.c#L1340,
> > >>>> we can mark those pages as non-readable and make the hibernation
> > >>>> process not save those pages
> > >>>> -> Very late-in-the-day idea, not sure what it's worth, we also
> > >>>> lose the 1GB hugepage...
> > >>> Ditto here re: introducing another regression.
> > >>>
> > >>>> To me, the best solution is 3 as it would prepare for other similar
> > >>>> issues later, it is similar to x86 and it allows us to keep 1GB
> > >>>> hugepages.
> > >>>>
> > >>>> I have been thinking, and to me nomap does not provide anything since
> > >>>> the kernel should not address this memory range, so if it does, we
> > >>>> must fix the kernel.
> > >>>>
> > >>>> Let me know what you all think, I'll be preparing a PoC of 3 in the meantime!
> > >>> #3 would probably get my vote too. It seems like you could use it
> > >>> dynamically if there was to be a future other provider of "mmode_resv"
> > >>> regions, rather than doing something location-specific.
> > >>>
> > >>> We should probably document these opensbi reserved memory nodes though
> > >>> in a dt-binding or w/e if we are going to be relying on them to not
> > >>> crash!
> > > Depending on a particular node name is fragile. If we really need
> > > information from DT then I suggest adding "no-save-restore" DT
> > > property in reserved memory nodes.
> >
> >
> > I understand your point, the node name is the only thing I found that
> > would work with current opensbi: any other idea what we could use instead?
> >
> >
> > >> Yes, you're right, let's see what Atish and Anup think!
> > > I think we have two possible approaches:
> > >
> > > 1) Update OpenSBI to set "no-map" DT property for firmware
> > > reserved regions. We were doing this previously but removed
> > > it later for performance reasons mentioned by Alex. It is also
> > > worth mentioning that ARM Trusted Firmware also sets "no-map"
> > > DT property for firmware reserved regions.
> > >
> > > 2) Add a new "no-save-restore" DT property in the reserved
> > > memory DT bindings. The hibernate support of Linux arch/riscv
> > > will use this DT property to exclude memory regions from
> > > save-restore. The EFI implementation of EDK2 and U-Boot
> > > should do the following:
> > > 1) Treat all memory having "no-map" DT property as EFI
> > > reserved memory
> > > 2) Treat all memory not having "no-map" DT property and
> > > not having "no-save-restore" DT property as EfiBootServicesData
> > > 3) Treat all memory not having "no-map" DT property and
> > > having "no-save-restore" DT property as EfiRuntimeServiceData
> > > (Refer,
> > > https://devicetree-specification.readthedocs.io/en/latest/chapter3-devicenodes.html#reserved-memory-and-uefi)
> > >
> > > Personally, I am leaning towards approach#1 since approach#2
> > > will require changing DeviceTree specification as well.
> >
> >
> > If needed, indeed #1 is the simplest, but I insist, to me it is not
> > needed (and we don't have it in the current opensbi), if you have
> > another opinion, I'm open to discuss it!
>

The problem with relying on the "mmode_resv" name is that there will be
other use cases where a portion of the memory must be reserved and accessing
it from the kernel will result in a fault. CoVE is such a use case, where the
TSM will probably run from a region of confidential memory which the kernel
must not access.

We have to name it "mmode_resv" as well or mark it as "no-map", which will
present a hole in the mappings. We will end up with the same 1GB hugepage
issue if we choose "no-map".

Another option is to use a compatible string or label property to indicate
that this memory region is not to be saved/restored during hibernation.
This can be documented in the RISC-V DT bindings as well as the booting guide
doc that Alex was talking about.


> I agree with you, backward compatibility with older firmwares
> is important.
>
This does break compatibility with older firmware w.r.t. the hibernation
feature. However, it is specific to hibernation only: hibernation will fail
to work if a user is running kernel > 6.4 but 0.8 < OpenSBI < 1.2.

The same problem exists if users run other firmware that doesn't set the
no-map property today. IMO, this can be documented as a known problem.


> Let's go with your proposed change to treat reserved DT nodes
> with "mmode_resv*" name as M-mode firmware memory (it could
> be any M-mode firmware). We will certainly need to document it
> somewhere as an expectation of Linux RISC-V kernel.
>
> @Sunil How about treating "mmode_resv*" as
> EfiRuntimeServiceData in EDK2 ? Other reserved memory
> nodes can follow the device tree specification.
>

Either way, we also need to fix U-Boot to match the behavior. Currently,
it treats any reserved memory without no-map property as EFI_BOOT_SERVICES_DATA.

> Regards,
> Anup
>
> >
> > Thanks for your quick answer Anup,
> >
> > Alex
> >
> >
> > >
> > > Regards,
> > > Anup
> > >
> > >> Thanks for your quick answers Conor and Song, really appreciated!
> > >>
> > >> Alex
> > >>
> > >>> Thanks for working on this,
> > >>> Conor.
> > >>>
> > > _______________________________________________
> > > linux-riscv mailing list
> > > [email protected]
> > > http://lists.infradead.org/mailman/listinfo/linux-riscv
>



--
Regards,
Atish

2023-05-25 00:51:03

by Atish Patra

[permalink] [raw]
Subject: Re: Bug report: kernel paniced when system hibernates

On Thu, May 18, 2023 at 4:58 AM Alexandre Ghiti <[email protected]> wrote:
>
> On Thu, May 18, 2023 at 12:35 PM Conor Dooley
> <[email protected]> wrote:
> >
> > On Thu, May 18, 2023 at 10:41:19AM +0200, Alexandre Ghiti wrote:
> > > On Thu, May 18, 2023 at 10:00 AM Conor Dooley
> > > <[email protected]> wrote:
> > > > On Thu, May 18, 2023 at 12:23:59PM +0530, Anup Patel wrote:
> > > > > On Wed, May 17, 2023 at 8:26 PM Alexandre Ghiti <[email protected]> wrote:
> > > > > > On Wed, May 17, 2023 at 1:28 PM Conor Dooley <[email protected]> wrote:
> > > > > > > On Wed, May 17, 2023 at 10:58:02AM +0200, Alexandre Ghiti wrote:
> > > > > > > > On Tue, May 16, 2023 at 1:12 PM Alexandre Ghiti <[email protected]> wrote:
> > > > > > >
> > > > > > > > > On Tue, May 16, 2023 at 11:24 AM Song Shuai <[email protected]> wrote:
> > > > > > > > > I actually removed this flag a few years ago, and I have to admit that
> > > > > > > > > I need to check if that's necessary: the goal of commit 3335068f8721
> > > > > > > > > ("riscv: Use PUD/P4D/PGD pages for the linear mapping") is to expose
> > > > > > > > > the "right" start of DRAM so that we can align virtual and physical
> > > > > > > > > addresses on a 1GB boundary.
> > > > > > > > >
> > > > > > > > > So I have to check if a nomap region is actually added as a
> > > > > > > > > memblock.memory.regions[] or not: if yes, that's perfect, let's add
> > > > > > > > > the nomap attributes to the PMP regions, otherwise, I don't think that
> > > > > > > > > is a good solution.
> > > > > > > >
> > > > > > > > So here is the current linear mapping without nomap in openSBI:
> > > > > > > >
> > > > > > > > ---[ Linear mapping ]---
> > > > > > > > 0xff60000000000000-0xff60000000200000 0x0000000080000000 2M
> > > > > > > > PMD D A G . . W R V
> > > > > > > > 0xff60000000200000-0xff60000000e00000 0x0000000080200000 12M
> > > > > > > > PMD D A G . . . R V
> > > > > > > >
> > > > > > > > And below the linear mapping with nomap in openSBI:
> > > > > > > >
> > > > > > > > ---[ Linear mapping ]---
> > > > > > > > 0xff60000000080000-0xff60000000200000 0x0000000080080000 1536K
> > > > > > > > PTE D A G . . W R V
> > > > > > > > 0xff60000000200000-0xff60000000e00000 0x0000000080200000 12M
> > > > > > > > PMD D A G . . . R V
> > > > > > > >
> > > > > > > > So adding nomap does not misalign virtual and physical addresses, it
> > > > > > > > prevents the usage of 1GB page for this area though, so that's a
> > > > > > > > solution, we just lose this 1GB page here.
> > > > > > > >
> > > > > > > > But even though that may be the fix, I think we also need to fix that
> > > > > > > > in the kernel as it would break compatibility with certain versions of
> > > > > > > > openSBI *if* we fix openSBI...So here are a few solutions:
> > > > > > > >
> > > > > > > > 1. we can mark all "mmode_resv" nodes in the device tree as nomap,
> > > > > > > > before the linear mapping is established (IIUC, those nodes are added
> > > > > > > > by openSBI to advertise PMP regions)
> > > > > > > > -> This amounts to the same fix as opensbi and we lose the 1GB hugepage.
> > > > > > >
> > > > > > > AFAIU, losing the 1 GB hugepage is a regression, which would make this
> > > > > > > not an option, right?
> > > > > >
> > > > > > Not sure this is a real regression, I'd rather avoid it, but as
> > > > > > mentioned in my first answer, Mike Rapoport showed that it was making
> > > > > > no difference performance-wise...
> > > >
> > > > My point was that if someone has hugepages enabled & we handle this in a
> > > > way that causes the first hugepage to be unusable, is that not a
> > > > regression? Whether hugepages provide a performance benefit is not
> > > > really related to that question AFAICT.
> > >
> > > Not being able to map certain regions of the linear mapping with a 1GB
> > > hugepage will happen, for example the kernel mapping is protected in
> > > the linear mapping so that it can't be written: so we can only map
> > > this region with 2MB hugepages. A firmware could mark a region as
> > > "no-map" and there again we would not be able to use a 1GB hugepage. I
> > > don't see that as a regression as the intention is not to *always* use
> > > 1GB hugepages, but rather to use them when possible. Does that make
> > > sense?
> >
> > Ah, so it was as I expected - I don't/didn't properly understand
> > hugepages. Thanks.
> >
> > > > Were you suggesting reverting hugepage support entirely in your original
> > > > message? If we entirely remove hugepage support, then I guess the first
> > > > hugepage cannot be lost..
> > >
> > > Given Mike Rapoport's recent talk, I think that's an option, yes.
> > >
> > > >
> > > > > > > > 2. we can tweak pfn_is_nosave function to *not* save pfn corresponding
> > > > > > > > to PMP regions
> > > > > > > > -> We don't lose the 1GB hugepage \o/
> > > > > > > > 3. we can use register_nosave_region() to not save the "mmode_resv"
> > > > > > > > regions (x86 does that
> > > > > > > > https://elixir.bootlin.com/linux/v6.4-rc1/source/arch/x86/kernel/e820.c#L753)
> > > > > > > > -> We don't lose the 1GB hugepage \o/
> > > > > > > > 4. Given JeeHeng pointer to
> > > > > > > > https://elixir.bootlin.com/linux/v6.4-rc1/source/kernel/power/snapshot.c#L1340,
> > > > > > > > we can mark those pages as non-readable and make the hibernation
> > > > > > > > process not save those pages
> > > > > > > > -> Very late-in-the-day idea, not sure what it's worth, we also
> > > > > > > > lose the 1GB hugepage...
> > > > > > >
> > > > > > > Ditto here re: introducing another regression.
> > > > > > >
> > > > > > > > To me, the best solution is 3 as it would prepare for other similar
> > > > > > > > issues later, it is similar to x86 and it allows us to keep 1GB
> > > > > > > > hugepages.
> > > > > > > >
> > > > > > > > I have been thinking, and to me nomap does not provide anything since
> > > > > > > > the kernel should not address this memory range, so if it does, we
> > > > > > > > must fix the kernel.
> > > > > > > >
> > > > > > > > Let me know what you all think, I'll be preparing a PoC of 3 in the meantime!
> > > > > > >
> > > > > > > #3 would probably get my vote too. It seems like you could use it
> > > > > > > dynamically if there was to be a future other provider of "mmode_resv"
> > > > > > > regions, rather than doing something location-specific.
> > > > > > >
> > > > > > > We should probably document these opensbi reserved memory nodes though
> > > > > > > in a dt-binding or w/e if we are going to be relying on them to not
> > > > > > > crash!
> > > > >
> > > > > Depending on a particular node name is fragile. If we really need
> > > > > information from DT then I suggest adding "no-save-restore" DT
> > > > > property in reserved memory nodes.
> > > >
> > > > We can add whatever properties we like, but where does that leave us for
> > > > the systems in the wild where their reserved memory nodes do not contain
> > > > a "no-save-restore" property or "no-map"?
> > > >
> > > > Ideally, yes, we do not depend on the node name and instead use explicit
> > > > properties - but I think we may be "forced" to fall back to checking the
> > > > node-name to cover the opensbi versions that do not contain one.
> > > > LMK if I have missed something there!
> > >
> > > Yes I agree with you, we can implement Anup's solution #1, but we need
> > > to fix the kernel anyway since if we don't, that would make the kernel
> > > hibernation support depend on a certain version of openSBI.
> >
> > It's not unreasonable to have things depend on versions of the SBI
> > implementation, but it is if they're not things that can be probed using
> > the standard interfaces!
> >
> > > > > > Yes, you're right, let's see what Atish and Anup think!
> > > > >
> > > > > I think we have two possible approaches:
> > > > >
> > > > > 1) Update OpenSBI to set "no-map" DT property for firmware
> > > > > reserved regions. We were doing this previously but removed
> > > > > it later for performance reasons mentioned by Alex. It is also
> > > > > worth mentioning that ARM Trusted Firmware also sets "no-map"
> > > > > DT property for firmware reserved regions.
> > > > >
> > > > > 2) Add a new "no-save-restore" DT property in the reserved
> > > > > memory DT bindings. The hibernate support of Linux arch/riscv
> > > > > will use this DT property to exclude memory regions from
> > > > > save-restore. The EFI implementation of EDK2 and U-Boot
> > > > > should do the following:
> > > > > 1) Treat all memory having "no-map" DT property as EFI
> > > > > reserved memory
> > > > > 2) Treat all memory not having "no-map" DT property and
> > > > > not having "no-save-restore" DT property as EfiBootServicesData
> > > > > 3) Treat all memory not having "no-map" DT property and
> > > > > having "no-save-restore" DT property as EfiRuntimeServiceData
> > > > > (Refer,
> > > > > https://devicetree-specification.readthedocs.io/en/latest/chapter3-devicenodes.html#reserved-memory-and-uefi)
> > > > >
> > > > > Personally, I am leaning towards approach#1 since approach#2
> > > > > will require changing DeviceTree specification as well.
> > > >
> > > > #1 is by far the simpler option of the two, if the consensus is that the
> > > > loss of the first hugepage is not a problem (or if it is a problem that
> > > > realistically is unavoidable).
> > >
> > > The "no-map" property does not provide much security anyway: the
> > > kernel should not touch a page that is reserved (this is where I may
> > > be wrong), so the real fix to this issue is to make the hibernation
> > > process not save those pages.
> >
> > Right, the kernel clearly needs to be able to handle the regions. I, at
> > least, was commenting on re-using no-map versus creating new properties
> > for this situation.
> > I was advocating for re-using the property & changing the kernel so as
> > not to touch the regions during hibernation.
> >
> > > > There's something else I think I might be missing here, given the
> > > > scattered nature of the reporting. This is not a problem for a system
> > > > that does not implement hibernation, which was only added in v6.4-rc1?
> > > >
> > > > That would make it not a regression after all. I think I misunderstood
> > > > the report on sw-dev to mean that this was a problem generally after
> > > > v6.4-rc1, which would have been one. Could someone please confirm that?
> > >
> > > The problem is only present since v6.4-rc1, that's not a regression,
> > > it's just that both patches landed at the same time and gave rise to
> > > this issue.
> >
> > Sick. Glad to be wrong here!
> >
> > #regzbot resolve: not a regression, feature introduced this cycle
> >
> > > > If it only affects hibernation, and is not a regression, should we make
> > > > ARCH_HIBERNATION_POSSIBLE def_bool n in Kconfig until we have agreed on
> > > > a solution for the problem?
> >
> > Any thoughts on this one?
>
> Ahah, I don't know, I tried to dodge the question :) But if we don't
> fix this issue, hibernation won't work so is it enough?
>

We can also just revert commit
3335068 ("riscv: Use PUD/P4D/PGD pages for the linear mapping")
until a proper solution is available.

IIRC, Alex also suggested the same, as there is no measured benefit
from this patch. It is good to have 1GB huge page mappings, but they can
be delayed until we have the fix for the hibernation issue.




--
Regards,
Atish

2023-05-25 13:00:23

by Conor Dooley

[permalink] [raw]
Subject: Re: Bug report: kernel paniced when system hibernates

On Wed, May 24, 2023 at 04:53:59PM -0700, Atish Patra wrote:

> We can also just revert the commit
> 3335068 ("riscv: Use PUD/P4D/PGD pages for the linear mapping")
> until a proper solution is available.

I'm actually inclined to go the other way. If hibernation never works
in a released kernel we can't regress it & since we have not settled on
a fix for this yet we don't know whether the fix would then remove
support for hibernation where it used to work.

> IIRC, Alex also suggested the same as there is no measured benefit
> with this patch.
> It is good to have 1GB huge page mappings but it can be delayed until
> we have the fix for the hibernation issue.

IIRC it was the 1 GiB hugepages that Alex said were ~worthless.



2023-05-25 13:20:45

by Conor Dooley

[permalink] [raw]
Subject: Re: Bug report: kernel paniced when system hibernates

On Wed, May 24, 2023 at 04:45:36PM -0700, Atish Patra wrote:
> On Thu, May 18, 2023 at 7:04 AM Anup Patel <[email protected]> wrote:
> > On Thu, May 18, 2023 at 5:39 PM Alexandre Ghiti <[email protected]> wrote:
> > > On 5/18/23 08:53, Anup Patel wrote:
> > > > On Wed, May 17, 2023 at 8:26 PM Alexandre Ghiti <[email protected]> wrote:
> > > >> On Wed, May 17, 2023 at 1:28 PM Conor Dooley <[email protected]> wrote:

> > > > I think we have two possible approaches:
> > > >
> > > > 1) Update OpenSBI to set "no-map" DT property for firmware
> > > > reserved regions. We were doing this previously but removed
> > > > it later for performance reasons mentioned by Alex. It is also
> > > > worth mentioning that ARM Trusted Firmware also sets "no-map"
> > > > DT property for firmware reserved regions.
> > > >
> > > > 2) Add a new "no-save-restore" DT property in the reserved
> > > > memory DT bindings. The hibernate support of Linux arch/riscv
> > > > will use this DT property to exclude memory regions from
> > > > save-restore. The EFI implementation of EDK2 and U-Boot
> > > > should do the following:
> > > > 1) Treat all memory having "no-map" DT property as EFI
> > > > reserved memory
> > > > 2) Treat all memory not having "no-map" DT property and
> > > > not having "no-save-restore" DT property as EfiBootServicesData
> > > > 3) Treat all memory not having "no-map" DT property and
> > > > having "no-save-restore" DT property as EfiRuntimeServiceData
> > > > (Refer,
> > > > https://devicetree-specification.readthedocs.io/en/latest/chapter3-devicenodes.html#reserved-memory-and-uefi)
> > > >
> > > > Personally, I am leaning towards approach#1 since approach#2
> > > > will require changing DeviceTree specification as well.
> > >
> > >
> > > If needed, indeed #1 is the simplest, but I insist, to me it is not
> > > needed (and we don't have it in the current opensbi), if you have
> > > another opinion, I'm open to discuss it!
> >
>
> The problem with relying on the "mmode_resv" name is that there will be
> other use cases where a portion of the memory must be reserved and accessing
> that from the kernel will result in fault. CoVE is such a use case where
> TSM will probably run from a memory region with confidential memory
> which the kernel must not access.

We should only rely on this node name for known bad versions of opensbi
IMO. Going forward, if something needs to be reserved for firmware, the
firmware should make sure that it is reserved by using the property for
that purpose :)

> We have to name it "mmode_resv" as well or mark it as "no-map" which will
> present a hole in mappings. We will end up in same 1GB hugepage issue
> if we choose "no-map".
>
> Another option is to use compatible string or label property to indicate
> that this memory region is not to be saved/restored during hibernation.
> This can be documented in RISC-V DT bindings as well as the booting guide
> doc that alex was talking about.

Sure, a dt-binding for sbi reserved regions doesn't immediately sound
like an awful idea... But we still have to work around the borked
firmware - be that disabling hibernation or using the mmode_resv node
when we know that the version of OpenSBI is one of those with the
problem.

> > I agree with you, backward compatibility with older firmwares
> > is important.
> >
> This does break the compatibility with the older firmware w.r.to hibernation
> feature. However
>
> It is specific to hibernation only. So hibernation will fail to work
> if an user is running kernel > 6.4 but 0.8 < OpenSBI < 1.2
>
> The same problem lies if users use other firmware that don't have
> no-map property today. IMO, this can be documented as a known problem.

I'd rather we disabled it than documented it as broken.
Or disable _and_ document it.

> > Let's go with your proposed change to treat reserved DT nodes
> > with "mmode_resv*" name as M-mode firmware memory (it could
> > be any M-mode firmware). We will certainly need to document it
> > somewhere as an expectation of Linux RISC-V kernel.
> >
> > @Sunil How about treating "mmode_resv*" as
> > EfiRuntimeServiceData in EDK2 ? Other reserved memory
> > nodes can follow the device tree specification.
> >
>
> Either way, we also need to fix U-Boot to match the behavior. Currently,
> it treats any reserved memory without no-map property as EFI_BOOT_SERVICES_DATA.

Cheers,
Conor.



2023-05-25 13:39:21

by Anup Patel

[permalink] [raw]
Subject: Re: Bug report: kernel paniced when system hibernates

On Thu, May 25, 2023 at 6:39 PM Conor Dooley <[email protected]> wrote:
>
> On Wed, May 24, 2023 at 04:45:36PM -0700, Atish Patra wrote:
> > On Thu, May 18, 2023 at 7:04 AM Anup Patel <[email protected]> wrote:
> > > On Thu, May 18, 2023 at 5:39 PM Alexandre Ghiti <[email protected]> wrote:
> > > > On 5/18/23 08:53, Anup Patel wrote:
> > > > > On Wed, May 17, 2023 at 8:26 PM Alexandre Ghiti <[email protected]> wrote:
> > > > >> On Wed, May 17, 2023 at 1:28 PM Conor Dooley <[email protected]> wrote:
>
> > > > > I think we have two possible approaches:
> > > > >
> > > > > 1) Update OpenSBI to set "no-map" DT property for firmware
> > > > > reserved regions. We were doing this previously but removed
> > > > > it later for performance reasons mentioned by Alex. It is also
> > > > > worth mentioning that ARM Trusted Firmware also sets "no-map"
> > > > > DT property for firmware reserved regions.
> > > > >
> > > > > 2) Add a new "no-save-restore" DT property in the reserved
> > > > > memory DT bindings. The hibernate support of Linux arch/riscv
> > > > > will use this DT property to exclude memory regions from
> > > > > save-restore. The EFI implementation of EDK2 and U-Boot
> > > > > should do the following:
> > > > > 1) Treat all memory having "no-map" DT property as EFI
> > > > > reserved memory
> > > > > 2) Treat all memory not having "no-map" DT property and
> > > > > not having "no-save-restore" DT property as EfiBootServicesData
> > > > > 3) Treat all memory not having "no-map" DT property and
> > > > > having "no-save-restore" DT property as EfiRuntimeServiceData
> > > > > (Refer,
> > > > > https://devicetree-specification.readthedocs.io/en/latest/chapter3-devicenodes.html#reserved-memory-and-uefi)
> > > > >
> > > > > Personally, I am leaning towards approach#1 since approach#2
> > > > > will require changing DeviceTree specification as well.
> > > >
> > > >
> > > > If needed, indeed #1 is the simplest, but I insist, to me it is not
> > > > needed (and we don't have it in the current opensbi), if you have
> > > > another opinion, I'm open to discuss it!
> > >
> >
> > The problem with relying on the "mmode_resv" name is that there will be
> > other use cases where a portion of the memory must be reserved and accessing
> > that from the kernel will result in fault. CoVE is such a use case where
> > TSM will probably run from a memory region with confidential memory
> > which the kernel must not access.
>
> We should only rely on this node name for known bad versions of opensbi
> IMO. Going forward, if something needs to be reserved for firmware, the
> firmware should make sure that it is reserved by using the property for
> that purpose :)

There is no issue with OpenSBI since it does the right thing by marking
memory as reserved in the DT. The real issue is with the kernel's handling
of reserved memory for hibernate.

Like Atish mentioned, it is not just OpenSBI: other entities (like the TSM)
or other M-mode firmware will also reserve memory in the DT/ACPI, so the
kernel clearly needs an SBI-implementation-independent way of handling
reserved memory for hibernate.

Regards,
Anup

>
> > We have to name it "mmode_resv" as well or mark it as "no-map" which will
> > present a hole in mappings. We will end up in same 1GB hugepage issue
> > if we choose "no-map".
> >
> > Another option is to use compatible string or label property to indicate
> > that this memory region is not to be saved/restored during hibernation.
> > This can be documented in RISC-V DT bindings as well as the booting guide
> > doc that alex was talking about.
>
> Sure, a dt-binding for sbi reserved regions doesn't immediately sound
> like an awful idea... But we still have to work around the borked
> firmware - be that disabling hibernation or using the mmode_resv node
> when we know that the version of OpenSBI is one of those with the
> problem.
>
> > > I agree with you, backward compatibility with older firmwares
> > > is important.
> > >
> > This does break the compatibility with the older firmware w.r.to hibernation
> > feature. However
> >
> > It is specific to hibernation only. So hibernation will fail to work
> > if an user is running kernel > 6.4 but 0.8 < OpenSBI < 1.2
> >
> > The same problem lies if users use other firmware that don't have
> > no-map property today. IMO, this can be documented as a known problem.
>
> I'd rather we disabled it than documented it as broken.
> Or disable _and_ document it.
>
> > > Let's go with your proposed change to treat reserved DT nodes
> > > with "mmode_resv*" name as M-mode firmware memory (it could
> > > be any M-mode firmware). We will certainly need to document it
> > > somewhere as an expectation of Linux RISC-V kernel.
> > >
> > > @Sunil How about treating "mmode_resv*" as
> > > EfiRuntimeServiceData in EDK2 ? Other reserved memory
> > > nodes can follow the device tree specification.
> > >
> >
> > Either way, we also need to fix U-Boot to match the behavior. Currently,
> > it treats any reserved memory without no-map property as EFI_BOOT_SERVICES_DATA.
>
> Cheers,
> Conor.

2023-05-25 13:45:33

by Conor Dooley

[permalink] [raw]
Subject: Re: Bug report: kernel paniced when system hibernates

On Thu, May 25, 2023 at 06:51:28PM +0530, Anup Patel wrote:

> > We should only rely on this node name for known bad versions of opensbi
> > IMO. Going forward, if something needs to be reserved for firmware, the
> > firmware should make sure that it is reserved by using the property for
> > that purpose :)

> There is no issue with OpenSBI since it does the right thing by marking
> memory as reserved in the DT. This real issue is with the kernel handling
> of reserved memory for hibernate.

I don't think we are talking about the same thing here. I meant the
no-map property which OpenSBI does not set.

> Like Atish mentioned, not just OpenSBI, there will be other entities
> (like TSM) or some other M-mode firmware which will also reserve
> memory in DT/ACPI so clearly kernel needs a SBI implementation
> independent way of handling reserved memory for hibernate.

> > > Another option is to use compatible string or label property to indicate
> > > that this memory region is not to be saved/restored during hibernation.
> > > This can be documented in RISC-V DT bindings as well as the booting guide
> > > doc that alex was talking about.
> >
> > Sure, a dt-binding for sbi reserved regions doesn't immediately sound
> > like an awful idea... But we still have to work around the borked
> > firmware - be that disabling hibernation or using the mmode_resv node
> > when we know that the version of OpenSBI is one of those with the
> > problem.

Did you skip over this? I was agreeing that defining a common binding for
sbi reserved regions was a good idea.



2023-05-25 13:47:16

by Anup Patel

[permalink] [raw]
Subject: Re: Bug report: kernel paniced when system hibernates

On Thu, May 25, 2023 at 7:08 PM Conor Dooley <[email protected]> wrote:
>
> On Thu, May 25, 2023 at 06:51:28PM +0530, Anup Patel wrote:
>
> > > We should only rely on this node name for known bad versions of opensbi
> > > IMO. Going forward, if something needs to be reserved for firmware, the
> > > firmware should make sure that it is reserved by using the property for
> > > that purpose :)
>
> > There is no issue with OpenSBI since it does the right thing by marking
> > memory as reserved in the DT. This real issue is with the kernel handling
> > of reserved memory for hibernate.
>
> I don't think we are talking about the same thing here. I meant the
> no-map property which OpenSBI does not set.

Yes, we are talking about the same thing. It is not just OpenSBI that omits
the no-map property in its reserved memory nodes; other SBI implementations
would be doing the same thing (i.e. not setting the no-map property).

Regards,
Anup

>
> > Like Atish mentioned, not just OpenSBI, there will be other entities
> > (like TSM) or some other M-mode firmware which will also reserve
> > memory in DT/ACPI so clearly kernel needs a SBI implementation
> > independent way of handling reserved memory for hibernate.
>
> > > > Another option is to use compatible string or label property to indicate
> > > > that this memory region is not to be saved/restored during hibernation.
> > > > This can be documented in RISC-V DT bindings as well as the booting guide
> > > > doc that alex was talking about.
> > >
> > > Sure, a dt-binding for sbi reserved regions doesn't immediately sound
> > > like an awful idea... But we still have to work around the borked
> > > firmware - be that disabling hibernation or using the mmode_resv node
> > > when we know that the version of OpenSBI is one of those with the
> > > problem.
>
> Did you skip over this? I was agreeing that defining a common binding for
> sbi reserved regions was a good idea.

2023-05-25 14:20:57

by Anup Patel

[permalink] [raw]
Subject: Re: Bug report: kernel paniced when system hibernates

On Thu, May 25, 2023 at 7:26 PM Conor Dooley <[email protected]> wrote:
>
> On Thu, May 25, 2023 at 07:13:11PM +0530, Anup Patel wrote:
> > On Thu, May 25, 2023 at 7:08 PM Conor Dooley <[email protected]> wrote:
> > >
> > > On Thu, May 25, 2023 at 06:51:28PM +0530, Anup Patel wrote:
> > >
> > > > > We should only rely on this node name for known bad versions of opensbi
> > > > > IMO. Going forward, if something needs to be reserved for firmware, the
> > > > > firmware should make sure that it is reserved by using the property for
> > > > > that purpose :)
> > >
> > > > There is no issue with OpenSBI since it does the right thing by marking
> > > > memory as reserved in the DT. This real issue is with the kernel handling
> > > > of reserved memory for hibernate.
> > >
> > > I don't think we are talking about the same thing here. I meant the
> > > no-map property which OpenSBI does not set.
> >
> > Yes, we are talking about the same thing. It's not just OpenSBI not
> > setting no-map property in reserved memory node because other
> > SBI implementations would be doing the same thing (i.e. not setting
> > no-map property)
>
> Other SBI implementations doing the same thing doesn't make it any more
> correct though, right?

Like multiple folks suggested, we need a DT binding for distinguishing
firmware reserved memory from other reserved memory. Until that
happens, we should either mark hibernate support as experimental
or revert it.
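The "experimental" option could be gated in Kconfig roughly as follows.
This is a sketch only: ARCH_HIBERNATION_POSSIBLE is the existing arch/riscv
symbol, but tying it to NONPORTABLE is an illustrative assumption, not a
submitted patch.

```kconfig
# Sketch: keep hibernation off portable kernels until the firmware
# reserved-memory handling is agreed on.
config ARCH_HIBERNATION_POSSIBLE
	def_bool NONPORTABLE
```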

Regards,
Anup

>
> > > > Like Atish mentioned, not just OpenSBI, there will be other entities
> > > > (like TSM) or some other M-mode firmware which will also reserve
> > > > memory in DT/ACPI so clearly kernel needs a SBI implementation
> > > > independent way of handling reserved memory for hibernate.
> > >
> > > > > > Another option is to use compatible string or label property to indicate
> > > > > > that this memory region is not to be saved/restored during hibernation.
> > > > > > This can be documented in RISC-V DT bindings as well as the booting guide
> > > > > > doc that alex was talking about.
> > > > >
> > > > > Sure, a dt-binding for sbi reserved regions doesn't immediately sound
> > > > > like an awful idea... But we still have to work around the borked
> > > > > firmware - be that disabling hibernation or using the mmode_resv node
> > > > > when we know that the version of OpenSBI is one of those with the
> > > > > problem.
> > >
> > > Did you skip over this? I was agreeing that defining a common binding for
> > > sbi reserved regions was a good idea.

2023-05-25 14:21:25

by Conor Dooley

[permalink] [raw]
Subject: Re: Bug report: kernel paniced when system hibernates

On Thu, May 25, 2023 at 07:13:11PM +0530, Anup Patel wrote:
> On Thu, May 25, 2023 at 7:08 PM Conor Dooley <[email protected]> wrote:
> >
> > On Thu, May 25, 2023 at 06:51:28PM +0530, Anup Patel wrote:
> >
> > > > We should only rely on this node name for known bad versions of opensbi
> > > > IMO. Going forward, if something needs to be reserved for firmware, the
> > > > firmware should make sure that it is reserved by using the property for
> > > > that purpose :)
> >
> > > There is no issue with OpenSBI since it does the right thing by marking
> > > memory as reserved in the DT. This real issue is with the kernel handling
> > > of reserved memory for hibernate.
> >
> > I don't think we are talking about the same thing here. I meant the
> > no-map property which OpenSBI does not set.
>
> Yes, we are talking about the same thing. It's not just OpenSBI not
> setting no-map property in reserved memory node because other
> SBI implementations would be doing the same thing (i.e. not setting
> no-map property)

Other SBI implementations doing the same thing doesn't make it any more
correct though, right?

> > > Like Atish mentioned, not just OpenSBI, there will be other entities
> > > (like TSM) or some other M-mode firmware which will also reserve
> > > memory in DT/ACPI so clearly kernel needs a SBI implementation
> > > independent way of handling reserved memory for hibernate.
> >
> > > > > Another option is to use compatible string or label property to indicate
> > > > > that this memory region is not to be saved/restored during hibernation.
> > > > > This can be documented in RISC-V DT bindings as well as the booting guide
> > > > > doc that alex was talking about.
> > > >
> > > > Sure, a dt-binding for sbi reserved regions doesn't immediately sound
> > > > like an awful idea... But we still have to work around the borked
> > > > firmware - be that disabling hibernation or using the mmode_resv node
> > > > when we know that the version of OpenSBI is one of those with the
> > > > problem.
> >
> > Did you skip over this? I was agreeing that defining a common binding for
> > sbi reserved regions was a good idea.
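
For reference, a reserved-memory node carrying the property in question might look like the following device-tree sketch (the node name and addresses are illustrative, loosely mirroring the QEMU virt / OpenSBI layout; OpenSBI's real node omits "no-map"):

```dts
/ {
	reserved-memory {
		#address-cells = <2>;
		#size-cells = <2>;
		ranges;

		/* Firmware (M-mode) carveout. With "no-map" the kernel
		 * would never linearly map this region, so hibernation
		 * could not touch it. */
		mmode_resv0@80000000 {
			reg = <0x0 0x80000000 0x0 0x200000>;
			no-map;
		};
	};
};
```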



2023-05-25 14:55:35

by Conor Dooley

[permalink] [raw]
Subject: Re: Bug report: kernel paniced when system hibernates

On Thu, May 25, 2023 at 07:29:46PM +0530, Anup Patel wrote:
> On Thu, May 25, 2023 at 7:26 PM Conor Dooley <[email protected]> wrote:
> >
> > On Thu, May 25, 2023 at 07:13:11PM +0530, Anup Patel wrote:
> > > On Thu, May 25, 2023 at 7:08 PM Conor Dooley <[email protected]> wrote:
> > > >
> > > > On Thu, May 25, 2023 at 06:51:28PM +0530, Anup Patel wrote:
> > > >
> > > > > > We should only rely on this node name for known bad versions of opensbi
> > > > > > IMO. Going forward, if something needs to be reserved for firmware, the
> > > > > > firmware should make sure that it is reserved by using the property for
> > > > > > that purpose :)
> > > >
> > > > > There is no issue with OpenSBI since it does the right thing by marking
> > > > > memory as reserved in the DT. This real issue is with the kernel handling
> > > > > of reserved memory for hibernate.
> > > >
> > > > I don't think we are talking about the same thing here. I meant the
> > > > no-map property which OpenSBI does not set.
> > >
> > > Yes, we are talking about the same thing. It's not just OpenSBI not
> > > setting no-map property in reserved memory node because other
> > > SBI implementations would be doing the same thing (i.e. not setting
> > > no-map property)
> >
> > Other SBI implementations doing the same thing doesn't make it any more
> > correct though, right?
>
> Like multiple folks suggested, we need DT binding for distinguishing
> firmware reserved memory from other reserved memory.

And I have agreed with that multiple times!

> Until that
> happens we should either mark hibernate support as experimental
> or revert it.

That works for me. How about the below?

-- >8 --
From 1d4381290a1600eff9b29b8ace6be73955d9726c Mon Sep 17 00:00:00 2001
From: Conor Dooley <[email protected]>
Date: Thu, 25 May 2023 15:09:08 +0100
Subject: [PATCH] RISC-V: mark hibernation as broken

Hibernation support depends on firmware marking its reserved
regions as not mappable by Linux. As things stand, the de-facto SBI
implementation (OpenSBI) does not do this, and other implementations may
not do so either, resulting in kernel panics during hibernation ([1],
[2]).

Disable support for hibernation until such time that an SBI
implementation independent way to communicate what regions are reserved
has been agreed upon.

Reported-by: Song Shuai <[email protected]>
Link: https://lore.kernel.org/all/CAAYs2=gQvkhTeioMmqRDVGjdtNF_vhB+vm_1dHJxPNi75YDQ_Q@mail.gmail.com/ [1]
Reported-by: JeeHeng Sia <[email protected]>
Link: https://groups.google.com/a/groups.riscv.org/g/sw-dev/c/ITXwaKfA6z8
Signed-off-by: Conor Dooley <[email protected]>
---
arch/riscv/Kconfig | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 13f058490608..b2495192f35a 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -801,7 +801,7 @@ menu "Power management options"
source "kernel/power/Kconfig"

config ARCH_HIBERNATION_POSSIBLE
- def_bool y
+ def_bool n

config ARCH_HIBERNATION_HEADER
def_bool HIBERNATION
--
2.39.2



2023-05-25 17:45:33

by Atish Patra

[permalink] [raw]
Subject: Re: Bug report: kernel paniced when system hibernates

On Thu, May 25, 2023 at 7:21 AM Conor Dooley <[email protected]> wrote:
>
> On Thu, May 25, 2023 at 07:29:46PM +0530, Anup Patel wrote:
> > On Thu, May 25, 2023 at 7:26 PM Conor Dooley <[email protected]> wrote:
> > >
> > > On Thu, May 25, 2023 at 07:13:11PM +0530, Anup Patel wrote:
> > > > On Thu, May 25, 2023 at 7:08 PM Conor Dooley <[email protected]> wrote:
> > > > >
> > > > > On Thu, May 25, 2023 at 06:51:28PM +0530, Anup Patel wrote:
> > > > >
> > > > > > > We should only rely on this node name for known bad versions of opensbi
> > > > > > > IMO. Going forward, if something needs to be reserved for firmware, the
> > > > > > > firmware should make sure that it is reserved by using the property for
> > > > > > > that purpose :)
> > > > >
> > > > > > There is no issue with OpenSBI since it does the right thing by marking
> > > > > > memory as reserved in the DT. This real issue is with the kernel handling
> > > > > > of reserved memory for hibernate.
> > > > >
> > > > > I don't think we are talking about the same thing here. I meant the
> > > > > no-map property which OpenSBI does not set.
> > > >
> > > > Yes, we are talking about the same thing. It's not just OpenSBI not
> > > > setting no-map property in reserved memory node because other
> > > > SBI implementations would be doing the same thing (i.e. not setting
> > > > no-map property)
> > >
> > > Other SBI implementations doing the same thing doesn't make it any more
> > > correct though, right?
> >
> > Like multiple folks suggested, we need DT binding for distinguishing
> > firmware reserved memory from other reserved memory.
>
> And I have agreed with multiple times!
>
> > Until that
> > happens we should either mark hibernate support as experimental
> > or revert it.
>
> That works for me. How about the below?
>

Instead of disabling hibernate support, why not revert commit
3335068 ("riscv: Use PUD/P4D/PGD pages for the linear mapping"),
which doesn't add any "measured" value at this point? Keeping the
hibernation feature on while disabling the huge-page linear mapping
will get hibernation more testing, whereas disabling hibernation and
keeping the above patch has no value at all.

We don't have a regression at this point. So either approach will work though.

If we choose to go this route, some thoughts about the commit message.
> -- >8 --
> From 1d4381290a1600eff9b29b8ace6be73955d9726c Mon Sep 17 00:00:00 2001
> From: Conor Dooley <[email protected]>
> Date: Thu, 25 May 2023 15:09:08 +0100
> Subject: [PATCH] RISC-V: mark hibernation as broken
>
> Hibernation support depends on firmware marking its reserved
> regions as not mappable by Linux. As things stand, the de-facto SBI

either not mappable or not save/restore capable (as we still have not
concluded which way we want to go)

> implementation (OpenSBI) does not do this, and other implementations may
> not do so either, resulting in kernel panics during hibernation ([1],
> [2]).
>

We should probably add more context in the commit message.
How about adding something along these lines:

As things stand, the latest version of the de-facto SBI
implementation (OpenSBI) no longer does this, in order to allow 1G
huge page mappings by the kernel. Other SBI implementations are
probably doing the same. Until commit 3335068 ("riscv: Use
PUD/P4D/PGD pages for the linear mapping"), the first 2MB region of
DRAM (where the firmware typically resides) was not mappable by the
kernel. However, enabling that mapping resulted in kernel panics
during hibernation ([1], [2]) because the hibernation process tries
to save/restore any mapped region even though it is marked as reserved.


> Disable support for hibernation until such time that an SBI
> implementation independent way to communicate what regions are reserved
> has been agreed upon.
>

Anybody who wants to test the hibernation feature must revert the
above-mentioned patch in addition to turning on the config.

> Reported-by: Song Shuai <[email protected]>
> Link: https://lore.kernel.org/all/CAAYs2=gQvkhTeioMmqRDVGjdtNF_vhB+vm_1dHJxPNi75YDQ_Q@mail.gmail.com/ [1]
> Reported-by: JeeHeng Sia <[email protected]>
> Link: https://groups.google.com/a/groups.riscv.org/g/sw-dev/c/ITXwaKfA6z8
> Signed-off-by: Conor Dooley <[email protected]>
> ---
> arch/riscv/Kconfig | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> index 13f058490608..b2495192f35a 100644
> --- a/arch/riscv/Kconfig
> +++ b/arch/riscv/Kconfig
> @@ -801,7 +801,7 @@ menu "Power management options"
> source "kernel/power/Kconfig"
>
> config ARCH_HIBERNATION_POSSIBLE
> - def_bool y
> + def_bool n
>
> config ARCH_HIBERNATION_HEADER
> def_bool HIBERNATION
> --
> 2.39.2
>


--
Regards,
Atish

2023-05-25 18:33:31

by Conor Dooley

[permalink] [raw]
Subject: Re: Bug report: kernel paniced when system hibernates

Hey Atish,

On Thu, May 25, 2023 at 10:39:44AM -0700, Atish Patra wrote:
> > How about the below?

> Instead of disabling hibernate support why not revert the patch
> 3335068 ("riscv: Use PUD/P4D/PGD pages for the linear mapping")
> which doesn't add any "measured" value at this point.
> However, keeping the hibernation feature on and disabling linear
> mapping will get more testing on hibernation.
> While disabling hibernation and keeping the above patch which
> doesn't have any value at all.
>
> We don't have a regression at this point. So either approach will work though.

I favoured this approach so that we do not release a kernel in which
hibernate works for these versions of OpenSBI and then stops working in
the future when we shore up how communicating this is supposed to work.
It allows us to fix the problem "properly" in slow-time, instead of
racing against v6.4's release.

I happened to be talking to Palmer and he suggested making it depend on
NONPORTABLE:
|> config NONPORTABLE
|> bool "Allow configurations that result in non-portable kernels"
|> help
|> RISC-V kernel binaries are compatible between all known systems
|> whenever possible, but there are some use cases that can only be
|> satisfied by configurations that result in kernel binaries that are
|> not portable between systems.
|>
|> Selecting N does not guarantee kernels will be portable to all known
|> systems. Selecting any of the options guarded by NONPORTABLE will
|> result in kernel binaries that are unlikely to be portable between
|> systems.
|>
|> If unsure, say N.

I actually think that makes more sense, as it may well be fine
to use hibernation depending on what your SBI implementation does.
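
A NONPORTABLE-gated variant of the patch below might look like this Kconfig sketch (illustrative only; the final form would be settled in review):

```
config ARCH_HIBERNATION_POSSIBLE
	def_bool NONPORTABLE
```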

> If we choose to go this route, some thoughts about the commit message.
> > -- >8 --
> > From 1d4381290a1600eff9b29b8ace6be73955d9726c Mon Sep 17 00:00:00 2001
> > From: Conor Dooley <[email protected]>
> > Date: Thu, 25 May 2023 15:09:08 +0100
> > Subject: [PATCH] RISC-V: mark hibernation as broken
> >
> > Hibernation support depends on firmware marking its reserved
> > regions as not mappable by Linux. As things stand, the de-facto SBI
>
> either not mappable or no save/restore capable (as We still have not
> concluded which way we want to go in)

s/mappable/accessible/? Sounds like a good catch all?

>
> > implementation (OpenSBI) does not do this, and other implementations may
> > not do so either, resulting in kernel panics during hibernation ([1],
> > [2]).
> >
>
> we should probably add more context in the commit message.
> How about adding something along these lines:
>
> As things stand, the latest version of de-facto SBI
> implementation(OpenSBI) doesn't
> do this any more to allow 1G huge page mappings by kernel. Other SBI
> implementations are probably
> doing the same. Until the commit 3335068 ("riscv: Use PUD/P4D/PGD
> pages for the linear mapping"),
> the first 2MB region of DRAM (where the typically firmware resides)
> was not mappable by kernel. However,
> enabling that mapping resulted in the kernel panics during hibernation
> ([1], [2]) as the hibernation process
> tries to save/restore any mapped region even though it is marked as reserved.

SGTM, I could go with that.

> > Disable support for hibernation until such time that an SBI
> > implementation independent way to communicate what regions are reserved
> > has been agreed upon.
> >
>
> Anybody who wants to test the hibernation feature must revert the
> above mentioned patch along with turning on
> the config.

This goes away with the use of NONPORTABLE, although I would work a
mention of the config option into the commit message.

Thanks,
Conor.

> > Reported-by: Song Shuai <[email protected]>
> > Link: https://lore.kernel.org/all/CAAYs2=gQvkhTeioMmqRDVGjdtNF_vhB+vm_1dHJxPNi75YDQ_Q@mail.gmail.com/ [1]
> > Reported-by: JeeHeng Sia <[email protected]>
> > Link: https://groups.google.com/a/groups.riscv.org/g/sw-dev/c/ITXwaKfA6z8
> > Signed-off-by: Conor Dooley <[email protected]>
> > ---
> > arch/riscv/Kconfig | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> > index 13f058490608..b2495192f35a 100644
> > --- a/arch/riscv/Kconfig
> > +++ b/arch/riscv/Kconfig
> > @@ -801,7 +801,7 @@ menu "Power management options"
> > source "kernel/power/Kconfig"
> >
> > config ARCH_HIBERNATION_POSSIBLE
> > - def_bool y
> > + def_bool n
> >
> > config ARCH_HIBERNATION_HEADER
> > def_bool HIBERNATION



2023-05-25 18:48:13

by Conor Dooley

[permalink] [raw]
Subject: Re: Bug report: kernel paniced when system hibernates

On Thu, May 25, 2023 at 11:37:40AM -0700, Atish Patra wrote:

> Any testing of hibernation still needs to revert the patch until we
> have the proper fix.

"the patch" is what exactly? I assume you don't mean depending on
NONPORTABLE, since that is a Kconfig option.



2023-05-25 19:45:24

by Atish Patra

[permalink] [raw]
Subject: Re: Bug report: kernel paniced when system hibernates

On Thu, May 25, 2023 at 11:22 AM Conor Dooley <[email protected]> wrote:
>
> Hey Atish,
>
> On Thu, May 25, 2023 at 10:39:44AM -0700, Atish Patra wrote:
> > > How about the below?
>
> > Instead of disabling hibernate support why not revert the patch
> > 3335068 ("riscv: Use PUD/P4D/PGD pages for the linear mapping")
> > which doesn't add any "measured" value at this point.
> > However, keeping the hibernation feature on and disabling linear
> > mapping will get more testing on hibernation.
> > While disabling hibernation and keeping the above patch which
> > doesn't have any value at all.
> >
> > We don't have a regression at this point. So either approach will work though.
>
> I favoured this approach so that we do not release a kernel in which
> hibernate works for these versions of OpenSBI and then stops working in
> the future when we shore up how communicating this is supposed to work.
> It allows us to fix the problem "properly" in slow-time, instead of
> racing against v6.4's release.
>

Fair enough.

> I happened to be talking to Palmer and he suggested making it depend on
> NONPORTABLE:
> |> config NONPORTABLE
> |> bool "Allow configurations that result in non-portable kernels"
> |> help
> |> RISC-V kernel binaries are compatible between all known systems
> |> whenever possible, but there are some use cases that can only be
> |> satisfied by configurations that result in kernel binaries that are
> |> not portable between systems.
> |>
> |> Selecting N does not guarantee kernels will be portable to all known
> |> systems. Selecting any of the options guarded by NONPORTABLE will
> |> result in kernel binaries that are unlikely to be portable between
> |> systems.
> |>
> |> If unsure, say N.
>
> I actually think that that makes more sense, as it may actually be fine
> to use hibernation depending on what your SBI implementation does.
>

That works too.

> > If we choose to go this route, some thoughts about the commit message.
> > > -- >8 --
> > > From 1d4381290a1600eff9b29b8ace6be73955d9726c Mon Sep 17 00:00:00 2001
> > > From: Conor Dooley <[email protected]>
> > > Date: Thu, 25 May 2023 15:09:08 +0100
> > > Subject: [PATCH] RISC-V: mark hibernation as broken
> > >
> > > Hibernation support depends on firmware marking its reserved
> > > regions as not mappable by Linux. As things stand, the de-facto SBI
> >
> > either not mappable or no save/restore capable (as We still have not
> > concluded which way we want to go in)
>
> s/mappable/accessible/? Sounds like a good catch all?
>

Yeah.

> >
> > > implementation (OpenSBI) does not do this, and other implementations may
> > > not do so either, resulting in kernel panics during hibernation ([1],
> > > [2]).
> > >
> >
> > we should probably add more context in the commit message.
> > How about adding something along these lines:
> >
> > As things stand, the latest version of de-facto SBI
> > implementation(OpenSBI) doesn't
> > do this any more to allow 1G huge page mappings by kernel. Other SBI
> > implementations are probably
> > doing the same. Until the commit 3335068 ("riscv: Use PUD/P4D/PGD
> > pages for the linear mapping"),
> > the first 2MB region of DRAM (where the typically firmware resides)
> > was not mappable by kernel. However,
> > enabling that mapping resulted in the kernel panics during hibernation
> > ([1], [2]) as the hibernation process
> > tries to save/restore any mapped region even though it is marked as reserved.
>
> SGTM, I could go with that.
>
> > > Disable support for hibernation until such time that an SBI
> > > implementation independent way to communicate what regions are reserved
> > > has been agreed upon.
> > >
> >
> > Anybody who wants to test the hibernation feature must revert the
> > above mentioned patch along with turning on
> > the config.
>
> This goes away with the use of non-portable, although I would work
> mention of the config option into the commit message.
>

Any testing of hibernation still needs to revert the patch until we
have the proper fix.

> Thanks,
> Conor.
>
> > > Reported-by: Song Shuai <[email protected]>
> > > Link: https://lore.kernel.org/all/CAAYs2=gQvkhTeioMmqRDVGjdtNF_vhB+vm_1dHJxPNi75YDQ_Q@mail.gmail.com/ [1]
> > > Reported-by: JeeHeng Sia <[email protected]>
> > > Link: https://groups.google.com/a/groups.riscv.org/g/sw-dev/c/ITXwaKfA6z8
> > > Signed-off-by: Conor Dooley <[email protected]>
> > > ---
> > > arch/riscv/Kconfig | 2 +-
> > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> > > index 13f058490608..b2495192f35a 100644
> > > --- a/arch/riscv/Kconfig
> > > +++ b/arch/riscv/Kconfig
> > > @@ -801,7 +801,7 @@ menu "Power management options"
> > > source "kernel/power/Kconfig"
> > >
> > > config ARCH_HIBERNATION_POSSIBLE
> > > - def_bool y
> > > + def_bool n
> > >
> > > config ARCH_HIBERNATION_HEADER
> > > def_bool HIBERNATION
>


--
Regards,
Atish

2023-05-25 20:44:56

by Atish Patra

[permalink] [raw]
Subject: Re: Bug report: kernel paniced when system hibernates

On Thu, May 25, 2023 at 11:39 AM Conor Dooley <[email protected]> wrote:
>
> On Thu, May 25, 2023 at 11:37:40AM -0700, Atish Patra wrote:
>
> > Any testing of hibernation still needs to revert the patch until we
> > have the proper fix.
>
> "the patch" is what exactly? I assume you don't mean depending on
> NONPORTABLE, since that is a Kconfig option.

Nope. Sorry I meant the commit

3335068 ("riscv: Use PUD/P4D/PGD pages for the linear mapping")


--
Regards,
Atish

2023-05-25 21:37:07

by Conor Dooley

[permalink] [raw]
Subject: Re: Bug report: kernel paniced when system hibernates

On Thu, May 25, 2023 at 01:06:04PM -0700, Atish Patra wrote:
> On Thu, May 25, 2023 at 11:39 AM Conor Dooley <[email protected]> wrote:
> >
> > On Thu, May 25, 2023 at 11:37:40AM -0700, Atish Patra wrote:
> >
> > > Any testing of hibernation still needs to revert the patch until we
> > > have the proper fix.
> >
> > "the patch" is what exactly? I assume you don't mean depending on
> > NONPORTABLE, since that is a Kconfig option.
>
> Nope. Sorry I meant the commit
>
> 3335068 ("riscv: Use PUD/P4D/PGD pages for the linear mapping")

Ah, if your SBI implementation is one of the affected ones, yeah.
If not, you can just set NONPORTABLE :)



2023-05-26 13:19:22

by Alexandre Ghiti

[permalink] [raw]
Subject: Re: Bug report: kernel paniced when system hibernates

Hi everyone,

On Thu, May 25, 2023 at 11:24 PM Conor Dooley <[email protected]> wrote:
>
> On Thu, May 25, 2023 at 01:06:04PM -0700, Atish Patra wrote:
> > On Thu, May 25, 2023 at 11:39 AM Conor Dooley <[email protected]> wrote:
> > >
> > > On Thu, May 25, 2023 at 11:37:40AM -0700, Atish Patra wrote:
> > >
> > > > Any testing of hibernation still needs to revert the patch until we
> > > > have the proper fix.
> > >
> > > "the patch" is what exactly? I assume you don't mean depending on
> > > NONPORTABLE, since that is a Kconfig option.
> >
> > Nope. Sorry I meant the commit
> >
> > 3335068 ("riscv: Use PUD/P4D/PGD pages for the linear mapping")
>
> Ah, if your SBI implementation is one of the affected ones, yeah.
> If not, you can just set NONPORTABLE :)

@Björn Töpel floated the idea of excluding from hibernation all the
memory nodes in the "/reserved-memory" node
(https://elixir.bootlin.com/linux/v6.4-rc1/source/Documentation/devicetree/bindings/reserved-memory/reserved-memory.yaml):
I have to admit that I don't see why the kernel does not do this by
default.

Unless there is stuff in this node that needs to be "hibernated", I
think that would be a very good solution since we would not rely on
the name of the "internal" nodes of "/reserved-memory" (i.e.
"mmode_resv").

I'm digging into why it is not done by default; I just wanted to have
your feedback before the weekend :)

2023-05-26 15:18:17

by Conor Dooley

[permalink] [raw]
Subject: Re: Bug report: kernel paniced when system hibernates

On Fri, May 26, 2023 at 03:14:33PM +0200, Alexandre Ghiti wrote:
> Hi everyone,
>
> On Thu, May 25, 2023 at 11:24 PM Conor Dooley <[email protected]> wrote:
> >
> > On Thu, May 25, 2023 at 01:06:04PM -0700, Atish Patra wrote:
> > > On Thu, May 25, 2023 at 11:39 AM Conor Dooley <[email protected]> wrote:
> > > >
> > > > On Thu, May 25, 2023 at 11:37:40AM -0700, Atish Patra wrote:
> > > >
> > > > > Any testing of hibernation still needs to revert the patch until we
> > > > > have the proper fix.
> > > >
> > > > "the patch" is what exactly? I assume you don't mean depending on
> > > > NONPORTABLE, since that is a Kconfig option.
> > >
> > > Nope. Sorry I meant the commit
> > >
> > > 3335068 ("riscv: Use PUD/P4D/PGD pages for the linear mapping")
> >
> > Ah, if your SBI implementation is one of the affected ones, yeah.
> > If not, you can just set NONPORTABLE :)
>
> @Björn Töpel emitted the idea of excluding from the hibernation all
> the memory nodes in the "/reserved-memory" node
> (https://elixir.bootlin.com/linux/v6.4-rc1/source/Documentation/devicetree/bindings/reserved-memory/reserved-memory.yaml):
> I have to admit that I don't see why it is not done by default by the
> kernel.

My understanding was that it was perfectly fine to use reserved memory
nodes to fence off some memory to use in device drivers etc, which then
may need to be saved/restored.

> Unless there is stuff in this node that needs to be "hibernated", I
> think that would be a very good solution since we would not rely on
> the name of the "internal" nodes of "/reserved-memory" (i.e.
> "mmode_resv").
>
> I'm digging into why it is not done by default, just wanted to have
> your feedback before the week-end :)



2023-05-26 15:26:15

by Anup Patel

[permalink] [raw]
Subject: Re: Bug report: kernel paniced when system hibernates

On Fri, May 26, 2023 at 8:42 PM Alexandre Ghiti <[email protected]> wrote:
>
>
> On 26/05/2023 16:59, Conor Dooley wrote:
> > On Fri, May 26, 2023 at 03:14:33PM +0200, Alexandre Ghiti wrote:
> >> Hi everyone,
> >>
> >> On Thu, May 25, 2023 at 11:24 PM Conor Dooley <[email protected]> wrote:
> >>> On Thu, May 25, 2023 at 01:06:04PM -0700, Atish Patra wrote:
> >>>> On Thu, May 25, 2023 at 11:39 AM Conor Dooley <[email protected]> wrote:
> >>>>> On Thu, May 25, 2023 at 11:37:40AM -0700, Atish Patra wrote:
> >>>>>
> >>>>>> Any testing of hibernation still needs to revert the patch until we
> >>>>>> have the proper fix.
> >>>>> "the patch" is what exactly? I assume you don't mean depending on
> >>>>> NONPORTABLE, since that is a Kconfig option.
> >>>> Nope. Sorry I meant the commit
> >>>>
> >>>> 3335068 ("riscv: Use PUD/P4D/PGD pages for the linear mapping")
> >>> Ah, if your SBI implementation is one of the affected ones, yeah.
> >>> If not, you can just set NONPORTABLE :)
> >> @Björn Töpel emitted the idea of excluding from the hibernation all
> >> the memory nodes in the "/reserved-memory" node
> >> (https://elixir.bootlin.com/linux/v6.4-rc1/source/Documentation/devicetree/bindings/reserved-memory/reserved-memory.yaml):
> >> I have to admit that I don't see why it is not done by default by the
> >> kernel.
> > My understanding was that it was perfectly fine to use reserved memory
> > nodes to fence off some memory to use in device drivers etc, which then
> > may need to be saved/restored.
>
>
> Agreed, but I would say that it's up to the driver then to take care of
> that, see https://docs.kernel.org/driver-api/pm/notifiers.html

I agree, it should be the driver's responsibility to save/restore the
dedicated reserved memory it uses.

Although I think we should at least save/restore reserved memory
regions that have the "reusable" property set.
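
As a hypothetical device-tree fragment, this is the kind of region meant here: a "reusable" carveout (a CMA-style pool) may hold live kernel data while its device is idle, so its contents would need to be part of the hibernation image (properties and sizes are illustrative):

```dts
reserved-memory {
	#address-cells = <2>;
	#size-cells = <2>;
	ranges;

	linux,cma {
		compatible = "shared-dma-pool";
		reusable;	/* the OS may reuse these pages */
		size = <0x0 0x4000000>;
	};
};
```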

Regards,
Anup

>
>
> >> Unless there is stuff in this node that needs to be "hibernated", I
> >> think that would be a very good solution since we would not rely on
> >> the name of the "internal" nodes of "/reserved-memory" (i.e.
> >> "mmode_resv").
> >>
> >> I'm digging into why it is not done by default, just wanted to have
> >> your feedback before the week-end :)

2023-05-26 15:34:18

by Alexandre Ghiti

[permalink] [raw]
Subject: Re: Bug report: kernel paniced when system hibernates

On Fri, May 26, 2023 at 5:17 PM Anup Patel <[email protected]> wrote:
>
> On Fri, May 26, 2023 at 8:42 PM Alexandre Ghiti <[email protected]> wrote:
> >
> >
> > On 26/05/2023 16:59, Conor Dooley wrote:
> > > On Fri, May 26, 2023 at 03:14:33PM +0200, Alexandre Ghiti wrote:
> > >> Hi everyone,
> > >>
> > >> On Thu, May 25, 2023 at 11:24 PM Conor Dooley <[email protected]> wrote:
> > >>> On Thu, May 25, 2023 at 01:06:04PM -0700, Atish Patra wrote:
> > >>>> On Thu, May 25, 2023 at 11:39 AM Conor Dooley <[email protected]> wrote:
> > >>>>> On Thu, May 25, 2023 at 11:37:40AM -0700, Atish Patra wrote:
> > >>>>>
> > >>>>>> Any testing of hibernation still needs to revert the patch until we
> > >>>>>> have the proper fix.
> > >>>>> "the patch" is what exactly? I assume you don't mean depending on
> > >>>>> NONPORTABLE, since that is a Kconfig option.
> > >>>> Nope. Sorry I meant the commit
> > >>>>
> > >>>> 3335068 ("riscv: Use PUD/P4D/PGD pages for the linear mapping")
> > >>> Ah, if your SBI implementation is one of the affected ones, yeah.
> > >>> If not, you can just set NONPORTABLE :)
> > >> @Björn Töpel emitted the idea of excluding from the hibernation all
> > >> the memory nodes in the "/reserved-memory" node
> > >> (https://elixir.bootlin.com/linux/v6.4-rc1/source/Documentation/devicetree/bindings/reserved-memory/reserved-memory.yaml):
> > >> I have to admit that I don't see why it is not done by default by the
> > >> kernel.
> > > My understanding was that it was perfectly fine to use reserved memory
> > > nodes to fence off some memory to use in device drivers etc, which then
> > > may need to be saved/restored.
> >
> >
> > Agreed, but I would say that it's up to the driver then to take care of
> > that, see https://docs.kernel.org/driver-api/pm/notifiers.html
>
> I agree, it should be drivers responsibility to save/restore the dedicated
> reserved memory used by itself.
>
> Although, I think we should at least save/restore reserved memory
> regions having "reusable" property set.

Good point! I'll propose an RFC and gather feedback from the people in
charge of the hibernation process.

>
> Regards,
> Anup
>
> >
> >
> > >> Unless there is stuff in this node that needs to be "hibernated", I
> > >> think that would be a very good solution since we would not rely on
> > >> the name of the "internal" nodes of "/reserved-memory" (i.e.
> > >> "mmode_resv").
> > >>
> > >> I'm digging into why it is not done by default, just wanted to have
> > >> your feedback before the week-end :)

2023-05-26 15:36:23

by Alexandre Ghiti

[permalink] [raw]
Subject: Re: Bug report: kernel paniced when system hibernates


On 26/05/2023 16:59, Conor Dooley wrote:
> On Fri, May 26, 2023 at 03:14:33PM +0200, Alexandre Ghiti wrote:
>> Hi everyone,
>>
>> On Thu, May 25, 2023 at 11:24 PM Conor Dooley <[email protected]> wrote:
>>> On Thu, May 25, 2023 at 01:06:04PM -0700, Atish Patra wrote:
>>>> On Thu, May 25, 2023 at 11:39 AM Conor Dooley <[email protected]> wrote:
>>>>> On Thu, May 25, 2023 at 11:37:40AM -0700, Atish Patra wrote:
>>>>>
>>>>>> Any testing of hibernation still needs to revert the patch until we
>>>>>> have the proper fix.
>>>>> "the patch" is what exactly? I assume you don't mean depending on
>>>>> NONPORTABLE, since that is a Kconfig option.
>>>> Nope. Sorry I meant the commit
>>>>
>>>> 3335068 ("riscv: Use PUD/P4D/PGD pages for the linear mapping")
>>> Ah, if your SBI implementation is one of the affected ones, yeah.
>>> If not, you can just set NONPORTABLE :)
>> @Björn Töpel emitted the idea of excluding from the hibernation all
>> the memory nodes in the "/reserved-memory" node
>> (https://elixir.bootlin.com/linux/v6.4-rc1/source/Documentation/devicetree/bindings/reserved-memory/reserved-memory.yaml):
>> I have to admit that I don't see why it is not done by default by the
>> kernel.
> My understanding was that it was perfectly fine to use reserved memory
> nodes to fence off some memory to use in device drivers etc, which then
> may need to be saved/restored.


Agreed, but I would say that it's up to the driver then to take care of
that, see https://docs.kernel.org/driver-api/pm/notifiers.html
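
A driver-side hook could be sketched along these lines (a kernel-style sketch, not tested; "mydrv" and the save/restore helpers are hypothetical):

```c
#include <linux/notifier.h>
#include <linux/suspend.h>

static void mydrv_save_carveout(void);		/* hypothetical helper */
static void mydrv_restore_carveout(void);	/* hypothetical helper */

/* Save the driver-owned reserved carveout before the hibernation
 * image is written, and restore it afterwards. */
static int mydrv_pm_notify(struct notifier_block *nb,
			   unsigned long action, void *data)
{
	switch (action) {
	case PM_HIBERNATION_PREPARE:
		mydrv_save_carveout();
		break;
	case PM_POST_HIBERNATION:
		mydrv_restore_carveout();
		break;
	}
	return NOTIFY_OK;
}

static struct notifier_block mydrv_pm_nb = {
	.notifier_call = mydrv_pm_notify,
};

/* In probe: register_pm_notifier(&mydrv_pm_nb); */
```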


>> Unless there is stuff in this node that needs to be "hibernated", I
>> think that would be a very good solution since we would not rely on
>> the name of the "internal" nodes of "/reserved-memory" (i.e.
>> "mmode_resv").
>>
>> I'm digging into why it is not done by default, just wanted to have
>> your feedback before the week-end :)

2023-05-26 19:10:27

by Atish Patra

[permalink] [raw]
Subject: Re: Bug report: kernel paniced when system hibernates

On Fri, May 26, 2023 at 8:22 AM Alexandre Ghiti <[email protected]> wrote:
>
> On Fri, May 26, 2023 at 5:17 PM Anup Patel <[email protected]> wrote:
> >
> > On Fri, May 26, 2023 at 8:42 PM Alexandre Ghiti <[email protected]> wrote:
> > >
> > >
> > > On 26/05/2023 16:59, Conor Dooley wrote:
> > > > On Fri, May 26, 2023 at 03:14:33PM +0200, Alexandre Ghiti wrote:
> > > >> Hi everyone,
> > > >>
> > > >> On Thu, May 25, 2023 at 11:24 PM Conor Dooley <[email protected]> wrote:
> > > >>> On Thu, May 25, 2023 at 01:06:04PM -0700, Atish Patra wrote:
> > > >>>> On Thu, May 25, 2023 at 11:39 AM Conor Dooley <[email protected]> wrote:
> > > >>>>> On Thu, May 25, 2023 at 11:37:40AM -0700, Atish Patra wrote:
> > > >>>>>
> > > >>>>>> Any testing of hibernation still needs to revert the patch until we
> > > >>>>>> have the proper fix.
> > > >>>>> "the patch" is what exactly? I assume you don't mean depending on
> > > >>>>> NONPORTABLE, since that is a Kconfig option.
> > > >>>> Nope. Sorry I meant the commit
> > > >>>>
> > > >>>> 3335068 ("riscv: Use PUD/P4D/PGD pages for the linear mapping")
> > > >>> Ah, if your SBI implementation is one of the affected ones, yeah.
> > > >>> If not, you can just set NONPORTABLE :)
> > > >> @Björn Töpel floated the idea of excluding from hibernation all
> > > >> the memory nodes in the "/reserved-memory" node
> > > >> (https://elixir.bootlin.com/linux/v6.4-rc1/source/Documentation/devicetree/bindings/reserved-memory/reserved-memory.yaml):
> > > >> I have to admit that I don't see why it is not done by default by the
> > > >> kernel.
> > > > My understanding was that it was perfectly fine to use reserved memory
> > > > nodes to fence off some memory to use in device drivers etc, which then
> > > > may need to be saved/restored.
> > >
> > >
> > > Agreed, but I would say that it's up to the driver then to take care of
> > > that, see https://docs.kernel.org/driver-api/pm/notifiers.html
> >
> > I agree, it should be the driver's responsibility to save/restore the
> > dedicated reserved memory it uses.
> >
> > Although, I think we should at least save/restore reserved memory
> > regions having "reusable" property set.
>

That would certainly be ideal. However, that piece of code has been
around for ages (the last commit touching it was in 2008!).
There may be a bunch of drivers written with this built-in assumption.
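
[Editor's note] For readers following along, the distinction Anup draws is
between the "no-map" and "reusable" properties of the reserved-memory
binding. A hand-written devicetree fragment (node names, addresses, and
sizes are illustrative, not taken from the Qemu virt machine) would look
like this:

```dts
reserved-memory {
	#address-cells = <2>;
	#size-cells = <2>;
	ranges;

	/* Firmware-owned (e.g. OpenSBI's PMP-protected region): the
	 * kernel must never map or touch it, so hibernation must not
	 * read or restore it either. */
	mmode_resv0@80000000 {
		reg = <0x0 0x80000000 0x0 0x40000>;
		no-map;
	};

	/* A pool the OS may reclaim when the owning driver is not using
	 * it; per the discussion above, contents here may need to be
	 * saved/restored across hibernation. */
	linux,cma {
		compatible = "shared-dma-pool";
		reusable;
		size = <0x0 0x4000000>;
	};
};
```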

> Good point! I'll propose an RFC and gather feedback from the people in
> charge of the hibernation process.

Hopefully, it's not too bad.

>
> >
> > Regards,
> > Anup
> >
> > >
> > >
> > > >> Unless there is stuff in this node that needs to be "hibernated", I
> > > >> think that would be a very good solution since we would not rely on
> > > >> the name of the "internal" nodes of "/reserved-memory" (i.e.
> > > >> "mmode_resv").
> > > >>
> > > >> I'm digging into why it is not done by default, just wanted to have
> > > >> your feedback before the weekend :)



--
Regards,
Atish