2024-04-12 14:57:29

by Björn Töpel

Subject: riscv32 EXT4 splat, 6.8 regression?

Hi!

I've been looking at an EXT4 splat on riscv32, that LKFT found [1]:

| EXT4-fs (vda): mounted filesystem 13697a42-d10e-4a9e-8e56-cb9083be92f9 ro with ordered data mode. Quota mode: disabled.
| VFS: Mounted root (ext4 filesystem) readonly on device 254:0.
| Unable to handle kernel NULL pointer dereference at virtual address 00000006
| Oops [#1]
| Modules linked in:
| CPU: 1 PID: 1 Comm: swapper/0 Not tainted 6.8.0 #41
| Hardware name: riscv-virtio,qemu (DT)
| epc : ext4_search_dir+0x52/0xe4
| ra : __ext4_find_entry+0x1d6/0x578
| epc : c035b60e ra : c035b876 sp : c253fc10
| gp : c21a7380 tp : c25c8000 t0 : 44c0657f
| t1 : 0000000c t2 : 1de5b089 s0 : c253fc50
| s1 : 00000000 a0 : fffffffc a1 : fffff000
| a2 : 00000000 a3 : c29c04f8 a4 : c253fd00
| a5 : 00000000 a6 : c253fcfc a7 : fffffff3
| s2 : 00001000 s3 : 00000000 s4 : 00001000
| s5 : c29c04f8 s6 : c292db40 s7 : c253fcfc
| s8 : fffffff7 s9 : c253fd00 s10: fffff000
| s11: c292db40 t3 : 00000007 t4 : 5e8b4525
| t5 : 00000000 t6 : 00000000
| status: 00000120 badaddr: 00000006 cause: 0000000d
| [<c035b60e>] ext4_search_dir+0x52/0xe4
| [<c035b876>] __ext4_find_entry+0x1d6/0x578
| [<c035bcaa>] ext4_lookup+0x92/0x200
| [<c0295c14>] __lookup_slow+0x8e/0x142
| [<c029943a>] walk_component+0x104/0x174
| [<c0299f18>] path_lookupat+0x78/0x182
| [<c029b24c>] filename_lookup+0x96/0x158
| [<c029b346>] kern_path+0x38/0x56
| [<c0c1bee4>] init_mount+0x46/0x96
| [<c0c2ae1c>] devtmpfs_mount+0x44/0x7a
| [<c0c01c26>] prepare_namespace+0x226/0x27c
| [<c0c01130>] kernel_init_freeable+0x27e/0x2a0
| [<c0b78402>] kernel_init+0x2a/0x158
| [<c0b82bf2>] ret_from_fork+0xe/0x20
| Code: 84ae a809 d303 0044 949a 0f63 0603 991a fd63 0584 (c603) 0064
| ---[ end trace 0000000000000000 ]---
| Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b

This was not present in 6.7. Bisection wasn't really helpful (to me at
least); I got it down to commit c604110e662a ("Merge tag 'vfs-6.8.misc'
of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs"), and when I
revert the commits in the vfs merge the splat went away, but I *really*
struggle to see how those are related...

What I see in ext4_search_dir() is that search_buf is 0xfffff000, and at
some point the address wraps to zero, and boom. I doubt that 0xfffff000
is a sane address.

Maybe this is something the fs folks can spot directly? In the
meantime I'll continue to dig...


Thanks, and have a nice weekend!
Björn


[1] https://qa-reports.linaro.org/lkft/linux-stable-rc-linux-6.8.y/build/v6.8.4-281-g6d08df6c401e/testrun/23369914/suite/log-parser-test/tests/


2024-04-12 15:44:18

by Theodore Ts'o

Subject: Re: riscv32 EXT4 splat, 6.8 regression?

On Fri, Apr 12, 2024 at 04:57:08PM +0200, Björn Töpel wrote:
> Hi!
>
> I've been looking at an EXT4 splat on riscv32, that LKFT found [1]:

I'm getting a "page not found" for [1]?

> This was not present in 6.7. Bisection wasn't really helpful (to me at
> least); I got it down to commit c604110e662a ("Merge tag 'vfs-6.8.misc'
> of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs"), and when I
> revert the commits in the vfs merge the splat went away, but I *really*
> struggle to see how those are related...

It sounds like you have a reliable repro; is it something that can be
streamlined into a simple test program? If so, is it something that
can be reproduced on other architectures? And could you make it
available?

Thanks,

- Ted

2024-04-12 16:59:33

by Björn Töpel

Subject: Re: riscv32 EXT4 splat, 6.8 regression?

"Theodore Ts'o" <[email protected]> writes:

> On Fri, Apr 12, 2024 at 04:57:08PM +0200, Björn Töpel wrote:
>> Hi!
>>
>> I've been looking at an EXT4 splat on riscv32, that LKFT found [1]:
>
> I'm getting a "page not found" for [1]?

You are? It's working for me!

>> This was not present in 6.7. Bisection wasn't really helpful (to me at
>> least); I got it down to commit c604110e662a ("Merge tag 'vfs-6.8.misc'
>> of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs"), and when I
>> revert the commits in the vfs merge the splat went away, but I *really*
>> struggle to see how those are related...
>
> It sounds like you have a reliable repro; is it something that can be
> streamlined into a simple test program? If so, is it something that
> can be reproduced on other architectures? And could you make it
> available?

It's kind of streamlined: Linaro has this nice "tuxrun" tool that can
be installed via pipx, e.g.

$ pipx install tuxrun

if you're on Debian.

Then you can get the splat by running:

$ tuxrun --runtime docker --device qemu-riscv32 --kernel https://storage.tuxsuite.com/public/linaro/lkft/builds/2esMBaAMQJpcmczj0aL94fp4QnP/Image.gz --parameters SKIPFILE=skipfile-lkft.yaml --parameters SHARD_NUMBER=10 --parameters SHARD_INDEX=1 --image docker.io/linaro/tuxrun-dispatcher:v0.66.1 --tests ltp-controllers

(--runtime knows "podman" as well)

You can pass your own kernel to --kernel, and the config for riscv32 can
be obtained here [2].

Build with "make ARCH=riscv CROSS_COMPILE=riscv64-linux-gnu-", and make
sure to have the riscv64 cross-compilation support (yes, same toolchain
for rv32!).

It happens when the rootfs is mounted, and the kernel is looking for an init.


I'll keep debugging -- this was more to see if anyone had seen it
before. I'll try to reproduce on some other 32-bit platform as well.


Thanks,
Björn

[2] https://storage.tuxsuite.com/public/linaro/lkft/builds/2esMBaAMQJpcmczj0aL94fp4QnP/config

2024-04-13 04:36:25

by Theodore Ts'o

Subject: Re: riscv32 EXT4 splat, 6.8 regression?

On Fri, Apr 12, 2024 at 06:59:19PM +0200, Björn Töpel wrote:
>
> $ pipx install tuxrun
>
> if you're on Debian.
>
> Then you can get the splat by running:
>
> $ tuxrun --runtime docker --device qemu-riscv32 --kernel https://storage.tuxsuite.com/public/linaro/lkft/builds/2esMBaAMQJpcmczj0aL94fp4QnP/Image.gz --parameters SKIPFILE=skipfile-lkft.yaml --parameters SHARD_NUMBER=10 --parameters SHARD_INDEX=1 --image docker.io/linaro/tuxrun-dispatcher:v0.66.1 --tests ltp-controllers

Yeah, what I was hoping for was a shell script or a .c file which was
the reproducer, because that way I can run the test in my test infrastructure [1].

[1] https://github.com/tytso/xfstests-bld/blob/master/Documentation/kvm-xfstests.md

I'm sure there are plenty of nice things about tuxrun, but with
kvm-xfstests I can easily get a shell so I can run the test script by
hand, perhaps with strace so I can see what is going on. Or I attach
gdb to the kernel via "gdb /path/to/vmlinux" and "target remote
localhost:7499".

I'm guessing that "ltp-controllers" means that the test might be from
the Linux Test Project? If so, that's great because I've added ltp
support to my test infrastructure (which also supports blktests,
phoronix test suite, and can be run on gce and on android devices in
addition to qemu, and on the arm64, i386, and x86_64 architectures).

> Build with "make ARCH=riscv CROSS_COMPILE=riscv64-linux-gnu-", and make
> sure to have the riscv64 cross-compilation support (yes, same toolchain
> for rv32!).
>
> It happens when the rootfs is mounted, and the kernel is looking for an init.

Hmm, so this is happening as soon as the VM starts, before actually
starting to run any tests? Is it possible for you to send me the
rootfs as a downloadable image, as opposed to my trying to paw through
the docker image?

> I'll keep debugging -- it was more if anyone had seen it before. I'll
> try to reproduce on some other 32b platform as well.

Well, it's not happening on my rootfs on i386 using my test infrastructure:

% cd /usr/projects/linux/ext4
% git checkout v6.8
% install-kconfig --arch i386
% kbuild --arch i386
% kvm-xfstests shell
...
root@kvm-xfstests:~# cd ltp
root@kvm-xfstests:~# ./runltp

(I don't have ltp support fully automated the way I can run blktests
using "kvm-xfstests --blktests" or run xfstests via "gce-xfstests -c
ext4/all -g auto". The main missing is teaching ltp to create an
junit xml results file so that the test results can be summarized and
so the test results can be more easily summarized and compared against
past runs on different kernel versions.)

Anyway, if you can send me your rootfs, I can try to take a look at it.

- Ted

2024-04-13 10:01:32

by Conor Dooley

Subject: Re: riscv32 EXT4 splat, 6.8 regression?

On Sat, Apr 13, 2024 at 12:35:42AM -0400, Theodore Ts'o wrote:
> On Fri, Apr 12, 2024 at 06:59:19PM +0200, Björn Töpel wrote:
> >
> > $ pipx install tuxrun
> >
> > if you're on Debian.
> >
> > Then you can get the splat by running:
> >
> > $ tuxrun --runtime docker --device qemu-riscv32 --kernel https://storage.tuxsuite.com/public/linaro/lkft/builds/2esMBaAMQJpcmczj0aL94fp4QnP/Image.gz --parameters SKIPFILE=skipfile-lkft.yaml --parameters SHARD_NUMBER=10 --parameters SHARD_INDEX=1 --image docker.io/linaro/tuxrun-dispatcher:v0.66.1 --tests ltp-controllers
>
> Yeah, what I was hoping for was a shell script or a .c file which was
> the reproducer, because that way I can run the test in my test infrastructure [1].
>
> [1] https://github.com/tytso/xfstests-bld/blob/master/Documentation/kvm-xfstests.md
>
> I'm sure there are plenty of nice things about tuxrun, but with
> kvm-xfstests I can easily get a shell so I can run the test script by
> hand, perhaps with strace so I can see what is going on. Or I attach
> gdb to the kernel via "gdb /path/to/vmlinux" and "target remote
> localhost:7499".
>
> I'm guessing that "ltp-controllers" means that the test might be from
> the Linux Test Project? If so, that's great because I've added ltp
> support to my test infrastructure (which also supports blktests,
> phoronix test suite, and can be run on gce and on android devices in
> addition to qemu, and on the arm64, i386, and x86_64 architectures).
>
> > Build with "make ARCH=riscv CROSS_COMPILE=riscv64-linux-gnu-", and make
> > sure to have the riscv64 cross-compilation support (yes, same toolchain
> > for rv32!).
> >
> > It happens when the rootfs is mounted, and the kernel is looking for an init.
>
> Hmm, so this is happening as soon as the VM starts, before actually
> starting to run any tests? Is it possible for you to send me the
> rootfs as a downloadable image, as opposed to my trying to paw through
> the docker image?
>
> > I'll keep debugging -- it was more if anyone had seen it before. I'll
> > try to reproduce on some other 32b platform as well.
>
> Well, it's not happening on my rootfs on i386 using my test infrastructure:
>
> % cd /usr/projects/linux/ext4
> % git checkout v6.8
> % install-kconfig --arch i386
> % kbuild --arch i386
> % kvm-xfstests shell
> ...
> root@kvm-xfstests:~# cd ltp
> root@kvm-xfstests:~# ./runltp
>
> (I don't have ltp support fully automated the way I can run blktests
> using "kvm-xfstests --blktests" or run xfstests via "gce-xfstests -c
> ext4/all -g auto". The main missing piece is teaching ltp to create a
> junit xml results file so that the test results can be more easily
> summarized and compared against past runs on different kernel versions.)
>
> Anyway, if you can send me your rootfs, I can try to take a look at it.

I think this should be the rootfs here:
https://drive.google.com/file/d/1HIo8EkAKY0xpTIIlwd9fXjRzmIdD7BUA/view?usp=sharing

I also attempted to bisect this and ended up at a slightly different
commit to Björn: 8c9440fea774 ("Merge tag 'vfs-6.8.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs")
That's a merge of 3f6984e7301f & 5bd3cf8cbc8a, both of which booted for
me. I also tried to bisect in reverse to find the fix a la syzbot, since it
is not broken in 6.9, but that's pretty error prone and I ended up down
branches based on 6.7 and was not able to find the fix.



2024-04-13 14:43:36

by Nam Cao

Subject: Re: riscv32 EXT4 splat, 6.8 regression?

On 2024-04-12 Björn Töpel wrote:
> Hi!
>
> I've been looking at an EXT4 splat on riscv32, that LKFT found [1]:
>
> | EXT4-fs (vda): mounted filesystem 13697a42-d10e-4a9e-8e56-cb9083be92f9 ro with ordered data mode. Quota mode: disabled.
> | VFS: Mounted root (ext4 filesystem) readonly on device 254:0.
> | Unable to handle kernel NULL pointer dereference at virtual address 00000006
> | Oops [#1]
> | Modules linked in:
> | CPU: 1 PID: 1 Comm: swapper/0 Not tainted 6.8.0 #41
> | Hardware name: riscv-virtio,qemu (DT)
> | epc : ext4_search_dir+0x52/0xe4
> | ra : __ext4_find_entry+0x1d6/0x578
> | epc : c035b60e ra : c035b876 sp : c253fc10
> | gp : c21a7380 tp : c25c8000 t0 : 44c0657f
> | t1 : 0000000c t2 : 1de5b089 s0 : c253fc50
> | s1 : 00000000 a0 : fffffffc a1 : fffff000
> | a2 : 00000000 a3 : c29c04f8 a4 : c253fd00
> | a5 : 00000000 a6 : c253fcfc a7 : fffffff3
> | s2 : 00001000 s3 : 00000000 s4 : 00001000
> | s5 : c29c04f8 s6 : c292db40 s7 : c253fcfc
> | s8 : fffffff7 s9 : c253fd00 s10: fffff000
> | s11: c292db40 t3 : 00000007 t4 : 5e8b4525
> | t5 : 00000000 t6 : 00000000
> | status: 00000120 badaddr: 00000006 cause: 0000000d
> | [<c035b60e>] ext4_search_dir+0x52/0xe4
> | [<c035b876>] __ext4_find_entry+0x1d6/0x578
> | [<c035bcaa>] ext4_lookup+0x92/0x200
> | [<c0295c14>] __lookup_slow+0x8e/0x142
> | [<c029943a>] walk_component+0x104/0x174
> | [<c0299f18>] path_lookupat+0x78/0x182
> | [<c029b24c>] filename_lookup+0x96/0x158
> | [<c029b346>] kern_path+0x38/0x56
> | [<c0c1bee4>] init_mount+0x46/0x96
> | [<c0c2ae1c>] devtmpfs_mount+0x44/0x7a
> | [<c0c01c26>] prepare_namespace+0x226/0x27c
> | [<c0c01130>] kernel_init_freeable+0x27e/0x2a0
> | [<c0b78402>] kernel_init+0x2a/0x158
> | [<c0b82bf2>] ret_from_fork+0xe/0x20
> | Code: 84ae a809 d303 0044 949a 0f63 0603 991a fd63 0584 (c603) 0064
> | ---[ end trace 0000000000000000 ]---
> | Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
>
> This was not present in 6.7. Bisection wasn't really helpful (to me at
> least); I got it down to commit c604110e662a ("Merge tag 'vfs-6.8.misc'
> of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs"), and when I
> revert the commits in the vfs merge the splat went away, but I *really*
> struggle to see how those are related...
>
> What I see in ext4_search_dir() is that search_buf is 0xfffff000, and at
> some point the address wraps to zero, and boom. I doubt that 0xfffff000
> is a sane address.

I have zero knowledge about file systems, but I think it's an integer
overflow problem. The calculation of "dlimit" overflows and wraps
around, which leads to wrong comparisons later on.

I guess that explains why your bisect and Conor's bisect results are
strange: the bug has been here for quite some time, but it only appears
when "dlimit" happens to overflow.

It can be fixed by re-arranging the comparisons a bit. Can you give the
patch below a try?

Best regards,
Nam

diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 05b647e6bc19..71b88b33b676 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -1532,15 +1532,13 @@ int ext4_search_dir(struct buffer_head *bh, char *search_buf, int buf_size,
 		    unsigned int offset, struct ext4_dir_entry_2 **res_dir)
 {
 	struct ext4_dir_entry_2 * de;
-	char * dlimit;
 	int de_len;
 
 	de = (struct ext4_dir_entry_2 *)search_buf;
-	dlimit = search_buf + buf_size;
-	while ((char *) de < dlimit - EXT4_BASE_DIR_LEN) {
+	while ((char *) de - search_buf < buf_size - EXT4_BASE_DIR_LEN) {
 		/* this code is executed quadratically often */
 		/* do minimal checking `by hand' */
-		if (de->name + de->name_len <= dlimit &&
+		if (de->name + de->name_len - search_buf <= buf_size &&
 		    ext4_match(dir, fname, de)) {
 			/* found a match - just to be sure, do
 			 * a full check */

2024-04-14 00:25:01

by Theodore Ts'o

Subject: Re: riscv32 EXT4 splat, 6.8 regression?

On Sat, Apr 13, 2024 at 04:43:18PM +0200, Nam Cao wrote:
>
> I have zero knowledge about file system, but I think it's an integer
> overflow problem. The calculation of "dlimit" overflow and dlimit wraps
> around, this leads to wrong comparison later on.
>
> I guess that explains why your bisect and Conor's bisect results are
> strange: the bug has been here for quite some time, but it only appears
> when "dlimit" happens to overflow.

So the problem with that theory is that for it to be the case,
buf_size would have to be invalid, and it's unclear how that could have
happened. We can try to test that theory by putting something like
this at the beginning of ext4_search_dir():

	if (buf_size < 0 || buf_size > dir->i_sb->s_blocksize) {
		/* should never happen */
		EXT4_ERROR_INODE(dir, "insane buf_size %d", buf_size);
		WARN_ON(1);
		return -EFSCORRUPTED;
	}

Just to confirm, this file system is not one that has been fuzzed or
is being dynamically modified while mounted, right? Even if that were
the case, looking at the stack trace, I don't see how this could have
happened. (I could imagine some scenario involving inline directories
and fuzzed or dynamically modified file systems being a potential
problem, or at least one that would involve much more careful code
review, since that code is not as battle-tested as other parts of
ext4; but the stack trace reported at the beginning of this thread
doesn't seem to indicate that inline directories were involved.)

- Ted

2024-04-14 02:05:51

by Theodore Ts'o

Subject: Re: riscv32 EXT4 splat, 6.8 regression?

On Sat, Apr 13, 2024 at 07:46:03PM -0600, Andreas Dilger wrote:
> This looks like a straight-forward mathematical substitution of "dlimit"
> with "search_buf + buf_size" and rearranging of the terms to make the
> while loop offset "zero based" rather than "address based" and would
> avoid overflow if "search_buf" was within one 4kB block of overflow:
>
> dlimit = search_buf + buf_size = 0xfffff000 + 0x1000 = 0x00000000

Umm... maybe, but does riscv32 actually have a memory map where a
kernel page would have an address in high memory like that? That
seems... unusual.

If we have a reliable reproduction, can someone actually printk the
address or test to see if this theory is correct?

- Ted

2024-04-14 02:16:14

by Al Viro

Subject: Re: riscv32 EXT4 splat, 6.8 regression?

On Sat, Apr 13, 2024 at 07:46:03PM -0600, Andreas Dilger wrote:

> As to whether the 0xfffff000 address itself is valid for riscv32 is
> outside my realm, but given that RAM is cheap it doesn't seem unlikely
> to have 4GB+ of RAM and want to use it all. The riscv32 might consider
> reserving this page address from allocation to avoid similar issues in
> other parts of the code, as is done with the NULL/0 page address.

Not a chance. *Any* page mapped there is a serious bug on any 32bit
box. Recall what ERR_PTR() is...

On any architecture the virtual addresses in range (unsigned long)-512..
(unsigned long)-1 must never resolve to valid kernel objects.
In other words, any kind of wraparound here is asking for an oops on
attempts to access the elements of buffer - kernel dereference of
(char *)0xfffff000 on a 32bit box is already a bug.

It might be getting an invalid pointer, but arithmetical overflows
are irrelevant.

2024-04-14 04:16:23

by Andreas Dilger

Subject: Re: riscv32 EXT4 splat, 6.8 regression?

On Apr 13, 2024, at 8:15 PM, Al Viro <[email protected]> wrote:
>
> On Sat, Apr 13, 2024 at 07:46:03PM -0600, Andreas Dilger wrote:
>
>> As to whether the 0xfffff000 address itself is valid for riscv32 is
>> outside my realm, but given that RAM is cheap it doesn't seem unlikely
>> to have 4GB+ of RAM and want to use it all. The riscv32 might consider
>> reserving this page address from allocation to avoid similar issues in
>> other parts of the code, as is done with the NULL/0 page address.
>
> Not a chance. *Any* page mapped there is a serious bug on any 32bit
> box. Recall what ERR_PTR() is...
>
> On any architecture the virtual addresses in range (unsigned long)-512..
> (unsigned long)-1 must never resolve to valid kernel objects.
> In other words, any kind of wraparound here is asking for an oops on
> attempts to access the elements of buffer - kernel dereference of
> (char *)0xfffff000 on a 32bit box is already a bug.
>
> It might be getting an invalid pointer, but arithmetical overflows
> are irrelevant.

The original bug report stated that search_buf = 0xfffff000 on entry,
and I'd quoted that at the start of my email:

On Apr 12, 2024, at 8:57 AM, Björn Töpel <[email protected]> wrote:
> What I see in ext4_search_dir() is that search_buf is 0xfffff000, and at
> some point the address wraps to zero, and boom. I doubt that 0xfffff000
> is a sane address.

Now that you mention ERR_PTR() it definitely makes sense that this last
page HAS to be excluded.

So some other bug is passing the bad pointer to this code before this
error, or the arch is not correctly excluding this page from allocation.

Cheers, Andreas







2024-04-14 14:09:52

by Björn Töpel

Subject: Re: riscv32 EXT4 splat, 6.8 regression?

Andreas Dilger <[email protected]> writes:

> On Apr 13, 2024, at 8:15 PM, Al Viro <[email protected]> wrote:
>>
>> On Sat, Apr 13, 2024 at 07:46:03PM -0600, Andreas Dilger wrote:
>>
>>> As to whether the 0xfffff000 address itself is valid for riscv32 is
>>> outside my realm, but given that RAM is cheap it doesn't seem unlikely
>>> to have 4GB+ of RAM and want to use it all. The riscv32 might consider
>>> reserving this page address from allocation to avoid similar issues in
>>> other parts of the code, as is done with the NULL/0 page address.
>>
>> Not a chance. *Any* page mapped there is a serious bug on any 32bit
>> box. Recall what ERR_PTR() is...
>>
>> On any architecture the virtual addresses in range (unsigned long)-512..
>> (unsigned long)-1 must never resolve to valid kernel objects.
>> In other words, any kind of wraparound here is asking for an oops on
>> attempts to access the elements of buffer - kernel dereference of
>> (char *)0xfffff000 on a 32bit box is already a bug.
>>
>> It might be getting an invalid pointer, but arithmetical overflows
>> are irrelevant.
>
> The original bug report stated that search_buf = 0xfffff000 on entry,
> and I'd quoted that at the start of my email:
>
> On Apr 12, 2024, at 8:57 AM, Björn Töpel <[email protected]> wrote:
>> What I see in ext4_search_dir() is that search_buf is 0xfffff000, and at
>> some point the address wraps to zero, and boom. I doubt that 0xfffff000
>> is a sane address.
>
> Now that you mention ERR_PTR() it definitely makes sense that this last
> page HAS to be excluded.
>
> So some other bug is passing the bad pointer to this code before this
> error, or the arch is not correctly excluding this page from allocation.

Yeah, something is off for sure.

(FWIW, I managed to hit this on Linus' master as well.)

I added a print (close to trace_mm_filemap_add_to_page_cache()), and for
this BT:

[<c01e8b34>] __filemap_add_folio+0x322/0x508
[<c01e8d6e>] filemap_add_folio+0x54/0xce
[<c01ea076>] __filemap_get_folio+0x156/0x2aa
[<c02df346>] __getblk_slow+0xcc/0x302
[<c02df5f2>] bdev_getblk+0x76/0x7a
[<c03519da>] ext4_getblk+0xbc/0x2c4
[<c0351cc2>] ext4_bread_batch+0x56/0x186
[<c036bcaa>] __ext4_find_entry+0x156/0x578
[<c036c152>] ext4_lookup+0x86/0x1f4
[<c02a3252>] __lookup_slow+0x8e/0x142
[<c02a6d70>] walk_component+0x104/0x174
[<c02a793c>] path_lookupat+0x78/0x182
[<c02a8c7c>] filename_lookup+0x96/0x158
[<c02a8d76>] kern_path+0x38/0x56
[<c0c1cb7a>] init_mount+0x5c/0xac
[<c0c2ba4c>] devtmpfs_mount+0x44/0x7a
[<c0c01cce>] prepare_namespace+0x226/0x27c
[<c0c011c6>] kernel_init_freeable+0x286/0x2a8
[<c0b97ab8>] kernel_init+0x2a/0x156
[<c0ba22ca>] ret_from_fork+0xe/0x20

I get a folio where folio_address(folio) == 0xfffff000 (which is
broken).

Need to go into the weeds here...


Björn

2024-04-15 13:21:27

by Christian Brauner

Subject: Re: riscv32 EXT4 splat, 6.8 regression?

On Sun, Apr 14, 2024 at 04:08:11PM +0200, Björn Töpel wrote:
> Andreas Dilger <[email protected]> writes:
>
> > On Apr 13, 2024, at 8:15 PM, Al Viro <[email protected]> wrote:
> >>
> >> On Sat, Apr 13, 2024 at 07:46:03PM -0600, Andreas Dilger wrote:
> >>
> >>> As to whether the 0xfffff000 address itself is valid for riscv32 is
> >>> outside my realm, but given that RAM is cheap it doesn't seem unlikely
> >>> to have 4GB+ of RAM and want to use it all. The riscv32 might consider
> >>> reserving this page address from allocation to avoid similar issues in
> >>> other parts of the code, as is done with the NULL/0 page address.
> >>
> >> Not a chance. *Any* page mapped there is a serious bug on any 32bit
> >> box. Recall what ERR_PTR() is...
> >>
> >> On any architecture the virtual addresses in range (unsigned long)-512..
> >> (unsigned long)-1 must never resolve to valid kernel objects.
> >> In other words, any kind of wraparound here is asking for an oops on
> >> attempts to access the elements of buffer - kernel dereference of
> >> (char *)0xfffff000 on a 32bit box is already a bug.
> >>
> >> It might be getting an invalid pointer, but arithmetical overflows
> >> are irrelevant.
> >
> > The original bug report stated that search_buf = 0xfffff000 on entry,
> > and I'd quoted that at the start of my email:
> >
> > On Apr 12, 2024, at 8:57 AM, Björn Töpel <[email protected]> wrote:
> >> What I see in ext4_search_dir() is that search_buf is 0xfffff000, and at
> >> some point the address wraps to zero, and boom. I doubt that 0xfffff000
> >> is a sane address.
> >
> > Now that you mention ERR_PTR() it definitely makes sense that this last
> > page HAS to be excluded.
> >
> > So some other bug is passing the bad pointer to this code before this
> > error, or the arch is not correctly excluding this page from allocation.
>
> Yeah, something is off for sure.
>
> (FWIW, I manage to hit this for Linus' master as well.)
>
> I added a print (close to trace_mm_filemap_add_to_page_cache()), and for
> this BT:
>
> [<c01e8b34>] __filemap_add_folio+0x322/0x508
> [<c01e8d6e>] filemap_add_folio+0x54/0xce
> [<c01ea076>] __filemap_get_folio+0x156/0x2aa
> [<c02df346>] __getblk_slow+0xcc/0x302
> [<c02df5f2>] bdev_getblk+0x76/0x7a
> [<c03519da>] ext4_getblk+0xbc/0x2c4
> [<c0351cc2>] ext4_bread_batch+0x56/0x186
> [<c036bcaa>] __ext4_find_entry+0x156/0x578
> [<c036c152>] ext4_lookup+0x86/0x1f4
> [<c02a3252>] __lookup_slow+0x8e/0x142
> [<c02a6d70>] walk_component+0x104/0x174
> [<c02a793c>] path_lookupat+0x78/0x182
> [<c02a8c7c>] filename_lookup+0x96/0x158
> [<c02a8d76>] kern_path+0x38/0x56
> [<c0c1cb7a>] init_mount+0x5c/0xac
> [<c0c2ba4c>] devtmpfs_mount+0x44/0x7a
> [<c0c01cce>] prepare_namespace+0x226/0x27c
> [<c0c011c6>] kernel_init_freeable+0x286/0x2a8
> [<c0b97ab8>] kernel_init+0x2a/0x156
> [<c0ba22ca>] ret_from_fork+0xe/0x20
>
> I get a folio where folio_address(folio) == 0xfffff000 (which is
> broken).
>
> Need to go into the weeds here...

I don't see anything obvious that could explain this right away. Did you
manage to reproduce this on any other architecture and/or filesystem?

Fwiw, iirc there were a bunch of fs/buffer.c changes that came in
through the mm/ layer between v6.7 and v6.8 that might also be
interesting. But really I'm poking in the dark currently.

2024-04-15 16:05:06

by Björn Töpel

Subject: Re: riscv32 EXT4 splat, 6.8 regression?

Christian Brauner <[email protected]> writes:

> On Sun, Apr 14, 2024 at 04:08:11PM +0200, Björn Töpel wrote:
>> Andreas Dilger <[email protected]> writes:
>>
>> > On Apr 13, 2024, at 8:15 PM, Al Viro <[email protected]> wrote:
>> >>
>> >> On Sat, Apr 13, 2024 at 07:46:03PM -0600, Andreas Dilger wrote:
>> >>
>> >>> As to whether the 0xfffff000 address itself is valid for riscv32 is
>> >>> outside my realm, but given that RAM is cheap it doesn't seem unlikely
>> >>> to have 4GB+ of RAM and want to use it all. The riscv32 might consider
>> >>> reserving this page address from allocation to avoid similar issues in
>> >>> other parts of the code, as is done with the NULL/0 page address.
>> >>
>> >> Not a chance. *Any* page mapped there is a serious bug on any 32bit
>> >> box. Recall what ERR_PTR() is...
>> >>
>> >> On any architecture the virtual addresses in range (unsigned long)-512..
>> >> (unsigned long)-1 must never resolve to valid kernel objects.
>> >> In other words, any kind of wraparound here is asking for an oops on
>> >> attempts to access the elements of buffer - kernel dereference of
>> >> (char *)0xfffff000 on a 32bit box is already a bug.
>> >>
>> >> It might be getting an invalid pointer, but arithmetical overflows
>> >> are irrelevant.
>> >
>> > The original bug report stated that search_buf = 0xfffff000 on entry,
>> > and I'd quoted that at the start of my email:
>> >
>> > On Apr 12, 2024, at 8:57 AM, Björn Töpel <[email protected]> wrote:
>> >> What I see in ext4_search_dir() is that search_buf is 0xfffff000, and at
>> >> some point the address wraps to zero, and boom. I doubt that 0xfffff000
>> >> is a sane address.
>> >
>> > Now that you mention ERR_PTR() it definitely makes sense that this last
>> > page HAS to be excluded.
>> >
>> > So some other bug is passing the bad pointer to this code before this
>> > error, or the arch is not correctly excluding this page from allocation.
>>
>> Yeah, something is off for sure.
>>
>> (FWIW, I manage to hit this for Linus' master as well.)
>>
>> I added a print (close to trace_mm_filemap_add_to_page_cache()), and for
>> this BT:
>>
>> [<c01e8b34>] __filemap_add_folio+0x322/0x508
>> [<c01e8d6e>] filemap_add_folio+0x54/0xce
>> [<c01ea076>] __filemap_get_folio+0x156/0x2aa
>> [<c02df346>] __getblk_slow+0xcc/0x302
>> [<c02df5f2>] bdev_getblk+0x76/0x7a
>> [<c03519da>] ext4_getblk+0xbc/0x2c4
>> [<c0351cc2>] ext4_bread_batch+0x56/0x186
>> [<c036bcaa>] __ext4_find_entry+0x156/0x578
>> [<c036c152>] ext4_lookup+0x86/0x1f4
>> [<c02a3252>] __lookup_slow+0x8e/0x142
>> [<c02a6d70>] walk_component+0x104/0x174
>> [<c02a793c>] path_lookupat+0x78/0x182
>> [<c02a8c7c>] filename_lookup+0x96/0x158
>> [<c02a8d76>] kern_path+0x38/0x56
>> [<c0c1cb7a>] init_mount+0x5c/0xac
>> [<c0c2ba4c>] devtmpfs_mount+0x44/0x7a
>> [<c0c01cce>] prepare_namespace+0x226/0x27c
>> [<c0c011c6>] kernel_init_freeable+0x286/0x2a8
>> [<c0b97ab8>] kernel_init+0x2a/0x156
>> [<c0ba22ca>] ret_from_fork+0xe/0x20
>>
>> I get a folio where folio_address(folio) == 0xfffff000 (which is
>> broken).
>>
>> Need to go into the weeds here...
>
> I don't see anything obvious that could explain this right away. Did you
> manage to reproduce this on any other architecture and/or filesystem?
>
> Fwiw, iirc there were a bunch of fs/buffer.c changes that came in
> through the mm/ layer between v6.7 and v6.8 that might also be
> interesting. But really I'm poking in the dark currently.

Thanks for getting back! Spent some more time on it today.

It seems that the buddy allocator *can* return a page with a VA that can
wrap (0xfffff000 -- pointed out by Nam and myself).

Further, it seems like riscv32 indeed inserts a page like that into the
buddy allocator when the memblock is freed:

| [<c024961c>] __free_one_page+0x2a4/0x3ea
| [<c024a448>] __free_pages_ok+0x158/0x3cc
| [<c024b1a4>] __free_pages_core+0xe8/0x12c
| [<c0c1435a>] memblock_free_pages+0x1a/0x22
| [<c0c17676>] memblock_free_all+0x1ee/0x278
| [<c0c050b0>] mem_init+0x10/0xa4
| [<c0c1447c>] mm_core_init+0x11a/0x2da
| [<c0c00bb6>] start_kernel+0x3c4/0x6de

Here, a page with VA 0xfffff000 is added to the freelist. We were just
lucky (unlucky?) that the page was used for the page cache.

A nasty patch like:
--8<--
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 549e76af8f82..a6a6abbe71b0 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -2566,6 +2566,9 @@ void __init set_dma_reserve(unsigned long new_dma_reserve)
 void __init memblock_free_pages(struct page *page, unsigned long pfn,
 				unsigned int order)
 {
+	if ((long)page_address(page) == 0xfffff000L) {
+		return; // leak it
+	}
 
 	if (IS_ENABLED(CONFIG_DEFERRED_STRUCT_PAGE_INIT)) {
 		int nid = early_pfn_to_nid(pfn);
--8<--

...and it's gone.

I need to think more about what a proper fix is. Regardless, Christian,
Al, and Ted can all relax. ;-)


Björn

2024-04-16 06:44:32

by Nam Cao

[permalink] [raw]
Subject: Re: riscv32 EXT4 splat, 6.8 regression?

On 2024-04-15 Björn Töpel wrote:
> Thanks for getting back! Spent some more time one it today.
>
> It seems that the buddy allocator *can* return a page with a VA that can
> wrap (0xfffff000 -- pointed out by Nam and myself).
>
> Further, it seems like riscv32 indeed inserts a page like that to the
> buddy allocator, when the memblock is free'd:
>
> | [<c024961c>] __free_one_page+0x2a4/0x3ea
> | [<c024a448>] __free_pages_ok+0x158/0x3cc
> | [<c024b1a4>] __free_pages_core+0xe8/0x12c
> | [<c0c1435a>] memblock_free_pages+0x1a/0x22
> | [<c0c17676>] memblock_free_all+0x1ee/0x278
> | [<c0c050b0>] mem_init+0x10/0xa4
> | [<c0c1447c>] mm_core_init+0x11a/0x2da
> | [<c0c00bb6>] start_kernel+0x3c4/0x6de
>
> Here, a page with VA 0xfffff000 is a added to the freelist. We were just
> lucky (unlucky?) that page was used for the page cache.

I just educated myself about memory mapping last night, so the below
may be complete nonsense. Take it with a grain of salt.

In riscv's setup_bootmem(), we have this line:
max_low_pfn = max_pfn = PFN_DOWN(phys_ram_end);

I think this is the root cause: max_low_pfn indicates the last page
to be mapped. The problem is that nothing prevents PFN_DOWN(phys_ram_end)
from corresponding to the last virtual page (0xfffff000). If max_low_pfn
lands on that page, we get the reported problem.

There seems to be some code to make sure the last page is not used
(the call to memblock_set_current_limit() right above this line). It is
unclear to me why this still lets the problem slip through.

The fix is simple: never let max_low_pfn get mapped to the last page.
The patch below fixes the problem for me, but I am not entirely sure
this is the correct fix; further investigation is needed.

Best regards,
Nam

diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index fa34cf55037b..17cab0a52726 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -251,7 +251,8 @@ static void __init setup_bootmem(void)
}

min_low_pfn = PFN_UP(phys_ram_base);
- max_low_pfn = max_pfn = PFN_DOWN(phys_ram_end);
+ max_low_pfn = PFN_DOWN(memblock_get_current_limit());
+ max_pfn = PFN_DOWN(phys_ram_end);
high_memory = (void *)(__va(PFN_PHYS(max_low_pfn)));

dma32_phys_limit = min(4UL * SZ_1G, (unsigned long)PFN_PHYS(max_low_pfn));

2024-04-16 08:26:16

by Christian Brauner

[permalink] [raw]
Subject: Re: riscv32 EXT4 splat, 6.8 regression?

[Adding Mike who's knowledgeable in this area]

On Mon, Apr 15, 2024 at 06:04:50PM +0200, Björn Töpel wrote:
> Christian Brauner <[email protected]> writes:
>
> > On Sun, Apr 14, 2024 at 04:08:11PM +0200, Björn Töpel wrote:
> >> Andreas Dilger <[email protected]> writes:
> >>
> >> > On Apr 13, 2024, at 8:15 PM, Al Viro <[email protected]> wrote:
> >> >>
> >> >> On Sat, Apr 13, 2024 at 07:46:03PM -0600, Andreas Dilger wrote:
> >> >>
> >> >>> As to whether the 0xfffff000 address itself is valid for riscv32 is
> >> >>> outside my realm, but given that RAM is cheap it doesn't seem unlikely
> >> >>> to have 4GB+ of RAM and want to use it all. The riscv32 might consider
> >> >>> reserving this page address from allocation to avoid similar issues in
> >> >>> other parts of the code, as is done with the NULL/0 page address.
> >> >>
> >> >> Not a chance. *Any* page mapped there is a serious bug on any 32bit
> >> >> box. Recall what ERR_PTR() is...
> >> >>
> >> >> On any architecture the virtual addresses in range (unsigned long)-512..
> >> >> (unsigned long)-1 must never resolve to valid kernel objects.
> >> >> In other words, any kind of wraparound here is asking for an oops on
> >> >> attempts to access the elements of buffer - kernel dereference of
> >> >> (char *)0xfffff000 on a 32bit box is already a bug.
> >> >>
> >> >> It might be getting an invalid pointer, but arithmetical overflows
> >> >> are irrelevant.
> >> >
> >> > The original bug report stated that search_buf = 0xfffff000 on entry,
> >> > and I'd quoted that at the start of my email:
> >> >
> >> > On Apr 12, 2024, at 8:57 AM, Björn Töpel <[email protected]> wrote:
> >> >> What I see in ext4_search_dir() is that search_buf is 0xfffff000, and at
> >> >> some point the address wraps to zero, and boom. I doubt that 0xfffff000
> >> >> is a sane address.
> >> >
> >> > Now that you mention ERR_PTR() it definitely makes sense that this last
> >> > page HAS to be excluded.
> >> >
> >> > So some other bug is passing the bad pointer to this code before this
> >> > error, or the arch is not correctly excluding this page from allocation.
> >>
> >> Yeah, something is off for sure.
> >>
> >> (FWIW, I manage to hit this for Linus' master as well.)
> >>
> >> I added a print (close to trace_mm_filemap_add_to_page_cache()), and for
> >> this BT:
> >>
> >> [<c01e8b34>] __filemap_add_folio+0x322/0x508
> >> [<c01e8d6e>] filemap_add_folio+0x54/0xce
> >> [<c01ea076>] __filemap_get_folio+0x156/0x2aa
> >> [<c02df346>] __getblk_slow+0xcc/0x302
> >> [<c02df5f2>] bdev_getblk+0x76/0x7a
> >> [<c03519da>] ext4_getblk+0xbc/0x2c4
> >> [<c0351cc2>] ext4_bread_batch+0x56/0x186
> >> [<c036bcaa>] __ext4_find_entry+0x156/0x578
> >> [<c036c152>] ext4_lookup+0x86/0x1f4
> >> [<c02a3252>] __lookup_slow+0x8e/0x142
> >> [<c02a6d70>] walk_component+0x104/0x174
> >> [<c02a793c>] path_lookupat+0x78/0x182
> >> [<c02a8c7c>] filename_lookup+0x96/0x158
> >> [<c02a8d76>] kern_path+0x38/0x56
> >> [<c0c1cb7a>] init_mount+0x5c/0xac
> >> [<c0c2ba4c>] devtmpfs_mount+0x44/0x7a
> >> [<c0c01cce>] prepare_namespace+0x226/0x27c
> >> [<c0c011c6>] kernel_init_freeable+0x286/0x2a8
> >> [<c0b97ab8>] kernel_init+0x2a/0x156
> >> [<c0ba22ca>] ret_from_fork+0xe/0x20
> >>
> >> I get a folio where folio_address(folio) == 0xfffff000 (which is
> >> broken).
> >>
> >> Need to go into the weeds here...
> >
> > I don't see anything obvious that could explain this right away. Did you
> > manage to reproduce this on any other architecture and/or filesystem?
> >
> > Fwiw, iirc there were a bunch of fs/buffer.c changes that came in
> > through the mm/ layer between v6.7 and v6.8 that might also be
> > interesting. But really I'm poking in the dark currently.
>
> Thanks for getting back! Spent some more time one it today.
>
> It seems that the buddy allocator *can* return a page with a VA that can
> wrap (0xfffff000 -- pointed out by Nam and myself).
>
> Further, it seems like riscv32 indeed inserts a page like that to the
> buddy allocator, when the memblock is free'd:
>
> | [<c024961c>] __free_one_page+0x2a4/0x3ea
> | [<c024a448>] __free_pages_ok+0x158/0x3cc
> | [<c024b1a4>] __free_pages_core+0xe8/0x12c
> | [<c0c1435a>] memblock_free_pages+0x1a/0x22
> | [<c0c17676>] memblock_free_all+0x1ee/0x278
> | [<c0c050b0>] mem_init+0x10/0xa4
> | [<c0c1447c>] mm_core_init+0x11a/0x2da
> | [<c0c00bb6>] start_kernel+0x3c4/0x6de
>
> Here, a page with VA 0xfffff000 is a added to the freelist. We were just
> lucky (unlucky?) that page was used for the page cache.
>
> A nasty patch like:
> --8<--
> diff --git a/mm/mm_init.c b/mm/mm_init.c
> index 549e76af8f82..a6a6abbe71b0 100644
> --- a/mm/mm_init.c
> +++ b/mm/mm_init.c
> @@ -2566,6 +2566,9 @@ void __init set_dma_reserve(unsigned long new_dma_reserve)
> void __init memblock_free_pages(struct page *page, unsigned long pfn,
> unsigned int order)
> {
> + if ((long)page_address(page) == 0xfffff000L) {
> + return; // leak it
> + }
>
> if (IS_ENABLED(CONFIG_DEFERRED_STRUCT_PAGE_INIT)) {
> int nid = early_pfn_to_nid(pfn);
> --8<--
>
> ...and it's gone.
>
> I need to think more about what a proper fix is. Regardless; Christian,
> Al, and Ted can all relax. ;-)
>
>
> Björn

On Tue, Apr 16, 2024 at 08:44:17AM +0200, Nam Cao wrote:
> On 2024-04-15 Björn Töpel wrote:
> > Thanks for getting back! Spent some more time one it today.
> >
> > It seems that the buddy allocator *can* return a page with a VA that can
> > wrap (0xfffff000 -- pointed out by Nam and myself).
> >
> > Further, it seems like riscv32 indeed inserts a page like that to the
> > buddy allocator, when the memblock is free'd:
> >
> > | [<c024961c>] __free_one_page+0x2a4/0x3ea
> > | [<c024a448>] __free_pages_ok+0x158/0x3cc
> > | [<c024b1a4>] __free_pages_core+0xe8/0x12c
> > | [<c0c1435a>] memblock_free_pages+0x1a/0x22
> > | [<c0c17676>] memblock_free_all+0x1ee/0x278
> > | [<c0c050b0>] mem_init+0x10/0xa4
> > | [<c0c1447c>] mm_core_init+0x11a/0x2da
> > | [<c0c00bb6>] start_kernel+0x3c4/0x6de
> >
> > Here, a page with VA 0xfffff000 is a added to the freelist. We were just
> > lucky (unlucky?) that page was used for the page cache.
>
> I just educated myself about memory mapping last night, so the below
> may be complete nonsense. Take it with a grain of salt.
>
> In riscv's setup_bootmem(), we have this line:
> max_low_pfn = max_pfn = PFN_DOWN(phys_ram_end);
>
> I think this is the root cause: max_low_pfn indicates the last page
> to be mapped. Problem is: nothing prevents PFN_DOWN(phys_ram_end) from
> getting mapped to the last page (0xfffff000). If max_low_pfn is mapped
> to the last page, we get the reported problem.
>
> There seems to be some code to make sure the last page is not used
> (the call to memblock_set_current_limit() right above this line). It is
> unclear to me why this still lets the problem slip through.
>
> The fix is simple: never let max_low_pfn gets mapped to the last page.
> The below patch fixes the problem for me. But I am not entirely sure if
> this is the correct fix, further investigation needed.
>
> Best regards,
> Nam
>
> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> index fa34cf55037b..17cab0a52726 100644
> --- a/arch/riscv/mm/init.c
> +++ b/arch/riscv/mm/init.c
> @@ -251,7 +251,8 @@ static void __init setup_bootmem(void)
> }
>
> min_low_pfn = PFN_UP(phys_ram_base);
> - max_low_pfn = max_pfn = PFN_DOWN(phys_ram_end);
> + max_low_pfn = PFN_DOWN(memblock_get_current_limit());
> + max_pfn = PFN_DOWN(phys_ram_end);
> high_memory = (void *)(__va(PFN_PHYS(max_low_pfn)));
>
> dma32_phys_limit = min(4UL * SZ_1G, (unsigned long)PFN_PHYS(max_low_pfn));

2024-04-16 11:02:34

by Björn Töpel

[permalink] [raw]
Subject: Re: riscv32 EXT4 splat, 6.8 regression?

Christian Brauner <[email protected]> writes:

> [Adding Mike who's knowledgeable in this area]

>> > Further, it seems like riscv32 indeed inserts a page like that to the
>> > buddy allocator, when the memblock is free'd:
>> >
>> > | [<c024961c>] __free_one_page+0x2a4/0x3ea
>> > | [<c024a448>] __free_pages_ok+0x158/0x3cc
>> > | [<c024b1a4>] __free_pages_core+0xe8/0x12c
>> > | [<c0c1435a>] memblock_free_pages+0x1a/0x22
>> > | [<c0c17676>] memblock_free_all+0x1ee/0x278
>> > | [<c0c050b0>] mem_init+0x10/0xa4
>> > | [<c0c1447c>] mm_core_init+0x11a/0x2da
>> > | [<c0c00bb6>] start_kernel+0x3c4/0x6de
>> >
>> > Here, a page with VA 0xfffff000 is a added to the freelist. We were just
>> > lucky (unlucky?) that page was used for the page cache.
>>
>> I just educated myself about memory mapping last night, so the below
>> may be complete nonsense. Take it with a grain of salt.
>>
>> In riscv's setup_bootmem(), we have this line:
>> max_low_pfn = max_pfn = PFN_DOWN(phys_ram_end);
>>
>> I think this is the root cause: max_low_pfn indicates the last page
>> to be mapped. Problem is: nothing prevents PFN_DOWN(phys_ram_end) from
>> getting mapped to the last page (0xfffff000). If max_low_pfn is mapped
>> to the last page, we get the reported problem.
>>
>> There seems to be some code to make sure the last page is not used
>> (the call to memblock_set_current_limit() right above this line). It is
>> unclear to me why this still lets the problem slip through.
>>
>> The fix is simple: never let max_low_pfn gets mapped to the last page.
>> The below patch fixes the problem for me. But I am not entirely sure if
>> this is the correct fix, further investigation needed.
>>
>> Best regards,
>> Nam
>>
>> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
>> index fa34cf55037b..17cab0a52726 100644
>> --- a/arch/riscv/mm/init.c
>> +++ b/arch/riscv/mm/init.c
>> @@ -251,7 +251,8 @@ static void __init setup_bootmem(void)
>> }
>>
>> min_low_pfn = PFN_UP(phys_ram_base);
>> - max_low_pfn = max_pfn = PFN_DOWN(phys_ram_end);
>> + max_low_pfn = PFN_DOWN(memblock_get_current_limit());
>> + max_pfn = PFN_DOWN(phys_ram_end);
>> high_memory = (void *)(__va(PFN_PHYS(max_low_pfn)));
>>
>> dma32_phys_limit = min(4UL * SZ_1G, (unsigned long)PFN_PHYS(max_low_pfn));

Yeah, AFAIU memblock_set_current_limit() only limits allocations from
memblock. The "forbidden" page (PA 0xc03ff000, VA 0xfffff000) will
still be allowed into the zone.

I think your patch requires that memblock_set_current_limit() be
called unconditionally, which is currently not the case.

The hack I tried was (which seems to work):

--
diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index fe8e159394d8..3a1f25d41794 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -245,8 +245,10 @@ static void __init setup_bootmem(void)
*/
if (!IS_ENABLED(CONFIG_64BIT)) {
max_mapped_addr = __pa(~(ulong)0);
- if (max_mapped_addr == (phys_ram_end - 1))
+ if (max_mapped_addr == (phys_ram_end - 1)) {
memblock_set_current_limit(max_mapped_addr - 4096);
+ phys_ram_end -= 4096;
+ }
}

min_low_pfn = PFN_UP(phys_ram_base);
--

I'd really like to see an actual MM person (Mike or Alex?) have some
input here, and not simply my pasta-on-wall approach. ;-)


Björn

2024-04-16 14:29:19

by Mike Rapoport

[permalink] [raw]
Subject: Re: riscv32 EXT4 splat, 6.8 regression?

Hi,

On Tue, Apr 16, 2024 at 01:02:20PM +0200, Björn Töpel wrote:
> Christian Brauner <[email protected]> writes:
>
> > [Adding Mike who's knowledgeable in this area]
>
> >> > Further, it seems like riscv32 indeed inserts a page like that to the
> >> > buddy allocator, when the memblock is free'd:
> >> >
> >> > | [<c024961c>] __free_one_page+0x2a4/0x3ea
> >> > | [<c024a448>] __free_pages_ok+0x158/0x3cc
> >> > | [<c024b1a4>] __free_pages_core+0xe8/0x12c
> >> > | [<c0c1435a>] memblock_free_pages+0x1a/0x22
> >> > | [<c0c17676>] memblock_free_all+0x1ee/0x278
> >> > | [<c0c050b0>] mem_init+0x10/0xa4
> >> > | [<c0c1447c>] mm_core_init+0x11a/0x2da
> >> > | [<c0c00bb6>] start_kernel+0x3c4/0x6de
> >> >
> >> > Here, a page with VA 0xfffff000 is a added to the freelist. We were just
> >> > lucky (unlucky?) that page was used for the page cache.
> >>
> >> I just educated myself about memory mapping last night, so the below
> >> may be complete nonsense. Take it with a grain of salt.
> >>
> >> In riscv's setup_bootmem(), we have this line:
> >> max_low_pfn = max_pfn = PFN_DOWN(phys_ram_end);
> >>
> >> I think this is the root cause: max_low_pfn indicates the last page
> >> to be mapped. Problem is: nothing prevents PFN_DOWN(phys_ram_end) from
> >> getting mapped to the last page (0xfffff000). If max_low_pfn is mapped
> >> to the last page, we get the reported problem.
> >>
> >> There seems to be some code to make sure the last page is not used
> >> (the call to memblock_set_current_limit() right above this line). It is
> >> unclear to me why this still lets the problem slip through.
> >>
> >> The fix is simple: never let max_low_pfn gets mapped to the last page.
> >> The below patch fixes the problem for me. But I am not entirely sure if
> >> this is the correct fix, further investigation needed.
> >>
> >> Best regards,
> >> Nam
> >>
> >> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> >> index fa34cf55037b..17cab0a52726 100644
> >> --- a/arch/riscv/mm/init.c
> >> +++ b/arch/riscv/mm/init.c
> >> @@ -251,7 +251,8 @@ static void __init setup_bootmem(void)
> >> }
> >>
> >> min_low_pfn = PFN_UP(phys_ram_base);
> >> - max_low_pfn = max_pfn = PFN_DOWN(phys_ram_end);
> >> + max_low_pfn = PFN_DOWN(memblock_get_current_limit());
> >> + max_pfn = PFN_DOWN(phys_ram_end);
> >> high_memory = (void *)(__va(PFN_PHYS(max_low_pfn)));
> >>
> >> dma32_phys_limit = min(4UL * SZ_1G, (unsigned long)PFN_PHYS(max_low_pfn));
>
> Yeah, AFAIU memblock_set_current_limit() only limits the allocation from
> memblock. The "forbidden" page (PA 0xc03ff000 VA 0xfffff000) will still
> be allowed in the zone.
>
> I think your patch requires memblock_set_current_limit() is
> unconditionally called, which currently is not done.
>
> The hack I tried was (which seems to work):
>
> --
> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> index fe8e159394d8..3a1f25d41794 100644
> --- a/arch/riscv/mm/init.c
> +++ b/arch/riscv/mm/init.c
> @@ -245,8 +245,10 @@ static void __init setup_bootmem(void)
> */
> if (!IS_ENABLED(CONFIG_64BIT)) {
> max_mapped_addr = __pa(~(ulong)0);
> - if (max_mapped_addr == (phys_ram_end - 1))
> + if (max_mapped_addr == (phys_ram_end - 1)) {
> memblock_set_current_limit(max_mapped_addr - 4096);
> + phys_ram_end -= 4096;
> + }
> }

You can just memblock_reserve() the last page of the first gigabyte, e.g.

if (!IS_ENABLED(CONFIG_64BIT))
memblock_reserve(SZ_1G - PAGE_SIZE, PAGE_SIZE);

The page will still be mapped, but it will never make it to the page
allocator.

The nice thing about it is that memblock lets you reserve regions that
are not necessarily populated, so there's no need to check where the
actual RAM ends.

>
> min_low_pfn = PFN_UP(phys_ram_base);
> --
>
> I'd really like to see an actual MM person (Mike or Alex?) have some
> input here, and not simply my pasta-on-wall approach. ;-)
>
>
> Björn

--
Sincerely yours,
Mike.

2024-04-16 15:30:51

by Nam Cao

[permalink] [raw]
Subject: Re: riscv32 EXT4 splat, 6.8 regression?

On 2024-04-16 Nam Cao wrote:
> On 2024-04-16 Mike Rapoport wrote:
> > Hi,
> >
> > On Tue, Apr 16, 2024 at 01:02:20PM +0200, Björn Töpel wrote:
> > > Christian Brauner <[email protected]> writes:
> > >
> > > > [Adding Mike who's knowledgeable in this area]
> > >
> > > >> > Further, it seems like riscv32 indeed inserts a page like that to the
> > > >> > buddy allocator, when the memblock is free'd:
> > > >> >
> > > >> > | [<c024961c>] __free_one_page+0x2a4/0x3ea
> > > >> > | [<c024a448>] __free_pages_ok+0x158/0x3cc
> > > >> > | [<c024b1a4>] __free_pages_core+0xe8/0x12c
> > > >> > | [<c0c1435a>] memblock_free_pages+0x1a/0x22
> > > >> > | [<c0c17676>] memblock_free_all+0x1ee/0x278
> > > >> > | [<c0c050b0>] mem_init+0x10/0xa4
> > > >> > | [<c0c1447c>] mm_core_init+0x11a/0x2da
> > > >> > | [<c0c00bb6>] start_kernel+0x3c4/0x6de
> > > >> >
> > > >> > Here, a page with VA 0xfffff000 is a added to the freelist. We were just
> > > >> > lucky (unlucky?) that page was used for the page cache.
> > > >>
> > > >> I just educated myself about memory mapping last night, so the below
> > > >> may be complete nonsense. Take it with a grain of salt.
> > > >>
> > > >> In riscv's setup_bootmem(), we have this line:
> > > >> max_low_pfn = max_pfn = PFN_DOWN(phys_ram_end);
> > > >>
> > > >> I think this is the root cause: max_low_pfn indicates the last page
> > > >> to be mapped. Problem is: nothing prevents PFN_DOWN(phys_ram_end) from
> > > >> getting mapped to the last page (0xfffff000). If max_low_pfn is mapped
> > > >> to the last page, we get the reported problem.
> > > >>
> > > >> There seems to be some code to make sure the last page is not used
> > > >> (the call to memblock_set_current_limit() right above this line). It is
> > > >> unclear to me why this still lets the problem slip through.
> > > >>
> > > >> The fix is simple: never let max_low_pfn gets mapped to the last page.
> > > >> The below patch fixes the problem for me. But I am not entirely sure if
> > > >> this is the correct fix, further investigation needed.
> > > >>
> > > >> Best regards,
> > > >> Nam
> > > >>
> > > >> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> > > >> index fa34cf55037b..17cab0a52726 100644
> > > >> --- a/arch/riscv/mm/init.c
> > > >> +++ b/arch/riscv/mm/init.c
> > > >> @@ -251,7 +251,8 @@ static void __init setup_bootmem(void)
> > > >> }
> > > >>
> > > >> min_low_pfn = PFN_UP(phys_ram_base);
> > > >> - max_low_pfn = max_pfn = PFN_DOWN(phys_ram_end);
> > > >> + max_low_pfn = PFN_DOWN(memblock_get_current_limit());
> > > >> + max_pfn = PFN_DOWN(phys_ram_end);
> > > >> high_memory = (void *)(__va(PFN_PHYS(max_low_pfn)));
> > > >>
> > > >> dma32_phys_limit = min(4UL * SZ_1G, (unsigned long)PFN_PHYS(max_low_pfn));
> > >
> > > Yeah, AFAIU memblock_set_current_limit() only limits the allocation from
> > > memblock. The "forbidden" page (PA 0xc03ff000 VA 0xfffff000) will still
> > > be allowed in the zone.
> > >
> > > I think your patch requires memblock_set_current_limit() is
> > > unconditionally called, which currently is not done.
> > >
> > > The hack I tried was (which seems to work):
> > >
> > > --
> > > diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> > > index fe8e159394d8..3a1f25d41794 100644
> > > --- a/arch/riscv/mm/init.c
> > > +++ b/arch/riscv/mm/init.c
> > > @@ -245,8 +245,10 @@ static void __init setup_bootmem(void)
> > > */
> > > if (!IS_ENABLED(CONFIG_64BIT)) {
> > > max_mapped_addr = __pa(~(ulong)0);
> > > - if (max_mapped_addr == (phys_ram_end - 1))
> > > + if (max_mapped_addr == (phys_ram_end - 1)) {
> > > memblock_set_current_limit(max_mapped_addr - 4096);
> > > + phys_ram_end -= 4096;
> > > + }
> > > }
> >
> > You can just memblock_reserve() the last page of the first gigabyte, e.g.
>
> "last page of the first gigabyte" - why first gigabyte? Do you mean
> last page of *last* gigabyte?
>
> > if (!IS_ENABLED(CONFIG_64BIT)
> > memblock_reserve(SZ_1G - PAGE_SIZE, PAGE_SIZE);
> >
> > The page will still be mapped, but it will never make it to the page
> > allocator.
> >
> > The nice thing about it is, that memblock lets you to reserve regions that are
> > not necessarily populated, so there's no need to check where the actual RAM
> > ends.
>
> I tried the suggested code, it didn't work. I think there are 2
> mistakes:
> - last gigabyte, not first
> - memblock_reserve() takes physical addresses as arguments, not
> virtual addresses
>
> The below patch fixes the problem. Is this what you really meant?
>
> Best regards,
> Nam
>
> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> index fa34cf55037b..ac7efdd77be8 100644
> --- a/arch/riscv/mm/init.c
> +++ b/arch/riscv/mm/init.c
> @@ -245,6 +245,7 @@ static void __init setup_bootmem(void)
> * be done as soon as the kernel mapping base address is determined.
> */
> if (!IS_ENABLED(CONFIG_64BIT)) {
> + memblock_reserve(__pa(-PAGE_SIZE), __pa(PAGE_SIZE));
 ^ oops, this argument is a
size, not an address, so using __pa() is wrong
here. Somehow it still worked.

> max_mapped_addr = __pa(~(ulong)0);
> if (max_mapped_addr == (phys_ram_end - 1))
> memblock_set_current_limit(max_mapped_addr - 4096);
>

Fixed version:

diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index fa34cf55037b..af4192bc51d0 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -245,6 +245,7 @@ static void __init setup_bootmem(void)
* be done as soon as the kernel mapping base address is determined.
*/
if (!IS_ENABLED(CONFIG_64BIT)) {
+ memblock_reserve(__pa(-PAGE_SIZE), PAGE_SIZE);
max_mapped_addr = __pa(~(ulong)0);
if (max_mapped_addr == (phys_ram_end - 1))
memblock_set_current_limit(max_mapped_addr - 4096);



2024-04-16 16:01:05

by Björn Töpel

[permalink] [raw]
Subject: Re: riscv32 EXT4 splat, 6.8 regression?

Nam Cao <[email protected]> writes:

> Fixed version:
>
> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> index fa34cf55037b..af4192bc51d0 100644
> --- a/arch/riscv/mm/init.c
> +++ b/arch/riscv/mm/init.c
> @@ -245,6 +245,7 @@ static void __init setup_bootmem(void)
> * be done as soon as the kernel mapping base address is determined.
> */
> if (!IS_ENABLED(CONFIG_64BIT)) {
> + memblock_reserve(__pa(-PAGE_SIZE), PAGE_SIZE);
> max_mapped_addr = __pa(~(ulong)0);
> if (max_mapped_addr == (phys_ram_end - 1))
> memblock_set_current_limit(max_mapped_addr - 4096);

Nice!

Can't we get rid of the if-statement, and max_mapped_address as well?


Björn

2024-04-16 16:21:16

by Mike Rapoport

[permalink] [raw]
Subject: Re: riscv32 EXT4 splat, 6.8 regression?

On Tue, Apr 16, 2024 at 05:17:13PM +0200, Nam Cao wrote:
> On 2024-04-16 Mike Rapoport wrote:
> > Hi,
> >
> > On Tue, Apr 16, 2024 at 01:02:20PM +0200, Björn Töpel wrote:
> > > Christian Brauner <[email protected]> writes:
> > >
> > > > [Adding Mike who's knowledgeable in this area]
> > >
> > > >> > Further, it seems like riscv32 indeed inserts a page like that to the
> > > >> > buddy allocator, when the memblock is free'd:
> > > >> >
> > > >> > | [<c024961c>] __free_one_page+0x2a4/0x3ea
> > > >> > | [<c024a448>] __free_pages_ok+0x158/0x3cc
> > > >> > | [<c024b1a4>] __free_pages_core+0xe8/0x12c
> > > >> > | [<c0c1435a>] memblock_free_pages+0x1a/0x22
> > > >> > | [<c0c17676>] memblock_free_all+0x1ee/0x278
> > > >> > | [<c0c050b0>] mem_init+0x10/0xa4
> > > >> > | [<c0c1447c>] mm_core_init+0x11a/0x2da
> > > >> > | [<c0c00bb6>] start_kernel+0x3c4/0x6de
> > > >> >
> > > >> > Here, a page with VA 0xfffff000 is a added to the freelist. We were just
> > > >> > lucky (unlucky?) that page was used for the page cache.
> > > >>
> > > >> I just educated myself about memory mapping last night, so the below
> > > >> may be complete nonsense. Take it with a grain of salt.
> > > >>
> > > >> In riscv's setup_bootmem(), we have this line:
> > > >> max_low_pfn = max_pfn = PFN_DOWN(phys_ram_end);
> > > >>
> > > >> I think this is the root cause: max_low_pfn indicates the last page
> > > >> to be mapped. Problem is: nothing prevents PFN_DOWN(phys_ram_end) from
> > > >> getting mapped to the last page (0xfffff000). If max_low_pfn is mapped
> > > >> to the last page, we get the reported problem.
> > > >>
> > > >> There seems to be some code to make sure the last page is not used
> > > >> (the call to memblock_set_current_limit() right above this line). It is
> > > >> unclear to me why this still lets the problem slip through.
> > > >>
> > > >> The fix is simple: never let max_low_pfn gets mapped to the last page.
> > > >> The below patch fixes the problem for me. But I am not entirely sure if
> > > >> this is the correct fix, further investigation needed.
> > > >>
> > > >> Best regards,
> > > >> Nam
> > > >>
> > > >> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> > > >> index fa34cf55037b..17cab0a52726 100644
> > > >> --- a/arch/riscv/mm/init.c
> > > >> +++ b/arch/riscv/mm/init.c
> > > >> @@ -251,7 +251,8 @@ static void __init setup_bootmem(void)
> > > >> }
> > > >>
> > > >> min_low_pfn = PFN_UP(phys_ram_base);
> > > >> - max_low_pfn = max_pfn = PFN_DOWN(phys_ram_end);
> > > >> + max_low_pfn = PFN_DOWN(memblock_get_current_limit());
> > > >> + max_pfn = PFN_DOWN(phys_ram_end);
> > > >> high_memory = (void *)(__va(PFN_PHYS(max_low_pfn)));
> > > >>
> > > >> dma32_phys_limit = min(4UL * SZ_1G, (unsigned long)PFN_PHYS(max_low_pfn));
> > >
> > > Yeah, AFAIU memblock_set_current_limit() only limits the allocation from
> > > memblock. The "forbidden" page (PA 0xc03ff000 VA 0xfffff000) will still
> > > be allowed in the zone.
> > >
> > > I think your patch requires memblock_set_current_limit() is
> > > unconditionally called, which currently is not done.
> > >
> > > The hack I tried was (which seems to work):
> > >
> > > --
> > > diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> > > index fe8e159394d8..3a1f25d41794 100644
> > > --- a/arch/riscv/mm/init.c
> > > +++ b/arch/riscv/mm/init.c
> > > @@ -245,8 +245,10 @@ static void __init setup_bootmem(void)
> > > */
> > > if (!IS_ENABLED(CONFIG_64BIT)) {
> > > max_mapped_addr = __pa(~(ulong)0);
> > > - if (max_mapped_addr == (phys_ram_end - 1))
> > > + if (max_mapped_addr == (phys_ram_end - 1)) {
> > > memblock_set_current_limit(max_mapped_addr - 4096);
> > > + phys_ram_end -= 4096;
> > > + }
> > > }
> >
> > You can just memblock_reserve() the last page of the first gigabyte, e.g.
>
> "last page of the first gigabyte" - why first gigabyte? Do you mean
> last page of *last* gigabyte?

With a 3G/1G split, the linear map can only cover 1G, from 0xc0000000
to 0xffffffff (or 0x00000000 with 32-bit overflow):

[ 0.000000] lowmem : 0xc0000000 - 0x00000000 (1024 MB)

> > if (!IS_ENABLED(CONFIG_64BIT)
> > memblock_reserve(SZ_1G - PAGE_SIZE, PAGE_SIZE);
> >
> > The page will still be mapped, but it will never make it to the page
> > allocator.
> >
> > The nice thing about it is, that memblock lets you to reserve regions that are
> > not necessarily populated, so there's no need to check where the actual RAM
> > ends.
>
> I tried the suggested code, it didn't work. I think there are 2
> mistakes:
> - last gigabyte, not first
> - memblock_reserve() takes physical addresses as arguments, not
> virtual addresses
>
> The below patch fixes the problem. Is this what you really meant?
>
> Best regards,
> Nam
>
> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> index fa34cf55037b..ac7efdd77be8 100644
> --- a/arch/riscv/mm/init.c
> +++ b/arch/riscv/mm/init.c
> @@ -245,6 +245,7 @@ static void __init setup_bootmem(void)
> * be done as soon as the kernel mapping base address is determined.
> */
> if (!IS_ENABLED(CONFIG_64BIT)) {
> + memblock_reserve(__pa(-PAGE_SIZE), __pa(PAGE_SIZE));

__pa(-PAGE_SIZE) is what I meant, yes.

> max_mapped_addr = __pa(~(ulong)0);
> if (max_mapped_addr == (phys_ram_end - 1))
> memblock_set_current_limit(max_mapped_addr - 4096);
>

--
Sincerely yours,
Mike.

2024-04-16 16:33:15

by Mike Rapoport

[permalink] [raw]
Subject: Re: riscv32 EXT4 splat, 6.8 regression?

On Tue, Apr 16, 2024 at 06:19:44PM +0200, Nam Cao wrote:
> On 2024-04-16 Björn Töpel wrote:
> > Nam Cao <[email protected]> writes:
> >
> > > Fixed version:
> > >
> > > diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> > > index fa34cf55037b..af4192bc51d0 100644
> > > --- a/arch/riscv/mm/init.c
> > > +++ b/arch/riscv/mm/init.c
> > > @@ -245,6 +245,7 @@ static void __init setup_bootmem(void)
> > > * be done as soon as the kernel mapping base address is determined.
> > > */
> > > if (!IS_ENABLED(CONFIG_64BIT)) {
> > > + memblock_reserve(__pa(-PAGE_SIZE), PAGE_SIZE);
> > > max_mapped_addr = __pa(~(ulong)0);
> > > if (max_mapped_addr == (phys_ram_end - 1))
> > > memblock_set_current_limit(max_mapped_addr - 4096);
> >
> > Nice!
> >
> > Can't we get rid of the if-statement, and max_mapped_address as well?
>
> I don't see why not :D
>
> Best regards,
> Nam
>
> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> index fa34cf55037b..f600cfee0aef 100644
> --- a/arch/riscv/mm/init.c
> +++ b/arch/riscv/mm/init.c
> @@ -197,7 +197,6 @@ early_param("mem", early_mem);
> static void __init setup_bootmem(void)
> {
> phys_addr_t vmlinux_end = __pa_symbol(&_end);
> - phys_addr_t max_mapped_addr;
> phys_addr_t phys_ram_end, vmlinux_start;
>
> if (IS_ENABLED(CONFIG_XIP_KERNEL))
> @@ -238,17 +237,9 @@ static void __init setup_bootmem(void)
> /*
> * memblock allocator is not aware of the fact that last 4K bytes of
> * the addressable memory can not be mapped because of IS_ERR_VALUE
> - * macro. Make sure that last 4k bytes are not usable by memblock
> - * if end of dram is equal to maximum addressable memory. For 64-bit
> - * kernel, this problem can't happen here as the end of the virtual
> - * address space is occupied by the kernel mapping then this check must
> - * be done as soon as the kernel mapping base address is determined.
> + * macro. Make sure that last 4k bytes are not usable by memblock.
> */

It's not only memblock, but buddy as well, so maybe

/*
* The last 4K bytes of the addressable memory can not be used
* because of IS_ERR_VALUE macro. Make sure that last 4K bytes are
* not usable by kernel memory allocators.
*/

> - if (!IS_ENABLED(CONFIG_64BIT)) {
> - max_mapped_addr = __pa(~(ulong)0);
> - if (max_mapped_addr == (phys_ram_end - 1))
> - memblock_set_current_limit(max_mapped_addr - 4096);
> - }
> + memblock_reserve(__pa(-PAGE_SIZE), PAGE_SIZE);

Ack.

> min_low_pfn = PFN_UP(phys_ram_base);
> max_low_pfn = max_pfn = PFN_DOWN(phys_ram_end);
>

--
Sincerely yours,
Mike.

2024-04-16 17:00:48

by Matthew Wilcox

[permalink] [raw]
Subject: Re: riscv32 EXT4 splat, 6.8 regression?

On Tue, Apr 16, 2024 at 07:31:54PM +0300, Mike Rapoport wrote:
> > @@ -238,17 +237,9 @@ static void __init setup_bootmem(void)
> > /*
> > * memblock allocator is not aware of the fact that last 4K bytes of
> > * the addressable memory can not be mapped because of IS_ERR_VALUE
> > - * macro. Make sure that last 4k bytes are not usable by memblock
> > - * if end of dram is equal to maximum addressable memory. For 64-bit
> > - * kernel, this problem can't happen here as the end of the virtual
> > - * address space is occupied by the kernel mapping then this check must
> > - * be done as soon as the kernel mapping base address is determined.
> > + * macro. Make sure that last 4k bytes are not usable by memblock.
> > */
>
> It's not only memblock, but buddy as well, so maybe
>
> /*
> * The last 4K bytes of the addressable memory can not be used
> * because of IS_ERR_VALUE macro. Make sure that last 4K bytes are
> * not usable by kernel memory allocators.
> */
>
> > - if (!IS_ENABLED(CONFIG_64BIT)) {
> > - max_mapped_addr = __pa(~(ulong)0);
> > - if (max_mapped_addr == (phys_ram_end - 1))
> > - memblock_set_current_limit(max_mapped_addr - 4096);
> > - }
> > + memblock_reserve(__pa(-PAGE_SIZE), PAGE_SIZE);
>
> Ack.

Can this go to generic code instead of letting architecture maintainers
fall over it?

2024-04-16 18:05:50

by Björn Töpel

[permalink] [raw]
Subject: Re: riscv32 EXT4 splat, 6.8 regression?

Nam Cao <[email protected]> writes:

> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> index fa34cf55037b..f600cfee0aef 100644
> --- a/arch/riscv/mm/init.c
> +++ b/arch/riscv/mm/init.c
> @@ -197,7 +197,6 @@ early_param("mem", early_mem);
> static void __init setup_bootmem(void)
> {
> phys_addr_t vmlinux_end = __pa_symbol(&_end);
> - phys_addr_t max_mapped_addr;
> phys_addr_t phys_ram_end, vmlinux_start;
>
> if (IS_ENABLED(CONFIG_XIP_KERNEL))
> @@ -238,17 +237,9 @@ static void __init setup_bootmem(void)
> /*
> * memblock allocator is not aware of the fact that last 4K bytes of
> * the addressable memory can not be mapped because of IS_ERR_VALUE
> - * macro. Make sure that last 4k bytes are not usable by memblock
> - * if end of dram is equal to maximum addressable memory. For 64-bit
> - * kernel, this problem can't happen here as the end of the virtual
> - * address space is occupied by the kernel mapping then this check must
> - * be done as soon as the kernel mapping base address is determined.
> + * macro. Make sure that last 4k bytes are not usable by memblock.
> */
> - if (!IS_ENABLED(CONFIG_64BIT)) {
> - max_mapped_addr = __pa(~(ulong)0);
> - if (max_mapped_addr == (phys_ram_end - 1))
> - memblock_set_current_limit(max_mapped_addr - 4096);
> - }
> + memblock_reserve(__pa(-PAGE_SIZE), PAGE_SIZE);
>
> min_low_pfn = PFN_UP(phys_ram_base);
> max_low_pfn = max_pfn = PFN_DOWN(phys_ram_end);

Nice! Would you mind submitting this as a proper fix (unless there's a
way to do it non-arch specific like Matthew pointed out).


Björn

2024-04-16 18:09:26

by Nam Cao

[permalink] [raw]
Subject: Re: riscv32 EXT4 splat, 6.8 regression?

On 2024-04-16 Björn Töpel wrote:
> Nam Cao <[email protected]> writes:
>
> > diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> > index fa34cf55037b..f600cfee0aef 100644
> > --- a/arch/riscv/mm/init.c
> > +++ b/arch/riscv/mm/init.c
> > @@ -197,7 +197,6 @@ early_param("mem", early_mem);
> > static void __init setup_bootmem(void)
> > {
> > phys_addr_t vmlinux_end = __pa_symbol(&_end);
> > - phys_addr_t max_mapped_addr;
> > phys_addr_t phys_ram_end, vmlinux_start;
> >
> > if (IS_ENABLED(CONFIG_XIP_KERNEL))
> > @@ -238,17 +237,9 @@ static void __init setup_bootmem(void)
> > /*
> > * memblock allocator is not aware of the fact that last 4K bytes of
> > * the addressable memory can not be mapped because of IS_ERR_VALUE
> > - * macro. Make sure that last 4k bytes are not usable by memblock
> > - * if end of dram is equal to maximum addressable memory. For 64-bit
> > - * kernel, this problem can't happen here as the end of the virtual
> > - * address space is occupied by the kernel mapping then this check must
> > - * be done as soon as the kernel mapping base address is determined.
> > + * macro. Make sure that last 4k bytes are not usable by memblock.
> > */
> > - if (!IS_ENABLED(CONFIG_64BIT)) {
> > - max_mapped_addr = __pa(~(ulong)0);
> > - if (max_mapped_addr == (phys_ram_end - 1))
> > - memblock_set_current_limit(max_mapped_addr - 4096);
> > - }
> > + memblock_reserve(__pa(-PAGE_SIZE), PAGE_SIZE);
> >
> > min_low_pfn = PFN_UP(phys_ram_base);
> > max_low_pfn = max_pfn = PFN_DOWN(phys_ram_end);
>
> Nice! Would you mind submitting this as a proper fix (unless there's a
> way to do it non-arch specific like Matthew pointed out).

I don't mind, but I am waiting for the discussion on the non-arch solution.

Best regards,
Nam

2024-04-16 18:19:29

by Mike Rapoport

[permalink] [raw]
Subject: Re: riscv32 EXT4 splat, 6.8 regression?

On Tue, Apr 16, 2024 at 05:31:51PM +0100, Matthew Wilcox wrote:
> On Tue, Apr 16, 2024 at 07:19:27PM +0300, Mike Rapoport wrote:
> > > "last page of the first gigabyte" - why first gigabyte? Do you mean
> > > last page of *last* gigabyte?
> >
> > With 3G-1G split linear map can map only 1G from 0xc0000000 to 0xffffffff
> > (or 0x00000000 with 32-bit overflow):
> >
> > [ 0.000000] lowmem : 0xc0000000 - 0x00000000 (1024 MB)
>
> ... but you can't map that much. You need to reserve space for (may not
> be exhaustive):
>
> - PCI BARs (or other MMIO)
> - vmap
> - kmap
> - percpu
> - ioremap
> - modules
> - fixmap
> - Maybe EFI runtime services?
>
> You'll be lucky to get 800MB of ZONE_NORMAL.

But that does not mean that the last page won't get to the buddy allocator.

--
Sincerely yours,
Mike.

2024-04-16 18:36:09

by Mike Rapoport

[permalink] [raw]
Subject: Re: riscv32 EXT4 splat, 6.8 regression?

On Tue, Apr 16, 2024 at 06:00:29PM +0100, Matthew Wilcox wrote:
> On Tue, Apr 16, 2024 at 07:31:54PM +0300, Mike Rapoport wrote:
> > > @@ -238,17 +237,9 @@ static void __init setup_bootmem(void)
> > > /*
> > > * memblock allocator is not aware of the fact that last 4K bytes of
> > > * the addressable memory can not be mapped because of IS_ERR_VALUE
> > > - * macro. Make sure that last 4k bytes are not usable by memblock
> > > - * if end of dram is equal to maximum addressable memory. For 64-bit
> > > - * kernel, this problem can't happen here as the end of the virtual
> > > - * address space is occupied by the kernel mapping then this check must
> > > - * be done as soon as the kernel mapping base address is determined.
> > > + * macro. Make sure that last 4k bytes are not usable by memblock.
> > > */
> >
> > It's not only memblock, but buddy as well, so maybe
> >
> > /*
> > * The last 4K bytes of the addressable memory can not be used
> > * because of IS_ERR_VALUE macro. Make sure that last 4K bytes are
> > * not usable by kernel memory allocators.
> > */
> >
> > > - if (!IS_ENABLED(CONFIG_64BIT)) {
> > > - max_mapped_addr = __pa(~(ulong)0);
> > > - if (max_mapped_addr == (phys_ram_end - 1))
> > > - memblock_set_current_limit(max_mapped_addr - 4096);
> > > - }
> > > + memblock_reserve(__pa(-PAGE_SIZE), PAGE_SIZE);
> >
> > Ack.
>
> Can this go to generic code instead of letting architecture maintainers
> fall over it?

Yes, it just has to happen before setup_arch(), where most architectures
enable memblock allocations.

--
Sincerely yours,
Mike.
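One way the suggestion above could be realized (purely a hypothetical sketch; the helper name and call site are assumptions of mine, not an actual upstream change) is a small generic helper run before setup_arch():

```c
/*
 * Hypothetical sketch: reserve the page whose virtual addresses alias
 * ERR_PTR values, before any memblock allocations can hand it out.
 * Would run early in start_kernel(), ahead of setup_arch().
 */
static void __init reserve_err_ptr_page(void)
{
	/*
	 * Only configurations whose linear map can reach the very top of
	 * the virtual address space (typically 32-bit) are affected.
	 */
	if (!IS_ENABLED(CONFIG_64BIT))
		memblock_reserve(__pa(-PAGE_SIZE), PAGE_SIZE);
}
```

Whether `__pa(-PAGE_SIZE)` is valid that early on every architecture is an open question in this sketch; that is exactly the kind of detail the generic placement would need to settle.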

2024-04-17 18:06:31

by Nam Cao

[permalink] [raw]
Subject: Re: riscv32 EXT4 splat, 6.8 regression?

On 2024-04-17 Theodore Ts'o wrote:
> On Wed, Apr 17, 2024 at 12:36:39AM +0200, Nam Cao wrote:
> >
> > However, I am confused about one thing: doesn't this make one page of
> > physical memory inaccessible?
>
> So are these riscv32 systems really having multiple terabytes of
> memory? Why is this page in the physical memory map in the first
> place?

It's 32-bit, so it doesn't take much to fill up the entire address space.

Here's the memory layout from kernel boot log:

[ 0.000000] Virtual kernel memory layout:
[ 0.000000] fixmap : 0x9c800000 - 0x9d000000 (8192 kB)
[ 0.000000] pci io : 0x9d000000 - 0x9e000000 ( 16 MB)
[ 0.000000] vmemmap : 0x9e000000 - 0xa0000000 ( 32 MB)
[ 0.000000] vmalloc : 0xa0000000 - 0xc0000000 ( 512 MB)
[ 0.000000] lowmem : 0xc0000000 - 0x00000000 (1024 MB)

Note that lowmem occupies the last 1GB of the address space, including
the ERR_PTR range (the last address wraps to zero).

Best regards,
Nam

2024-04-17 19:35:48

by Mike Rapoport

[permalink] [raw]
Subject: Re: riscv32 EXT4 splat, 6.8 regression?

On Wed, Apr 17, 2024 at 12:36:39AM +0200, Nam Cao wrote:
> On 2024-04-16 Mike Rapoport wrote:
> > On Tue, Apr 16, 2024 at 06:00:29PM +0100, Matthew Wilcox wrote:
> > > On Tue, Apr 16, 2024 at 07:31:54PM +0300, Mike Rapoport wrote:
> > > > > - if (!IS_ENABLED(CONFIG_64BIT)) {
> > > > > - max_mapped_addr = __pa(~(ulong)0);
> > > > > - if (max_mapped_addr == (phys_ram_end - 1))
> > > > > - memblock_set_current_limit(max_mapped_addr - 4096);
> > > > > - }
> > > > > + memblock_reserve(__pa(-PAGE_SIZE), PAGE_SIZE);
> > > >
> > > > Ack.
> > >
> > > Can this go to generic code instead of letting architecture maintainers
> > > fall over it?
> >
> > Yes, it's just have to happen before setup_arch() where most architectures
> > enable memblock allocations.
>
> This also works, the reported problem disappears.
>
> However, I am confused about one thing: doesn't this make one page of
> physical memory inaccessible?
>
> Is it better to solve this by setting max_low_pfn instead? Then at
> least the page is still accessible as high memory.

It could be if riscv had support for HIGHMEM.

> Best regards,
> Nam
>
> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> index fa34cf55037b..6e3130cae675 100644
> --- a/arch/riscv/mm/init.c
> +++ b/arch/riscv/mm/init.c
> @@ -197,7 +197,6 @@ early_param("mem", early_mem);
> static void __init setup_bootmem(void)
> {
> phys_addr_t vmlinux_end = __pa_symbol(&_end);
> - phys_addr_t max_mapped_addr;
> phys_addr_t phys_ram_end, vmlinux_start;
>
> if (IS_ENABLED(CONFIG_XIP_KERNEL))
> @@ -235,23 +234,9 @@ static void __init setup_bootmem(void)
> if (IS_ENABLED(CONFIG_64BIT))
> kernel_map.va_pa_offset = PAGE_OFFSET - phys_ram_base;
>
> - /*
> - * memblock allocator is not aware of the fact that last 4K bytes of
> - * the addressable memory can not be mapped because of IS_ERR_VALUE
> - * macro. Make sure that last 4k bytes are not usable by memblock
> - * if end of dram is equal to maximum addressable memory. For 64-bit
> - * kernel, this problem can't happen here as the end of the virtual
> - * address space is occupied by the kernel mapping then this check must
> - * be done as soon as the kernel mapping base address is determined.
> - */
> - if (!IS_ENABLED(CONFIG_64BIT)) {
> - max_mapped_addr = __pa(~(ulong)0);
> - if (max_mapped_addr == (phys_ram_end - 1))
> - memblock_set_current_limit(max_mapped_addr - 4096);

To be strictly precise about the conflict between mapping a page at
0xfffff000 and IS_ERR_VALUE, memblock should not allocate that page, so
memblock_set_current_limit() should remain. It does not need all the
surrounding if, though; just setting the limit to -PAGE_SIZE should do.

Although I suspect that this call to memblock_set_current_limit() is what
caused the splat in ext4. Without that limit enforcement, the last page
would be the first one memblock allocates, and it most likely would have
ended up in the kernel page tables and never been checked with IS_ERR. With
the limit set, that page made it to the buddy allocator and got allocated
by code that actually does IS_ERR checks.

> - }
> -
> min_low_pfn = PFN_UP(phys_ram_base);
> - max_low_pfn = max_pfn = PFN_DOWN(phys_ram_end);
> + max_pfn = PFN_DOWN(phys_ram_end);
> + max_low_pfn = min(max_pfn, PFN_DOWN(__pa(-PAGE_SIZE)));
> high_memory = (void *)(__va(PFN_PHYS(max_low_pfn)));
>
> dma32_phys_limit = min(4UL * SZ_1G, (unsigned long)PFN_PHYS(max_low_pfn));

--
Sincerely yours,
Mike.

2024-04-17 22:10:52

by Andreas Dilger

[permalink] [raw]
Subject: Re: riscv32 EXT4 splat, 6.8 regression?

On Apr 16, 2024, at 4:36 PM, Nam Cao <[email protected]> wrote:
>
> On 2024-04-16 Mike Rapoport wrote:
>> On Tue, Apr 16, 2024 at 06:00:29PM +0100, Matthew Wilcox wrote:
>>> On Tue, Apr 16, 2024 at 07:31:54PM +0300, Mike Rapoport wrote:
>>>>> - if (!IS_ENABLED(CONFIG_64BIT)) {
>>>>> - max_mapped_addr = __pa(~(ulong)0);
>>>>> - if (max_mapped_addr == (phys_ram_end - 1))
>>>>> - memblock_set_current_limit(max_mapped_addr - 4096);
>>>>> - }
>>>>> + memblock_reserve(__pa(-PAGE_SIZE), PAGE_SIZE);
>>>>
>>>> Ack.
>>>
>>> Can this go to generic code instead of letting architecture maintainers
>>> fall over it?
>>
>> Yes, it's just have to happen before setup_arch() where most architectures
>> enable memblock allocations.
>
> This also works, the reported problem disappears.
>
> However, I am confused about one thing: doesn't this make one page of
> physical memory inaccessible?
>
> Is it better to solve this by setting max_low_pfn instead? Then at
> least the page is still accessible as high memory.

Is that one page of memory really worthwhile to preserve? Better to
have a simple solution that works, maybe even mapping that page
read-only so that any code which tries to dereference an ERR_PTR
address immediately gets a fault?

Cheers, Andreas

>
> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> index fa34cf55037b..6e3130cae675 100644
> --- a/arch/riscv/mm/init.c
> +++ b/arch/riscv/mm/init.c
> @@ -197,7 +197,6 @@ early_param("mem", early_mem);
> static void __init setup_bootmem(void)
> {
> phys_addr_t vmlinux_end = __pa_symbol(&_end);
> - phys_addr_t max_mapped_addr;
> phys_addr_t phys_ram_end, vmlinux_start;
>
> if (IS_ENABLED(CONFIG_XIP_KERNEL))
> @@ -235,23 +234,9 @@ static void __init setup_bootmem(void)
> if (IS_ENABLED(CONFIG_64BIT))
> kernel_map.va_pa_offset = PAGE_OFFSET - phys_ram_base;
>
> - /*
> - * memblock allocator is not aware of the fact that last 4K bytes of
> - * the addressable memory can not be mapped because of IS_ERR_VALUE
> - * macro. Make sure that last 4k bytes are not usable by memblock
> - * if end of dram is equal to maximum addressable memory. For 64-bit
> - * kernel, this problem can't happen here as the end of the virtual
> - * address space is occupied by the kernel mapping then this check must
> - * be done as soon as the kernel mapping base address is determined.
> - */
> - if (!IS_ENABLED(CONFIG_64BIT)) {
> - max_mapped_addr = __pa(~(ulong)0);
> - if (max_mapped_addr == (phys_ram_end - 1))
> - memblock_set_current_limit(max_mapped_addr - 4096);
> - }
> -
> min_low_pfn = PFN_UP(phys_ram_base);
> - max_low_pfn = max_pfn = PFN_DOWN(phys_ram_end);
> + max_pfn = PFN_DOWN(phys_ram_end);
> + max_low_pfn = min(max_pfn, PFN_DOWN(__pa(-PAGE_SIZE)));
> high_memory = (void *)(__va(PFN_PHYS(max_low_pfn)));
>
> dma32_phys_limit = min(4UL * SZ_1G, (unsigned long)PFN_PHYS(max_low_pfn));



2024-04-18 09:18:07

by Nam Cao

[permalink] [raw]
Subject: Re: riscv32 EXT4 splat, 6.8 regression?

On 2024-04-17 Andreas Dilger wrote:
> On Apr 16, 2024, at 4:36 PM, Nam Cao <[email protected]> wrote:
> > However, I am confused about one thing: doesn't this make one page of
> > physical memory inaccessible?
> >
> > Is it better to solve this by setting max_low_pfn instead? Then at
> > least the page is still accessible as high memory.
>
> Is that one page of memory really worthwhile to preserve? Better to
> have a simple solution that works,

Good point.

> maybe even mapping that page
> read-only so that any code which tries to dereference an ERR_PTR
> address immediately gets a fault?

Not sure about this part: it doesn't really fix the problem, just turns
subtle crashes into page faults.

Let me send a patch to reserve the page: simple and works for all
architectures.

Best regards,
Nam