2021-11-04 11:54:57

by David Sterba

[permalink] [raw]
Subject: Kmap-related crashes and memory leaks on 32bit arch (5.15+)

Hi,

while validating the kmap related fixes on 32bit we've started to hit crashes
and memory leaks. The suspicion was that the btrfs fix was incomplete or buggy
but by code inspection there's nothing obviously wrong.

The crash I've hit:

[ 538.853152] BTRFS: device fsid 26fb6ae8-1679-4ed8-a5b5-57b5acdd1ab8 devid 1 transid 5 /dev/vda scanned by mkfs.btrfs (117)
[ 538.868760] BTRFS info (device vda): flagging fs with big metadata feature
[ 538.870143] BTRFS info (device vda): setting incompat feature flag for COMPRESS_LZO (0x8)
[ 538.871799] BTRFS info (device vda): force lzo compression, level 0
[ 538.872990] BTRFS info (device vda): disk space caching is enabled
[ 538.874186] BTRFS info (device vda): has skinny extents
[ 538.880477] BTRFS info (device vda): checking UUID tree
[ 588.203222] BUG: unable to handle page fault for address: 01460012
[ 588.205004] #PF: supervisor read access in kernel mode
[ 588.206496] #PF: error_code(0x0000) - not-present page
[ 588.208221] *pde = 00000000
[ 588.209178] Oops: 0000 [#1] PREEMPT SMP
[ 588.210424] CPU: 0 PID: 143 Comm: kworker/u4:5 Not tainted 5.15.0-default+ #4
[ 588.212705] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a-rebuilt.opensuse.org 04/01/2014
[ 588.216108] Workqueue: btrfs-delalloc btrfs_work_helper [btrfs]
[ 588.218077] EIP: copy_compressed_data_to_page+0x14a/0x380 [btrfs]
[ 588.225384] EAX: 00000000 EBX: 01460012 ECX: 00000000 EDX: 00000000
[ 588.227149] ESI: 00001000 EDI: d3f3f200 EBP: c3767d90 ESP: c3767d64
[ 588.228871] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00010246
[ 588.230722] CR0: 80050033 CR2: 01460012 CR3: 08f6d000 CR4: 00150e90
[ 588.232380] Call Trace:
[ 588.233099] lzo_compress_pages+0x148/0x3a0 [btrfs]
[ 588.234492] btrfs_compress_pages+0xc2/0x160 [btrfs]
[ 588.235429] compress_file_range+0x455/0xae0 [btrfs]
[ 588.235909] async_cow_start+0x13/0x40 [btrfs]
[ 588.236344] btrfs_work_helper+0xcb/0x1e0 [btrfs]
[ 588.236809] ? submit_compressed_extents+0xa0/0xa0 [btrfs]
[ 588.237320] process_one_work+0x23c/0x580
[ 588.238015] ? process_one_work+0x1ac/0x580
[ 588.238857] ? worker_thread+0x75/0x3f0
[ 588.239712] ? submit_compressed_extents+0xa0/0xa0 [btrfs]
[ 588.240836] worker_thread+0x16e/0x3f0
[ 588.241607] kthread+0x144/0x170
[ 588.242283] ? process_one_work+0x580/0x580
[ 588.242999] ? set_kthread_struct+0x40/0x40
[ 588.243357] ret_from_fork+0x19/0x24
[ 588.243654] Modules linked in: btrfs blake2b_generic libcrc32c crc32c_intel xor zstd_decompress zstd_compress xxhash lzo_compress lzo_decompress raid6_pq
[ 588.244621] CR2: 0000000001460012
[ 588.244900] ---[ end trace 76c69d1a25a61f8d ]---
[ 588.245259] EIP: copy_compressed_data_to_page+0x14a/0x380 [btrfs]
[ 588.248681] EAX: 00000000 EBX: 01460012 ECX: 00000000 EDX: 00000000
[ 588.249516] ESI: 00001000 EDI: d3f3f200 EBP: c3767d90 ESP: c3767d64
[ 588.250515] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00010246
[ 588.251469] CR0: 80050033 CR2: 01460012 CR3: 08f6d000 CR4: 00150e90
[ 588.252238] BUG: sleeping function called from invalid context at include/linux/percpu-rwsem.h:49
[ 588.253349] in_atomic(): 0, irqs_disabled(): 1, non_block: 0, pid: 143, name: kworker/u4:5
[ 588.254498] preempt_count: 0, expected: 0
[ 588.255081] RCU nest depth: 0, expected: 0
[ 588.255651] INFO: lockdep is turned off.
[ 588.256225] irq event stamp: 508952
[ 588.256724] hardirqs last enabled at (508951): [<db21cca7>] rmqueue.isra.0+0x3e7/0xa30
[ 588.257773] hardirqs last disabled at (508952): [<db7d1616>] exc_page_fault+0x36/0x200
[ 588.258823] softirqs last enabled at (508546): [<db7df182>] __do_softirq+0x2c2/0x51d
[ 588.259987] softirqs last disabled at (508541): [<db02c827>] do_softirq_own_stack+0x37/0x50
[ 588.261434] CPU: 0 PID: 143 Comm: kworker/u4:5 Tainted: G D 5.15.0-default+ #4
[ 588.262806] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a-rebuilt.opensuse.org 04/01/2014
[ 588.264455] Workqueue: btrfs-delalloc btrfs_work_helper [btrfs]
[ 588.265269] Call Trace:
[ 588.265643] ? show_stack+0x3d/0x45
[ 588.266180] dump_stack_lvl+0x44/0x57
[ 588.266714] dump_stack+0xd/0x10
[ 588.267200] __might_resched.cold+0x15f/0x189
[ 588.267775] ? wake_up_klogd+0x3a/0xa0
[ 588.268288] __might_sleep+0x4f/0xa0
[ 588.268824] exit_signals+0x1f/0x380
[ 588.269349] ? worker_thread+0x16e/0x3f0
[ 588.269886] do_exit+0x94/0x430
[ 588.270337] ? kthread+0x144/0x170
[ 588.270831] ? process_one_work+0x580/0x580
[ 588.271402] rewind_stack_do_exit+0x11/0x13

Reproducer:

- enable lzo on btrfs
- copy files with compressible contents (/usr was sufficient to trigger it)

$ cat test-lzo
#!/bin/sh

modprobe btrfs
mkfs.btrfs /dev/vda
mount -o compress-force=lzo /dev/vda /mnt

cd /mnt
for i in `seq 5`; do
cp -a /usr $RANDOM
dd if=/dev/zero of=zerofill$RANDOM bs=1M count=40000
sync
df -h .
done
---

I've was mainly interested to validate v5.15 + 2cf3f8133bda ("btrfs: fix
lzo_decompress_bio() kmap leakage") as this would end up in stable tree.
This does not trigger the crash nor the leaks.

We were testing master branch commit of the day and that showed up the
problems. The folios were my first suspicion so I tested that
separately. There's a fixup needed, the tested branch was
49f8275c7d9247cf + e66435936756d9bc, no crash or leak.

The rest is about 410 merge commits to the point we started testing so
I'd rather get some hint where to start bisecting as it would probably
need to be manual. The top commit was dcd68326d29b62f303.

I have enabled various debugging config options plus some relevant to
kmap/highmem:

# CONFIG_NOHIGHMEM is not set
CONFIG_HIGHMEM4G=y
# CONFIG_HIGHMEM64G is not set
CONFIG_HIGHMEM=y
CONFIG_KMAP_LOCAL=y
# CONFIG_INPUT_SPARSEKMAP is not set
# CONFIG_INPUT_MATRIXKMAP is not set
CONFIG_DEBUG_KMAP_LOCAL=y
CONFIG_ARCH_SUPPORTS_KMAP_LOCAL_FORCE_MAP=y
CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP=y
CONFIG_DEBUG_HIGHMEM=y

The test machine is qemu (i386) VM with 2G or memory and 2 CPUs.

For the memory leak part, that's what Qu Wenruo observed when testing
fstests, I don't have exact details here, but basically used memory was
growing after the workload even after unloading btrfs module or dropping
caches.


2021-11-04 23:40:34

by Linus Torvalds

[permalink] [raw]
Subject: Re: Kmap-related crashes and memory leaks on 32bit arch (5.15+)

On Thu, Nov 4, 2021 at 3:09 PM Linus Torvalds
<[email protected]> wrote:
>
> I'm obviously not on my laptop right now, but I did look at the btrfs
> lzo code earlier today again, and didn't see anything that looked
> remotely suspicious.

Ok, back at the laptop, and once again looking at this.

I'm stumped.

So if I understand correctly,

5.15 + 2cf3f8133bda ("btrfs: fix lzo_decompress_bio() kmap leakage")

is fine.

Also, 5.15 with the folio merge, plus the fix for that (commit
e66435936756: "mm: fix mismerge of folio page flag manipulators") is
fine too. I

The tested "tip of the day" that was bad was dcd68326d29b ("Merge tag
'devicetree-for-5.16' of git://...").

Since you already tested the folio merge, there really isn't a whole
lot of mm changes left in there. Andrew hasn't sent his patch-bombs
yet.

Doing a

gitk 49f8275c7d9247cf..037c50bfbeb33b \
mm/ include/linux/highmem* fs/btrfs/

really doesn't show anything that looks particularly suspicious.
There's some sync_bdev() changes, and some polling changes, but they
look trivial.

The only half-way relevant thing that remains is my merge, which very
much had conflicts around kmap/kunmap because of the revert problems.

So I continue to think that I must have screwed up, but I still don't
see which kmap/kunmap would be wrong.

I'll just repeat my suggestion here since the original email didn't go
to the lists.

> (a) test your side before my merge with your alternate kmap fix
> commit (the one you had in the other branch to make it allegedly
> easier to resolve)?
>
> (b) if that works, re-do the merge using that kmap pattern?

Your kmap() pattern is slightly different from mine. I tried to avoid
an unnecessary "kmap again" in copy_compressed_data_to_page(), so my
kmap lifetime is slightly longer over that loop.

Having looked at it once more, it still looks "ObviouslyCorrect(tm)"
to me, but I suspect I'm just being blind to some obvious bug.

> If (a) works, but (b) still fails, then it must be some odd
> interaction issue with something else. Which sounds unlikely, since I
> don't think we really had anything that should affect kmap or anything
> in this area, but who knows...

And bisection ends up perhaps somewhat painful, but sounds like the
way to go if there's no other path forward.

Linus

2021-11-04 23:51:23

by Linus Torvalds

[permalink] [raw]
Subject: Re: Kmap-related crashes and memory leaks on 32bit arch (5.15+)

On Thu, Nov 4, 2021 at 4:37 PM Linus Torvalds
<[email protected]> wrote:
>
> Having looked at it once more, it still looks "ObviouslyCorrect(tm)"
> to me, but I suspect I'm just being blind to some obvious bug.

Oh, I was just looking at the pattern of kmap/kunmap, but there does
seem to be a questionable pattern outside of that:

This pattern looks odd:

kaddr = kmap(cur_page);
write_compress_length(kaddr + offset_in_page(*cur_out),
compressed_size);
...

(and then whether you kunmap immediately, or you leave it kmap'ed and
use it again at the end is a different issue)

In particular, what if "offset_in_page(*cur_out)" is very close to the
end of the page?

That write_compress_length() always writes out a word-sized length (ie
LZO_LEN bytes), and the above pattern seems to have no model to handle
the "oh, this 4-byte write crosses a page boundary"

The other writes in that function seem to do it properly, and you have

u32 copy_len = min_t(u32, sectorsize - *cur_out % sectorsize,
orig_out + compressed_size - *cur_out);

so doing the memcpy() of size 'copy_len' should never cross a page
boundary as long as sectorsize is a power-of-2 smaller or equal to a
page size. But those 4-byte length writes seem like they could be page
crossers.

The same situation exists on the reading side, I think.

Maybe there's some reason why the read/write_compress_length() can
never cross a page boundary, but it did strike me as odd.

Linus

2021-11-05 00:04:23

by Qu Wenruo

[permalink] [raw]
Subject: Re: Kmap-related crashes and memory leaks on 32bit arch (5.15+)



On 2021/11/5 07:48, Linus Torvalds wrote:
> On Thu, Nov 4, 2021 at 4:37 PM Linus Torvalds
> <[email protected]> wrote:
>>
>> Having looked at it once more, it still looks "ObviouslyCorrect(tm)"
>> to me, but I suspect I'm just being blind to some obvious bug.
>
> Oh, I was just looking at the pattern of kmap/kunmap, but there does
> seem to be a questionable pattern outside of that:
>
> This pattern looks odd:
>
> kaddr = kmap(cur_page);
> write_compress_length(kaddr + offset_in_page(*cur_out),
> compressed_size);
> ...
>
> (and then whether you kunmap immediately, or you leave it kmap'ed and
> use it again at the end is a different issue)

That part is paired with the the following code, to prevent we cross
page boundary for the segment header:

/*
* Check if we can fit the next segment header into the remaining space
* of the sector.
*/
sector_bytes_left = round_up(*cur_out, sectorsize) - *cur_out;
if (sector_bytes_left >= LZO_LEN || sector_bytes_left == 0)
goto out;

/* The remaining size is not enough, pad it with zeros */
memset(kaddr + offset_in_page(*cur_out), 0,
sector_bytes_left);
*cur_out += sector_bytes_left;


So we always ensure that each segment header never crosses the page
boundary.

This behavior is a little tricky but is part of the on-disk format for
lzo compressed data.


BTW, I also thought that part can be suspicious, as it keeps the page
mapped (unlike all other call sites), thus I tried the following diff,
but no difference for the leakage:

diff --git a/fs/btrfs/lzo.c b/fs/btrfs/lzo.c
index 65cb0766e62d..0648acc48310 100644
--- a/fs/btrfs/lzo.c
+++ b/fs/btrfs/lzo.c
@@ -151,6 +151,7 @@ static int copy_compressed_data_to_page(char
*compressed_data,
kaddr = kmap(cur_page);
write_compress_length(kaddr + offset_in_page(*cur_out),
compressed_size);
+ kunmap(cur_page);
*cur_out += LZO_LEN;

orig_out = *cur_out;
@@ -160,7 +161,6 @@ static int copy_compressed_data_to_page(char
*compressed_data,
u32 copy_len = min_t(u32, sectorsize - *cur_out % sectorsize,
orig_out + compressed_size - *cur_out);

- kunmap(cur_page);
cur_page = out_pages[*cur_out / PAGE_SIZE];
/* Allocate a new page */
if (!cur_page) {
@@ -173,6 +173,7 @@ static int copy_compressed_data_to_page(char
*compressed_data,

memcpy(kaddr + offset_in_page(*cur_out),
compressed_data + *cur_out - orig_out, copy_len);
+ kunmap(cur_page);

*cur_out += copy_len;
}
@@ -186,12 +187,15 @@ static int copy_compressed_data_to_page(char
*compressed_data,
goto out;

/* The remaining size is not enough, pad it with zeros */
+ cur_page = out_pages[*cur_out / PAGE_SIZE];
+ ASSERT(cur_page);
+ kmap(cur_page);
memset(kaddr + offset_in_page(*cur_out), 0,
sector_bytes_left);
+ kunmap(cur_page);
*cur_out += sector_bytes_left;

out:
- kunmap(cur_page);
return 0;
}

Thanks,
Qu
>
> In particular, what if "offset_in_page(*cur_out)" is very close to the
> end of the page?
>
> That write_compress_length() always writes out a word-sized length (ie
> LZO_LEN bytes), and the above pattern seems to have no model to handle
> the "oh, this 4-byte write crosses a page boundary"
>
> The other writes in that function seem to do it properly, and you have
>
> u32 copy_len = min_t(u32, sectorsize - *cur_out % sectorsize,
> orig_out + compressed_size - *cur_out);
>
> so doing the memcpy() of size 'copy_len' should never cross a page
> boundary as long as sectorsize is a power-of-2 smaller or equal to a
> page size. But those 4-byte length writes seem like they could be page
> crossers.
>
> The same situation exists on the reading side, I think.
>
> Maybe there's some reason why the read/write_compress_length() can
> never cross a page boundary, but it did strike me as odd.
>
> Linus
>

2021-11-05 16:15:15

by Ira Weiny

[permalink] [raw]
Subject: Re: Kmap-related crashes and memory leaks on 32bit arch (5.15+)

On Fri, Nov 05, 2021 at 08:01:13AM +0800, Qu Wenruo wrote:
>

[snip]

>
>
> BTW, I also thought that part can be suspicious, as it keeps the page mapped
> (unlike all other call sites), thus I tried the following diff, but no
> difference for the leakage:

I just saw this thread and I was wondering why can't kmap_local_page() be used?

I know we are trying to remove highmem from the kernel but the DAX stray write
protection I've been working on depends on the kmap interface to ensure that
DAX pages are properly accessed.[1] Also there are a couple of new helpers
which could be used instead of the changes below.

I know this does not solve the problem Linus is seeing and DAX is not yet
supported on btrfs but I know there has been some effort to get it working and
things like commit ...

8c945d32e604 ("btrfs: compression: drop kmap/kunmap from lzo")

... would break that support.

>
> diff --git a/fs/btrfs/lzo.c b/fs/btrfs/lzo.c
> index 65cb0766e62d..0648acc48310 100644
> --- a/fs/btrfs/lzo.c
> +++ b/fs/btrfs/lzo.c
> @@ -151,6 +151,7 @@ static int copy_compressed_data_to_page(char
> *compressed_data,
> kaddr = kmap(cur_page);
> write_compress_length(kaddr + offset_in_page(*cur_out),
> compressed_size);
> + kunmap(cur_page);
> *cur_out += LZO_LEN;
>
> orig_out = *cur_out;
> @@ -160,7 +161,6 @@ static int copy_compressed_data_to_page(char
> *compressed_data,
> u32 copy_len = min_t(u32, sectorsize - *cur_out % sectorsize,
> orig_out + compressed_size - *cur_out);
>
> - kunmap(cur_page);
> cur_page = out_pages[*cur_out / PAGE_SIZE];
> /* Allocate a new page */
> if (!cur_page) {
> @@ -173,6 +173,7 @@ static int copy_compressed_data_to_page(char
> *compressed_data,
>
> memcpy(kaddr + offset_in_page(*cur_out),
> compressed_data + *cur_out - orig_out, copy_len);
> + kunmap(cur_page);

memcpy_to_page()?

>
> *cur_out += copy_len;
> }
> @@ -186,12 +187,15 @@ static int copy_compressed_data_to_page(char
> *compressed_data,
> goto out;
>
> /* The remaining size is not enough, pad it with zeros */
> + cur_page = out_pages[*cur_out / PAGE_SIZE];
> + ASSERT(cur_page);
> + kmap(cur_page);
> memset(kaddr + offset_in_page(*cur_out), 0,
> sector_bytes_left);
> + kunmap(cur_page);

memzero_page()?

Just my $0.02 given I've been trying to remove kmap() uses and btrfs remains
one of the big users of kmap().

Thanks,
Ira

[1] https://lore.kernel.org/lkml/[email protected]/

> *cur_out += sector_bytes_left;
>
> out:
> - kunmap(cur_page);
> return 0;
> }
>
> Thanks,
> Qu
> >
> > In particular, what if "offset_in_page(*cur_out)" is very close to the
> > end of the page?
> >
> > That write_compress_length() always writes out a word-sized length (ie
> > LZO_LEN bytes), and the above pattern seems to have no model to handle
> > the "oh, this 4-byte write crosses a page boundary"
> >
> > The other writes in that function seem to do it properly, and you have
> >
> > u32 copy_len = min_t(u32, sectorsize - *cur_out % sectorsize,
> > orig_out + compressed_size - *cur_out);
> >
> > so doing the memcpy() of size 'copy_len' should never cross a page
> > boundary as long as sectorsize is a power-of-2 smaller or equal to a
> > page size. But those 4-byte length writes seem like they could be page
> > crossers.
> >
> > The same situation exists on the reading side, I think.
> >
> > Maybe there's some reason why the read/write_compress_length() can
> > never cross a page boundary, but it did strike me as odd.
> >
> > Linus
> >
>
>

2021-11-05 20:30:14

by David Sterba

[permalink] [raw]
Subject: Re: Kmap-related crashes and memory leaks on 32bit arch (5.15+)

On Thu, Nov 04, 2021 at 04:37:25PM -0700, Linus Torvalds wrote:
> On Thu, Nov 4, 2021 at 3:09 PM Linus Torvalds
> <[email protected]> wrote:
> > If (a) works, but (b) still fails, then it must be some odd
> > interaction issug withs-----ing else. Which sounds unlikely, since I
> > don't think we really had anything that should affect kmap or anything
> > in this area, but who knows...
>
> And bisection ends up perhaps somewhat painful, but sounds like the
> way to go if there's no other path forward.

Just to give an update, I tested several merge commits and the btrfs
merge is the first bad (037c50bfbeb33b4c).

Last good is the one right before that,

9c6e8d52a7299 Merge tag 'exfat-for-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat
(plus the fixup to make it compile e66435936756d9bce)

The remaining test to do is the merge conflict resolved by me, as you
suggested.

2021-11-16 15:43:50

by David Sterba

[permalink] [raw]
Subject: Re: Kmap-related crashes and memory leaks on 32bit arch (5.15+)

On Fri, Nov 05, 2021 at 08:50:04PM +0100, David Sterba wrote:
> On Thu, Nov 04, 2021 at 04:37:25PM -0700, Linus Torvalds wrote:
> > On Thu, Nov 4, 2021 at 3:09 PM Linus Torvalds
> > <[email protected]> wrote:
> > > If (a) works, but (b) still fails, then it must be some odd
> > > interaction issug withs-----ing else. Which sounds unlikely, since I
> > > don't think we really had anything that should affect kmap or anything
> > > in this area, but who knows...
> >
> > And bisection ends up perhaps somewhat painful, but sounds like the
> > way to go if there's no other path forward.
>
> Just to give an update, I tested several merge commits and the btrfs
> merge is the first bad (037c50bfbeb33b4c).
>
> Last good is the one right before that,
>
> 9c6e8d52a7299 Merge tag 'exfat-for-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat
> (plus the fixup to make it compile e66435936756d9bce)
>
> The remaining test to do is the merge conflict resolved by me, as you
> suggested.

So the result is that the merge conflict from Linus resolved the kmaps
correctly, there was a bug in the lzo refactoring patch itself that
caused some page array overflow and some random address was accessed.

I'll send a pull request in a day, I've been validating the patch on the
unmerged base and on rc1, also with various debugging options on.
Curiously, with SLOB there's an early crash while still doing the
compression, but with SLUB (and debugging enabled) there is not an
immediate crash but some other warning/assertion notices a page with
mapping (unexpected).