Hello,
syzbot found the following issue on:
HEAD commit: 80bd9028feca Add linux-next specific files for 20230131
git tree: linux-next
console output: https://syzkaller.appspot.com/x/log.txt?x=1468e369480000
kernel config: https://syzkaller.appspot.com/x/.config?x=904dc2f450eaad4a
dashboard link: https://syzkaller.appspot.com/bug?extid=a440341a59e3b7142895
compiler: gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2
syz repro: https://syzkaller.appspot.com/x/repro.syz?x=12c5d2be480000
C reproducer: https://syzkaller.appspot.com/x/repro.c?x=11259a79480000
Downloadable assets:
disk image: https://storage.googleapis.com/syzbot-assets/924618188238/disk-80bd9028.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/7a03cf86e545/vmlinux-80bd9028.xz
kernel image: https://storage.googleapis.com/syzbot-assets/568e80043a41/bzImage-80bd9028.xz
The issue was bisected to:
commit 920756a3306a35f1c08f25207d375885bef98975
Author: David Howells <[email protected]>
Date: Sat Jan 21 12:51:18 2023 +0000
block: Convert bio_iov_iter_get_pages to use iov_iter_extract_pages
bisection log: https://syzkaller.appspot.com/x/bisect.txt?x=170384f9480000
final oops: https://syzkaller.appspot.com/x/report.txt?x=148384f9480000
console output: https://syzkaller.appspot.com/x/log.txt?x=108384f9480000
IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: [email protected]
Fixes: 920756a3306a ("block: Convert bio_iov_iter_get_pages to use iov_iter_extract_pages")
general protection fault, probably for non-canonical address 0xdffffc0000000001: 0000 [#1] PREEMPT SMP KASAN
KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f]
CPU: 0 PID: 2838 Comm: kworker/u4:6 Not tainted 6.2.0-rc6-next-20230131-syzkaller-09515-g80bd9028feca #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/12/2023
Workqueue: phy4 ieee80211_iface_work
RIP: 0010:__skb_unlink include/linux/skbuff.h:2321 [inline]
RIP: 0010:__skb_dequeue include/linux/skbuff.h:2337 [inline]
RIP: 0010:skb_dequeue+0xf5/0x180 net/core/skbuff.c:3511
Code: 8d 7e 08 49 8b 5c 24 08 48 b8 00 00 00 00 00 fc ff df 49 c7 44 24 08 00 00 00 00 48 89 fa 49 c7 04 24 00 00 00 00 48 c1 ea 03 <80> 3c 02 00 75 6d 48 89 da 49 89 5e 08 48 b8 00 00 00 00 00 fc ff
RSP: 0018:ffffc9000ca2fc80 EFLAGS: 00010002
RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffffffff8808951d RDI: 0000000000000008
RBP: 0000000000000293 R08: 0000000000000001 R09: 0000000000000003
R10: fffff52001945f7e R11: 0000000000000000 R12: ffff88801d8f63c0
R13: ffff888075675880 R14: 0000000000000000 R15: ffff888075675868
FS: 0000000000000000(0000) GS:ffff8880b9800000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f4a51f6d150 CR3: 0000000072a78000 CR4: 00000000003506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
ieee80211_iface_work+0x369/0xd70 net/mac80211/iface.c:1631
process_one_work+0x9bf/0x1820 kernel/workqueue.c:2390
worker_thread+0x669/0x1090 kernel/workqueue.c:2537
kthread+0x2e8/0x3a0 kernel/kthread.c:376
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308
</TASK>
Modules linked in:
---[ end trace 0000000000000000 ]---
RIP: 0010:__skb_unlink include/linux/skbuff.h:2321 [inline]
RIP: 0010:__skb_dequeue include/linux/skbuff.h:2337 [inline]
RIP: 0010:skb_dequeue+0xf5/0x180 net/core/skbuff.c:3511
Code: 8d 7e 08 49 8b 5c 24 08 48 b8 00 00 00 00 00 fc ff df 49 c7 44 24 08 00 00 00 00 48 89 fa 49 c7 04 24 00 00 00 00 48 c1 ea 03 <80> 3c 02 00 75 6d 48 89 da 49 89 5e 08 48 b8 00 00 00 00 00 fc ff
RSP: 0018:ffffc9000ca2fc80 EFLAGS: 00010002
RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffffffff8808951d RDI: 0000000000000008
RBP: 0000000000000293 R08: 0000000000000001 R09: 0000000000000003
R10: fffff52001945f7e R11: 0000000000000000 R12: ffff88801d8f63c0
R13: ffff888075675880 R14: 0000000000000000 R15: ffff888075675868
FS: 0000000000000000(0000) GS:ffff8880b9800000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f4a51f6d150 CR3: 0000000072a78000 CR4: 00000000003506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
----------------
Code disassembly (best guess):
0: 8d 7e 08 lea 0x8(%rsi),%edi
3: 49 8b 5c 24 08 mov 0x8(%r12),%rbx
8: 48 b8 00 00 00 00 00 movabs $0xdffffc0000000000,%rax
f: fc ff df
12: 49 c7 44 24 08 00 00 movq $0x0,0x8(%r12)
19: 00 00
1b: 48 89 fa mov %rdi,%rdx
1e: 49 c7 04 24 00 00 00 movq $0x0,(%r12)
25: 00
26: 48 c1 ea 03 shr $0x3,%rdx
* 2a: 80 3c 02 00 cmpb $0x0,(%rdx,%rax,1) <-- trapping instruction
2e: 75 6d jne 0x9d
30: 48 89 da mov %rbx,%rdx
33: 49 89 5e 08 mov %rbx,0x8(%r14)
37: 48 rex.W
38: b8 00 00 00 00 mov $0x0,%eax
3d: 00 fc add %bh,%ah
3f: ff .byte 0xff
---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at [email protected].
syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
For information about bisection process see: https://goo.gl/tpsmEJ#bisection
syzbot can test patches for this issue, for details see:
https://goo.gl/tpsmEJ#testing-patches
Hi John, David,
Could you have a look at this?
> syzbot found the following issue on:
>
> HEAD commit: 80bd9028feca Add linux-next specific files for 20230131
> git tree: linux-next
> console output: https://syzkaller.appspot.com/x/log.txt?x=1468e369480000
> kernel config: https://syzkaller.appspot.com/x/.config?x=904dc2f450eaad4a
> dashboard link: https://syzkaller.appspot.com/bug?extid=a440341a59e3b7142895
> compiler: gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2
> syz repro: https://syzkaller.appspot.com/x/repro.syz?x=12c5d2be480000
> C reproducer: https://syzkaller.appspot.com/x/repro.c?x=11259a79480000
> ...
> The issue was bisected to:
>
> commit 920756a3306a35f1c08f25207d375885bef98975
> Author: David Howells <[email protected]>
> Date: Sat Jan 21 12:51:18 2023 +0000
>
> block: Convert bio_iov_iter_get_pages to use iov_iter_extract_pages
>
> bisection log: https://syzkaller.appspot.com/x/bisect.txt?x=170384f9480000
> final oops: https://syzkaller.appspot.com/x/report.txt?x=148384f9480000
> console output: https://syzkaller.appspot.com/x/log.txt?x=108384f9480000
> ...
> general protection fault, probably for non-canonical address 0xdffffc0000000001: 0000 [#1] PREEMPT SMP KASAN
> KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f]
> CPU: 0 PID: 2838 Comm: kworker/u4:6 Not tainted 6.2.0-rc6-next-20230131-syzkaller-09515-g80bd9028feca #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/12/2023
> Workqueue: phy4 ieee80211_iface_work
> RIP: 0010:__skb_unlink include/linux/skbuff.h:2321 [inline]
> RIP: 0010:__skb_dequeue include/linux/skbuff.h:2337 [inline]
> RIP: 0010:skb_dequeue+0xf5/0x180 net/core/skbuff.c:3511
I don't think this is specifically related to anything networking. I've run
it a few times and weird stuff happens in various places. I'm wondering if
it's related to FOLL_PIN in some way.
The syzbot test in question does the following:
#{"repeat":true,"procs":1,"slowdown":1,"sandbox":"none","sandbox_arg":0,"netdev":true,"cgroups":true,"close_fds":true,"usb":true,"wifi":true,"sysctl":true,"tmpdir":true}
socket(0x0, 0x2, 0x0)
epoll_create(0x7)
r0 = creat(&(0x7f0000000040)='./bus\x00', 0x9)
ftruncate(r0, 0x800)
lseek(r0, 0x200, 0x2)
r1 = open(&(0x7f0000000000)='./bus\x00', 0x24000, 0x0) <-- O_DIRECT
sendfile(r0, r1, 0x0, 0x1dd00)
Basically a DIO splice from a file to itself.
I've hand-written my own much simpler tester (see attached). You need to run
at least two copies in parallel, I think, to trigger the bug. It's possible
truncate is interfering somehow.
David
---
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/wait.h>
#define file_size 0x800
#define send_size 0x1dd00
#define repeat_count 1000
int main(int argc, char *argv[])
{
int in, out, i, wt;
if (argc != 2 || !argv[1][0]) {
fprintf(stderr, "Usage: %s <file>\n", argv[0]);
exit(2);
}
for (i = 0; i < repeat_count; i++) {
switch (fork()) {
case -1:
perror("fork");
exit(1);
case 0:
out = creat(argv[1], 0666);
if (out < 0) {
perror(argv[1]);
exit(1);
}
if (ftruncate(out, file_size) < 0) {
perror("ftruncate");
exit(1);
}
if (lseek(out, file_size, SEEK_SET) < 0) {
perror("lseek");
exit(1);
}
in = open(argv[1], O_RDONLY | O_DIRECT | O_NOFOLLOW);
if (in < 0) {
perror("open");
exit(1);
}
if (sendfile(out, in, NULL, send_size) < 0) {
perror("sendfile");
exit(1);
}
exit(0);
default:
if (wait(&wt) < 0) {
perror("wait");
exit(1);
}
break;
}
}
exit(0);
}
On 02.02.23 09:52, David Howells wrote:
> Hi John, David,
>
> Could you have a look at this?
>
>> syzbot found the following issue on:
>>
>> HEAD commit: 80bd9028feca Add linux-next specific files for 20230131
>> git tree: linux-next
>> console output: https://syzkaller.appspot.com/x/log.txt?x=1468e369480000
>> kernel config: https://syzkaller.appspot.com/x/.config?x=904dc2f450eaad4a
>> dashboard link: https://syzkaller.appspot.com/bug?extid=a440341a59e3b7142895
>> compiler: gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2
>> syz repro: https://syzkaller.appspot.com/x/repro.syz?x=12c5d2be480000
>> C reproducer: https://syzkaller.appspot.com/x/repro.c?x=11259a79480000
>> ...
>> The issue was bisected to:
>>
>> commit 920756a3306a35f1c08f25207d375885bef98975
>> Author: David Howells <[email protected]>
>> Date: Sat Jan 21 12:51:18 2023 +0000
>>
>> block: Convert bio_iov_iter_get_pages to use iov_iter_extract_pages
>>
>> bisection log: https://syzkaller.appspot.com/x/bisect.txt?x=170384f9480000
>> final oops: https://syzkaller.appspot.com/x/report.txt?x=148384f9480000
>> console output: https://syzkaller.appspot.com/x/log.txt?x=108384f9480000
>> ...
>> general protection fault, probably for non-canonical address 0xdffffc0000000001: 0000 [#1] PREEMPT SMP KASAN
>> KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f]
>> CPU: 0 PID: 2838 Comm: kworker/u4:6 Not tainted 6.2.0-rc6-next-20230131-syzkaller-09515-g80bd9028feca #0
>> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/12/2023
>> Workqueue: phy4 ieee80211_iface_work
>> RIP: 0010:__skb_unlink include/linux/skbuff.h:2321 [inline]
>> RIP: 0010:__skb_dequeue include/linux/skbuff.h:2337 [inline]
>> RIP: 0010:skb_dequeue+0xf5/0x180 net/core/skbuff.c:3511
>
> I don't think this is specifically related to anything networking. I've run
> it a few times and weird stuff happens in various places. I'm wondering if
> it's related to FOLL_PIN in some way.
>
> The syzbot test in question does the following:
>
> #{"repeat":true,"procs":1,"slowdown":1,"sandbox":"none","sandbox_arg":0,"netdev":true,"cgroups":true,"close_fds":true,"usb":true,"wifi":true,"sysctl":true,"tmpdir":true}
> socket(0x0, 0x2, 0x0)
> epoll_create(0x7)
> r0 = creat(&(0x7f0000000040)='./bus\x00', 0x9)
> ftruncate(r0, 0x800)
> lseek(r0, 0x200, 0x2)
> r1 = open(&(0x7f0000000000)='./bus\x00', 0x24000, 0x0) <-- O_DIRECT
> sendfile(r0, r1, 0x0, 0x1dd00)
>
> Basically a DIO splice from a file to itself.
>
> I've hand-written my own much simpler tester (see attached). You need to run
> at least two copies in parallel, I think, to trigger the bug. It's possible
> truncate is interfering somehow.
>
> David
> ---
> #define _GNU_SOURCE
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <fcntl.h>
> #include <sys/sendfile.h>
> #include <sys/wait.h>
>
> #define file_size 0x800
> #define send_size 0x1dd00
> #define repeat_count 1000
>
> int main(int argc, char *argv[])
> {
> int in, out, i, wt;
>
> if (argc != 2 || !argv[1][0]) {
> fprintf(stderr, "Usage: %s <file>\n", argv[0]);
> exit(2);
> }
>
> for (i = 0; i < repeat_count; i++) {
> switch (fork()) {
> case -1:
> perror("fork");
> exit(1);
> case 0:
> out = creat(argv[1], 0666);
> if (out < 0) {
> perror(argv[1]);
> exit(1);
> }
>
> if (ftruncate(out, file_size) < 0) {
> perror("ftruncate");
> exit(1);
> }
>
> if (lseek(out, file_size, SEEK_SET) < 0) {
> perror("lseek");
> exit(1);
> }
>
> in = open(argv[1], O_RDONLY | O_DIRECT | O_NOFOLLOW);
> if (in < 0) {
> perror("open");
> exit(1);
> }
>
> if (sendfile(out, in, NULL, send_size) < 0) {
> perror("sendfile");
> exit(1);
> }
> exit(0);
[as raised on IRC]
At first, I wondered if that's related to shared anonymous pages getting
pinned R/O that would trigger COW-unsharing ... but I don't even see
where we are supposed to use FOLL_PIN vs. FOLL_GET here? IOW, we're not
even supposed to access user space memory (neither FOLL_GET nor
FOLL_PIN) but still end up with a change in behavior.
--
Thanks,
David / dhildenb
David Hildenbrand <[email protected]> wrote:
> At first, I wondered if that's related to shared anonymous pages getting
> pinned R/O that would trigger COW-unsharing ... but I don't even see where we
> are supposed to use FOLL_PIN vs. FOLL_GET here? IOW, we're not even supposed
> to access user space memory (neither FOLL_GET nor FOLL_PIN) but still end up
> with a change in behavior.
I'm not sure that using FOLL_PIN is part of the problem here.
sendfile() is creating a transient buffer attached to a pipe, doing a DIO read
into it (which uses iov_iter_extract_pages() to populate a bio) and then feeds
the pages from the pipe one at a time using a BVEC-type iterator to a buffered
write.
Note that iov_iter_extract_pages() does not pin/get pages when accessing a
BVEC, KVEC, XARRAY or PIPE iterator. However, in this case, this should not
matter as the pipe is holding refs on the pages in the buffer.
I have found that the issue goes away if I add an extra get_page() into
iov_iter_extract_pipe_pages() and don't release it (the page is then leaked).
This makes me think that the problem might be due to the pages getting
recycled from the pipe before DIO has finished writing to them - but that
shouldn't be the case as the splice has to be synchronous - we can't mark data
in the pipe as 'produced' until we've finished reading it.
I've also turned on CONFIG_DEBUG_PAGE_REF with a twist of my own to use a page
flag to mark the pages of interest and only trace those pages (see attached
patch) - and that consistently shows the pages being released after the bio is
cleared.
Further, when it does arise, the issue only happens when two or more copies of
the test program are run in parallel - which makes me wonder if truncate is
implicated, except that that would have to happen inside the filesystem and be
nothing to do with the pipe in the middle of sendfile().
David
--
commit 8ec9d7d951166d77e283079151b496632c290958
Author: David Howells <[email protected]>
Date: Fri Nov 13 13:21:01 2020 +0000
mm: Add a page bit to trigger page_ref tracing
Add a facility to mark a page using an extra page bit so that the page is
flagged in tracing. Also make it possible to restrict the tracepoint to
only marked pages. The mark is automatically removed when the page is
freed.
Enable the followng:
CONFIG_DEBUG_PAGE_REF
CONFIG_DEBUG_PAGE_MARK
CONFIG_DEBUG_PAGE_REF_ONLY_MARKED
and then enable the page_ref tracepoints:
echo 1 >/sys/kernel/debug/tracing/events/page_ref/enable
and it shows the fate of pages created by readahead for the netfs helpers.
This also contains some bits to turn it on. If you want to trace a
filesystem other than XFS, you need to change the magic numbers in the
patch.
Signed-off-by: David Howells <[email protected]>
diff --git a/block/bio.c b/block/bio.c
index fc57f0aa098e..69b62ebf68ed 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -20,6 +20,7 @@
#include <linux/blk-crypto.h>
#include <linux/xarray.h>
+#include <trace/events/page_ref.h>
#include <trace/events/block.h>
#include "blk.h"
#include "blk-rq-qos.h"
@@ -1174,10 +1175,17 @@ void __bio_release_pages(struct bio *bio, bool mark_dirty)
struct bio_vec *bvec;
bio_for_each_segment_all(bvec, bio, iter_all) {
- if (mark_dirty && !PageCompound(bvec->bv_page))
- set_page_dirty_lock(bvec->bv_page);
- bio_release_page(bio, bvec->bv_page);
+ if (PageDebugMark(bvec->bv_page))
+ trace_page_ref_set(bvec->bv_page, 999);
}
+
+ if (bio_flagged(bio, BIO_PAGE_REFFED) ||
+ bio_flagged(bio, BIO_PAGE_PINNED))
+ bio_for_each_segment_all(bvec, bio, iter_all) {
+ if (mark_dirty && !PageCompound(bvec->bv_page))
+ set_page_dirty_lock(bvec->bv_page);
+ bio_release_page(bio, bvec->bv_page);
+ }
}
EXPORT_SYMBOL_GPL(__bio_release_pages);
diff --git a/fs/splice.c b/fs/splice.c
index 5969b7a1d353..4c2f13f5d6d5 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -304,6 +304,7 @@ ssize_t generic_file_splice_read(struct file *in, loff_t *ppos,
int ret;
iov_iter_pipe(&to, ITER_DEST, pipe, len);
+ to.debug = true;
init_sync_kiocb(&kiocb, in);
kiocb.ki_pos = *ppos;
ret = call_read_iter(in, &kiocb, &to);
diff --git a/include/linux/bio.h b/include/linux/bio.h
index b2c09997d79c..cafa26637067 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -484,8 +484,8 @@ void zero_fill_bio(struct bio *bio);
static inline void bio_release_pages(struct bio *bio, bool mark_dirty)
{
- if (bio_flagged(bio, BIO_PAGE_REFFED) ||
- bio_flagged(bio, BIO_PAGE_PINNED))
+ //if (bio_flagged(bio, BIO_PAGE_REFFED) ||
+ // bio_flagged(bio, BIO_PAGE_PINNED))
__bio_release_pages(bio, mark_dirty);
}
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 69e93a0c1277..80cbf784239e 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -138,6 +138,9 @@ enum pageflags {
#endif
#ifdef CONFIG_KASAN_HW_TAGS
PG_skip_kasan_poison,
+#endif
+#ifdef CONFIG_DEBUG_PAGE_MARK
+ PG_debug_mark,
#endif
__NR_PAGEFLAGS,
@@ -694,6 +697,15 @@ static __always_inline bool PageKsm(struct page *page)
TESTPAGEFLAG_FALSE(Ksm, ksm)
#endif
+#ifdef CONFIG_DEBUG_PAGE_MARK
+/*
+ * Debug marks are just used for page_ref tracepoint control and display.
+ */
+PAGEFLAG(DebugMark, debug_mark, PF_ANY)
+#else
+TESTPAGEFLAG_FALSE(DebugMark, debug_mark)
+#endif
+
u64 stable_page_flags(struct page *page);
/**
diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
index d7c2d33baa7f..7bc1a94d9cbb 100644
--- a/include/linux/page_ref.h
+++ b/include/linux/page_ref.h
@@ -24,7 +24,11 @@ DECLARE_TRACEPOINT(page_ref_unfreeze);
*
* See trace_##name##_enabled(void) in include/linux/tracepoint.h
*/
-#define page_ref_tracepoint_active(t) tracepoint_enabled(t)
+#ifndef CONFIG_DEBUG_PAGE_REF_ONLY_MARKED
+#define page_ref_tracepoint_active(p, t) tracepoint_enabled(t)
+#else
+#define page_ref_tracepoint_active(p, t) (tracepoint_enabled(t) && PageDebugMark(p))
+#endif
extern void __page_ref_set(struct page *page, int v);
extern void __page_ref_mod(struct page *page, int v);
@@ -36,7 +40,7 @@ extern void __page_ref_unfreeze(struct page *page, int v);
#else
-#define page_ref_tracepoint_active(t) false
+#define page_ref_tracepoint_active(page, t) false
static inline void __page_ref_set(struct page *page, int v)
{
@@ -97,7 +101,7 @@ static inline int page_count(const struct page *page)
static inline void set_page_count(struct page *page, int v)
{
atomic_set(&page->_refcount, v);
- if (page_ref_tracepoint_active(page_ref_set))
+ if (page_ref_tracepoint_active(page, page_ref_set))
__page_ref_set(page, v);
}
@@ -118,7 +122,7 @@ static inline void init_page_count(struct page *page)
static inline void page_ref_add(struct page *page, int nr)
{
atomic_add(nr, &page->_refcount);
- if (page_ref_tracepoint_active(page_ref_mod))
+ if (page_ref_tracepoint_active(page, page_ref_mod))
__page_ref_mod(page, nr);
}
@@ -130,7 +134,7 @@ static inline void folio_ref_add(struct folio *folio, int nr)
static inline void page_ref_sub(struct page *page, int nr)
{
atomic_sub(nr, &page->_refcount);
- if (page_ref_tracepoint_active(page_ref_mod))
+ if (page_ref_tracepoint_active(page, page_ref_mod))
__page_ref_mod(page, -nr);
}
@@ -143,7 +147,7 @@ static inline int page_ref_sub_return(struct page *page, int nr)
{
int ret = atomic_sub_return(nr, &page->_refcount);
- if (page_ref_tracepoint_active(page_ref_mod_and_return))
+ if (page_ref_tracepoint_active(page, page_ref_mod_and_return))
__page_ref_mod_and_return(page, -nr, ret);
return ret;
}
@@ -156,7 +160,7 @@ static inline int folio_ref_sub_return(struct folio *folio, int nr)
static inline void page_ref_inc(struct page *page)
{
atomic_inc(&page->_refcount);
- if (page_ref_tracepoint_active(page_ref_mod))
+ if (page_ref_tracepoint_active(page, page_ref_mod))
__page_ref_mod(page, 1);
}
@@ -168,7 +172,7 @@ static inline void folio_ref_inc(struct folio *folio)
static inline void page_ref_dec(struct page *page)
{
atomic_dec(&page->_refcount);
- if (page_ref_tracepoint_active(page_ref_mod))
+ if (page_ref_tracepoint_active(page, page_ref_mod))
__page_ref_mod(page, -1);
}
@@ -181,7 +185,7 @@ static inline int page_ref_sub_and_test(struct page *page, int nr)
{
int ret = atomic_sub_and_test(nr, &page->_refcount);
- if (page_ref_tracepoint_active(page_ref_mod_and_test))
+ if (page_ref_tracepoint_active(page, page_ref_mod_and_test))
__page_ref_mod_and_test(page, -nr, ret);
return ret;
}
@@ -195,7 +199,7 @@ static inline int page_ref_inc_return(struct page *page)
{
int ret = atomic_inc_return(&page->_refcount);
- if (page_ref_tracepoint_active(page_ref_mod_and_return))
+ if (page_ref_tracepoint_active(page, page_ref_mod_and_return))
__page_ref_mod_and_return(page, 1, ret);
return ret;
}
@@ -209,7 +213,7 @@ static inline int page_ref_dec_and_test(struct page *page)
{
int ret = atomic_dec_and_test(&page->_refcount);
- if (page_ref_tracepoint_active(page_ref_mod_and_test))
+ if (page_ref_tracepoint_active(page, page_ref_mod_and_test))
__page_ref_mod_and_test(page, -1, ret);
return ret;
}
@@ -223,7 +227,7 @@ static inline int page_ref_dec_return(struct page *page)
{
int ret = atomic_dec_return(&page->_refcount);
- if (page_ref_tracepoint_active(page_ref_mod_and_return))
+ if (page_ref_tracepoint_active(page, page_ref_mod_and_return))
__page_ref_mod_and_return(page, -1, ret);
return ret;
}
@@ -237,7 +241,7 @@ static inline bool page_ref_add_unless(struct page *page, int nr, int u)
{
bool ret = atomic_add_unless(&page->_refcount, nr, u);
- if (page_ref_tracepoint_active(page_ref_mod_unless))
+ if (page_ref_tracepoint_active(page, page_ref_mod_unless))
__page_ref_mod_unless(page, nr, ret);
return ret;
}
@@ -317,7 +321,7 @@ static inline int page_ref_freeze(struct page *page, int count)
{
int ret = likely(atomic_cmpxchg(&page->_refcount, count, 0) == count);
- if (page_ref_tracepoint_active(page_ref_freeze))
+ if (page_ref_tracepoint_active(page, page_ref_freeze))
__page_ref_freeze(page, count, ret);
return ret;
}
@@ -333,7 +337,7 @@ static inline void page_ref_unfreeze(struct page *page, int count)
VM_BUG_ON(count == 0);
atomic_set_release(&page->_refcount, count);
- if (page_ref_tracepoint_active(page_ref_unfreeze))
+ if (page_ref_tracepoint_active(page, page_ref_unfreeze))
__page_ref_unfreeze(page, count);
}
diff --git a/include/linux/uio.h b/include/linux/uio.h
index 514e3b7b06b8..89272c05d74d 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -45,6 +45,7 @@ struct iov_iter {
bool nofault;
bool data_source;
bool user_backed;
+ bool debug;
union {
size_t iov_offset;
int last_offset;
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 412b5a46374c..5f3b9b0e4b53 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -103,6 +103,12 @@
#define IF_HAVE_PG_SKIP_KASAN_POISON(flag,string)
#endif
+#ifdef CONFIG_DEBUG_PAGE_MARK
+#define IF_HAVE_PG_DEBUG_MARK(flag,string) ,{1UL << flag, string}
+#else
+#define IF_HAVE_PG_DEBUG_MARK(flag,string)
+#endif
+
#define __def_pageflag_names \
{1UL << PG_locked, "locked" }, \
{1UL << PG_waiters, "waiters" }, \
@@ -132,7 +138,8 @@ IF_HAVE_PG_IDLE(PG_young, "young" ) \
IF_HAVE_PG_IDLE(PG_idle, "idle" ) \
IF_HAVE_PG_ARCH_X(PG_arch_2, "arch_2" ) \
IF_HAVE_PG_ARCH_X(PG_arch_3, "arch_3" ) \
-IF_HAVE_PG_SKIP_KASAN_POISON(PG_skip_kasan_poison, "skip_kasan_poison")
+IF_HAVE_PG_SKIP_KASAN_POISON(PG_skip_kasan_poison, "skip_kasan_poison") \
+IF_HAVE_PG_DEBUG_MARK(PG_debug_mark, "debug_mark" )
#define show_page_flags(flags) \
(flags) ? __print_flags(flags, "|", \
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index d69a05950555..7ac030208a2c 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -10,9 +10,11 @@
#include <linux/vmalloc.h>
#include <linux/splice.h>
#include <linux/compat.h>
+#include <linux/page-flags.h>
#include <net/checksum.h>
#include <linux/scatterlist.h>
#include <linux/instrumented.h>
+#include <trace/events/page_ref.h>
#define PIPE_PARANOIA /* for now */
@@ -1331,6 +1333,10 @@ static ssize_t pipe_get_pages(struct iov_iter *i,
struct page *page = append_pipe(i, left, &off);
if (!page)
break;
+ if (i->debug && !PageDebugMark(page)) {
+ //SetPageDebugMark(page);
+ get_page(page);
+ }
chunk = min_t(size_t, left, PAGE_SIZE - off);
get_page(*p++ = page);
}
@@ -1955,6 +1961,11 @@ static ssize_t iov_iter_extract_pipe_pages(struct iov_iter *i,
struct page *page = append_pipe(i, left, &offset);
if (!page)
break;
+ if (i->debug && !PageDebugMark(page)) {
+ SetPageDebugMark(page);
+ trace_page_ref_set(page, 888);
+ //get_page(page);
+ }
chunk = min_t(size_t, left, PAGE_SIZE - offset);
left -= chunk;
*p++ = page;
diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
index fca699ad1fb0..111a946a676f 100644
--- a/mm/Kconfig.debug
+++ b/mm/Kconfig.debug
@@ -149,6 +149,23 @@ config DEBUG_PAGE_REF
kernel code. However the runtime performance overhead is virtually
nil until the tracepoints are actually enabled.
+config DEBUG_PAGE_MARK
+ bool "Reserve a page bit to mark pages to be debugged"
+ depends on DEBUG_PAGE_REF
+ help
+ This option adds an extra page flag that can be used to mark pages
+ for debugging. The mark can be observed in the page_ref tracepoints.
+ The mark isn't set on any pages without alteration of the code. This
+ is intended for filesystem debugging and code to set the mark must be
+ added manually into the source.
+
+config DEBUG_PAGE_REF_ONLY_MARKED
+ bool "Only trace marked pages"
+ depends on DEBUG_PAGE_REF && DEBUG_PAGE_MARK
+ help
+ This option restricts the page_ref tracepoints to only track marked
+ pages.
+
config DEBUG_RODATA_TEST
bool "Testcase for the marking rodata read-only"
depends on STRICT_KERNEL_RWX
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0745aedebb37..37f146e5b2eb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1102,6 +1102,9 @@ static inline void __free_one_page(struct page *page,
VM_BUG_ON(!zone_is_initialized(zone));
VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);
+#ifdef CONFIG_DEBUG_PAGE_MARK
+ ClearPageDebugMark(page);
+#endif
VM_BUG_ON(migratetype == -1);
if (likely(!is_migrate_isolate(migratetype)))
diff --git a/mm/readahead.c b/mm/readahead.c
index b10f0cf81d80..c5558daf3a56 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -248,6 +248,10 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
folio = filemap_alloc_folio(gfp_mask, 0);
if (!folio)
break;
+#define XFS_SUPER_MAGIC 0x58465342 /* "XFSB" */
+ if (mapping->host->i_sb->s_magic == XFS_SUPER_MAGIC)
+ folio_set_debug_mark(folio);
+
if (filemap_add_folio(mapping, folio, index + i,
gfp_mask) < 0) {
folio_put(folio);
@@ -809,6 +813,7 @@ void readahead_expand(struct readahead_control *ractl,
page = __page_cache_alloc(gfp_mask);
if (!page)
return;
+ //SetPageDebugMark(page);
if (add_to_page_cache_lru(page, mapping, index, gfp_mask) < 0) {
put_page(page);
return;
@@ -832,6 +837,7 @@ void readahead_expand(struct readahead_control *ractl,
page = __page_cache_alloc(gfp_mask);
if (!page)
return;
+ //SetPageDebugMark(page);
if (add_to_page_cache_lru(page, mapping, index, gfp_mask) < 0) {
put_page(page);
return;
On 2/2/23 00:52, David Howells wrote:
> Hi John, David,
>
> Could you have a look at this?
Sure. So far, I have reproduced a crash using your simplified test
program (it required three simulaneous running copies), and will look
deeper now.
In case it illuminates anything, the crash looked like this (below), and
was obtained *without* setting KASAN. Also a minor point: this is from a
git branch of the last commit in the series (commit fd20d0c1852e "block:
convert bio_map_user_iov to use iov_iter_extract_pages"), rather than
from top of linux-next.
Kernel panic - not syncing: corrupted stack end detected inside scheduler
CPU: 2 PID: 27177 Comm: syzbot_howells Not tainted 6.2.0-rc5-hubbard-github+ #3
Hardware name: ASUS X299-A/PRIME X299-A, BIOS 1503 08/03/2018
Call Trace:
<TASK>
dump_stack_lvl+0x4c/0x63
panic+0x113/0x2c4
? folio_wait_bit_common+0xf6/0x360
__schedule+0xd1b/0xd20
schedule+0x5d/0xe0
io_schedule+0x42/0x70
folio_wait_bit_common+0x123/0x360
? __pfx_wake_page_function+0x10/0x10
folio_wait_writeback+0x24/0x100
__filemap_fdatawait_range+0x7a/0x120
? filemap_fdatawrite_wbc+0x69/0x80
? __filemap_fdatawrite_range+0x58/0x80
filemap_write_and_wait_range+0x84/0xb0
__iomap_dio_rw+0x183/0x830
? __lock_acquire+0x3b4/0x2620
iomap_dio_rw+0xe/0x40
ext4_file_read_iter+0x141/0x1c0
generic_file_splice_read+0x90/0x160
splice_direct_to_actor+0xb1/0x210
? __pfx_direct_splice_actor+0x10/0x10
do_splice_direct+0x8c/0xd0
do_sendfile+0x352/0x600
do_syscall_64+0x37/0x90
entry_SYSCALL_64_after_hwframe+0x72/0xdc
RIP: 0033:0x7f322d5116be
Code: c3 0f 1f 00 4c 89 d2 4c 89 c6 e9 fd fd ff ff 0f 1f 44 00 00 31 c0 c3 0f 1f 44 00 00 f3 0f 1e fa 49 89 ca b8 28 00 00 00 0f 05 <48> 3d 01 f0 ff ff8
RSP: 002b:00007ffd8c914538 EFLAGS: 00000202 ORIG_RAX: 0000000000000028
RAX: ffffffffffffffda RBX: 00007ffd8c914678 RCX: 00007f322d5116be
RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000000000000003
RBP: 0000000000000003 R08: 0000000000000001 R09: 00007f322d7f6740
R10: 000000000001dd00 R11: 0000000000000202 R12: 0000000000000000
R13: 00007ffd8c914690 R14: 0000558a11e29d78 R15: 00007f322d843020
</TASK>
thanks,
--
John Hubbard
NVIDIA
On 2/2/23 07:15, David Howells wrote:
> David Hildenbrand <[email protected]> wrote:
>
>> At first, I wondered if that's related to shared anonymous pages getting
>> pinned R/O that would trigger COW-unsharing ... but I don't even see where we
>> are supposed to use FOLL_PIN vs. FOLL_GET here? IOW, we're not even supposed
>> to access user space memory (neither FOLL_GET nor FOLL_PIN) but still end up
>> with a change in behavior.
>
> I'm not sure that using FOLL_PIN is part of the problem here.
I agree. It's really not.
>
> sendfile() is creating a transient buffer attached to a pipe, doing a DIO read
> into it (which uses iov_iter_extract_pages() to populate a bio) and then feeds
> the pages from the pipe one at a time using a BVEC-type iterator to a buffered
> write.
>
> Note that iov_iter_extract_pages() does not pin/get pages when accessing a
> BVEC, KVEC, XARRAY or PIPE iterator. However, in this case, this should not
> matter as the pipe is holding refs on the pages in the buffer.
>
> I have found that the issue goes away if I add an extra get_page() into
> iov_iter_extract_pipe_pages() and don't release it (the page is then leaked).
>
> This makes me think that the problem might be due to the pages getting
> recycled from the pipe before DIO has finished writing to them - but that
> shouldn't be the case as the splice has to be synchronous - we can't mark data
> in the pipe as 'produced' until we've finished reading it.
So I thought about this for a while, and one really big glaring point
that stands out for me is: before this commit, we had this:
iov_iter_get_pages()
__iov_iter_get_pages_alloc()
pipe_get_pages()
get_page()
But now, based on the claim from various folks that "pipe cases don't
require a get_page()", we have boldly--too boldy, I believe-- moved
directly into case that doesn't do a get_page():
iov_iter_extract_pipe_pages()
...(nothing)
And your testing backs this up: adding the get_page() back hides the
failure.
So that's as far as I've got, but I am really suspecting that this is
where the root cause is: pipe references are not as locked down as we
think they are. At least in this case.
thanks,
--
John Hubbard
NVIDIA
Hello,
syzbot has tested the proposed patch and the reproducer did not trigger any issue:
Reported-and-tested-by: [email protected]
Tested on:
commit: e7b5bbb5 iov_iter: Kill ITER_PIPE
git tree: https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/ iov-fixes
console output: https://syzkaller.appspot.com/x/log.txt?x=129d541f480000
kernel config: https://syzkaller.appspot.com/x/.config?x=318265a8e536ca3f
dashboard link: https://syzkaller.appspot.com/bug?extid=a440341a59e3b7142895
compiler: gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2
Note: no patches were applied.
Note: testing is done by a robot and is best-effort only.