Date: Tue, 13 Sep 2022 11:39:05 -0700
From: Zach O'Keefe
To: Yang Shi
Cc: syzbot, akpm@linux-foundation.org, andrii@kernel.org, ast@kernel.org,
 bigeasy@linutronix.de, bpf@vger.kernel.org, brauner@kernel.org,
 daniel@iogearbox.net, david@redhat.com, ebiederm@xmission.com,
 john.fastabend@gmail.com, kafai@fb.com, kpsingh@kernel.org,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org, luto@kernel.org,
 netdev@vger.kernel.org, songliubraving@fb.com,
 syzkaller-bugs@googlegroups.com, tglx@linutronix.de, yhs@fb.com
Subject: Re: [syzbot] BUG: Bad page map (5)
References: <000000000000f537cc05ddef88db@google.com>
 <0000000000007d793405e87350df@google.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Sep 13 09:14, Yang Shi wrote:
> On Mon, Sep 12, 2022 at 2:47 PM Yang Shi wrote:
> >
> > On Sun, Sep 11, 2022 at 9:27 PM syzbot wrote:
> > >
> > > syzbot has found a reproducer for the following issue on:
> > >
> > > HEAD commit:    e47eb90a0a9a Add linux-next specific files for 20220901
> > > git tree:       linux-next
> > > console+strace: https://syzkaller.appspot.com/x/log.txt?x=17330430880000
> > > kernel config:  https://syzkaller.appspot.com/x/.config?x=7933882276523081
> > > dashboard link: https://syzkaller.appspot.com/bug?extid=915f3e317adb0e85835f
> > > compiler:       gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2
> > > syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=13397b77080000
> > > C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=1793564f080000
> > >
> > > IMPORTANT: if you fix the issue, please add the following tag to the commit:
> > > Reported-by: syzbot+915f3e317adb0e85835f@syzkaller.appspotmail.com
> > >
> > > BUG: Bad page map in process syz-executor198 pte:8000000071c00227 pmd:74b30067
> > > addr:0000000020563000 vm_flags:08100077 anon_vma:ffff8880547d2200 mapping:0000000000000000 index:20563
> > > file:(null) fault:0x0 mmap:0x0 read_folio:0x0
> > > CPU: 1 PID: 3614 Comm: syz-executor198 Not tainted 6.0.0-rc3-next-20220901-syzkaller #0
> > > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/26/2022
> > > Call Trace:
> > >
> > >  __dump_stack lib/dump_stack.c:88 [inline]
> > >  dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
> > >  print_bad_pte.cold+0x2a7/0x2d0 mm/memory.c:565
> > >  vm_normal_page+0x10c/0x2a0 mm/memory.c:636
> > >  hpage_collapse_scan_pmd+0x729/0x1da0 mm/khugepaged.c:1199
> > >  madvise_collapse+0x481/0x910 mm/khugepaged.c:2433
> > >  madvise_vma_behavior+0xd0a/0x1cc0 mm/madvise.c:1062
> > >  madvise_walk_vmas+0x1c7/0x2b0 mm/madvise.c:1236
> > >  do_madvise.part.0+0x24a/0x340 mm/madvise.c:1415
> > >  do_madvise mm/madvise.c:1428 [inline]
> > >  __do_sys_madvise mm/madvise.c:1428 [inline]
> > >  __se_sys_madvise mm/madvise.c:1426 [inline]
> > >  __x64_sys_madvise+0x113/0x150 mm/madvise.c:1426
> > >  do_syscall_x64 arch/x86/entry/common.c:50 [inline]
> > >  do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
> > >  entry_SYSCALL_64_after_hwframe+0x63/0xcd
> > > RIP: 0033:0x7f770ba87929
> > > Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 11 15 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
> > > RSP: 002b:00007f770ba18308 EFLAGS: 00000246 ORIG_RAX: 000000000000001c
> > > RAX: ffffffffffffffda RBX: 00007f770bb0f3f8 RCX: 00007f770ba87929
> > > RDX: 0000000000000019 RSI: 0000000000600003 RDI: 0000000020000000
> > > RBP: 00007f770bb0f3f0 R08: 00007f770ba18700 R09: 0000000000000000
> > > R10: 00007f770ba18700 R11: 0000000000000246 R12: 00007f770bb0f3fc
> > > R13: 00007ffc2d8b62ef R14: 00007f770ba18400 R15: 0000000000022000
> > >
> >
> > I think I figured out the problem. The reproducer actually triggered
> > the below race in madvise_collapse():
> >
> >              CPU A                                 CPU B
> > mmap 0x20000000 - 0x21000000 as anon
> >
> > madvise_collapse is called on this area
> >
> >   Retrieve start and end address from the vma (NEVER updated later!)
> >
> >   Collapsed the first 2M area and dropped mmap_lock
> >                                                    Acquire mmap_lock
> >                                                    mmap io_uring file at 0x20563000
> >                                                    Release mmap_lock
> >
> >   Reacquire mmap_lock
> >
> >   revalidate vma pass since 0x20200000 + 0x200000 > 0x20563000
> >
> >   scan the next 2M (0x20200000 - 0x20400000), but due to
> >   whatever reason it didn't release mmap_lock
> >
> >   scan the 3rd 2M area (start from 0x20400000)
> >
> >   actually scan the new vma created by io_uring since the
> >   end was never updated
> >
> > The below patch should be able to fix the problem (untested):
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 5f7c60b8b269..e708c5d62325 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -2441,8 +2441,10 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
> >  		memset(cc->node_load, 0, sizeof(cc->node_load));
> >  		result = hpage_collapse_scan_pmd(mm, vma, addr, &mmap_locked,
> >  						 cc);
> > -		if (!mmap_locked)
> > +		if (!mmap_locked) {
> >  			*prev = NULL;  /* Tell caller we dropped mmap_lock */
> > +			hend = vma->end & HPAGE_PMD_MASK;
> > +		}
>
> This is wrong. We should refetch the vma end after
> hugepage_vma_revalidate() otherwise the vma is still the old one.
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index a3acd3e5e0f3..1860be232a26 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -2592,6 +2592,8 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
>  				last_fail = result;
>  				goto out_nolock;
>  			}
> +
> +			hend = vma->vm_end & HPAGE_PMD_MASK;
>  		}
>  		mmap_assert_locked(mm);
>  		memset(cc->node_load, 0, sizeof(cc->node_load));
>
> >
> >  		switch (result) {
> >  		case SCAN_SUCCEED:
> >
> >

Hey Yang,

Thanks for triaging this, and apologies for intro'ing this bug. Also thank
you for the repro explanation - I believe you are correct here.
Generalizing the issue of:

	1) hugepage_vma_revalidate() pmd X
	2) collapse of pmd X doesn't drop mmap_lock
	3) don't revalidate pmd X+1
	4) attempt collapse of pmd X+1

I think the only problem is that the transhuge_vma_suitable() check in
hugepage_vma_revalidate() only checks if a single hugepage-sized/aligned
region properly fits / is aligned in the VMA (i.e. the issue you found here).
All other checks should be intrinsic to the VMA itself and should be safe to
skip if mmap_lock isn't dropped since the last hugepage_vma_revalidate().

As for the fix, I think your fix will work. If a VMA's size changes inside
the main for-loop of madvise_collapse(), then at some point we will lock
mmap_lock and call hugepage_vma_revalidate(), which might fail itself if the
next hugepage-aligned/sized region is now not contained in the VMA. By
updating "hend" as you propose (i.e. using vma->vm_end of the just-found
VMA), we also ensure that for "addr" < "hend", the hugepage-aligned/sized
region at "addr" will fit into the VMA. Note that we don't need to worry
about the VMA being shrunk from the other direction, so updating "hend"
should be enough.

I think the fix is fine as-is. I briefly thought a comment would be nice, but
I think the code is self-evident. The alternative is introducing another
transhuge_vma_suitable() call in the "if (!mmap_locked) { .. } else { .. }"
failure path, but I think your approach is easier to read.

Thanks again for taking the time to debug this, and hopefully I can be more
careful in the future.

Best,
Zach

Reviewed-by: Zach O'Keefe