Date: Tue, 16 Apr 2024 10:25:58 +0200
From: Christian Brauner
To: Björn Töpel, Nam Cao, Mike Rapoport
Cc: Andreas Dilger, Al Viro, linux-fsdevel, Jan Kara,
	Linux Kernel Mailing List, linux-riscv@lists.infradead.org,
	Theodore Ts'o, Ext4 Developers List, Conor Dooley,
	"Matthew Wilcox (Oracle)", Anders Roxell
Subject: Re: riscv32 EXT4 splat, 6.8 regression?
Message-ID: <20240416-deppen-gasleitung-8098fcfd6bbd@brauner>
In-Reply-To: <87le5e393x.fsf@all.your.base.are.belong.to.us>
	<20240416084417.569356d3@namcao>

[Adding Mike who's knowledgeable in this area]

On Mon, Apr 15, 2024 at 06:04:50PM +0200, Björn Töpel wrote:
> Christian Brauner writes:
>
> > On Sun, Apr 14, 2024 at 04:08:11PM +0200, Björn Töpel wrote:
> >> Andreas Dilger writes:
> >>
> >> > On Apr 13, 2024, at 8:15 PM, Al Viro wrote:
> >> >>
> >> >> On Sat, Apr 13, 2024 at 07:46:03PM -0600, Andreas Dilger wrote:
> >> >>
> >> >>> As to whether the 0xfffff000 address itself is valid for riscv32 is
> >> >>> outside my realm, but given that RAM is cheap it doesn't seem unlikely
> >> >>> to have 4GB+ of RAM and want to use it all. The riscv32 might consider
> >> >>> reserving this page address from allocation to avoid similar issues in
> >> >>> other parts of the code, as is done with the NULL/0 page address.
> >> >>
> >> >> Not a chance. *Any* page mapped there is a serious bug on any 32bit
> >> >> box. Recall what ERR_PTR() is...
> >> >>
> >> >> On any architecture the virtual addresses in range (unsigned long)-512..
> >> >> (unsigned long)-1 must never resolve to valid kernel objects.
> >> >> In other words, any kind of wraparound here is asking for an oops on
> >> >> attempts to access the elements of buffer - kernel dereference of
> >> >> (char *)0xfffff000 on a 32bit box is already a bug.
> >> >>
> >> >> It might be getting an invalid pointer, but arithmetical overflows
> >> >> are irrelevant.
> >> >
> >> > The original bug report stated that search_buf = 0xfffff000 on entry,
> >> > and I'd quoted that at the start of my email:
> >> >
> >> > On Apr 12, 2024, at 8:57 AM, Björn Töpel wrote:
> >> >> What I see in ext4_search_dir() is that search_buf is 0xfffff000, and at
> >> >> some point the address wraps to zero, and boom. I doubt that 0xfffff000
> >> >> is a sane address.
> >> >
> >> > Now that you mention ERR_PTR() it definitely makes sense that this last
> >> > page HAS to be excluded.
> >> >
> >> > So some other bug is passing the bad pointer to this code before this
> >> > error, or the arch is not correctly excluding this page from allocation.
> >>
> >> Yeah, something is off for sure.
> >>
> >> (FWIW, I manage to hit this for Linus' master as well.)
> >>
> >> I added a print (close to trace_mm_filemap_add_to_page_cache()), and for
> >> this BT:
> >>
> >>   [] __filemap_add_folio+0x322/0x508
> >>   [] filemap_add_folio+0x54/0xce
> >>   [] __filemap_get_folio+0x156/0x2aa
> >>   [] __getblk_slow+0xcc/0x302
> >>   [] bdev_getblk+0x76/0x7a
> >>   [] ext4_getblk+0xbc/0x2c4
> >>   [] ext4_bread_batch+0x56/0x186
> >>   [] __ext4_find_entry+0x156/0x578
> >>   [] ext4_lookup+0x86/0x1f4
> >>   [] __lookup_slow+0x8e/0x142
> >>   [] walk_component+0x104/0x174
> >>   [] path_lookupat+0x78/0x182
> >>   [] filename_lookup+0x96/0x158
> >>   [] kern_path+0x38/0x56
> >>   [] init_mount+0x5c/0xac
> >>   [] devtmpfs_mount+0x44/0x7a
> >>   [] prepare_namespace+0x226/0x27c
> >>   [] kernel_init_freeable+0x286/0x2a8
> >>   [] kernel_init+0x2a/0x156
> >>   [] ret_from_fork+0xe/0x20
> >>
> >> I get a folio where folio_address(folio) == 0xfffff000 (which is
> >> broken).
> >>
> >> Need to go into the weeds here...
> >
> > I don't see anything obvious that could explain this right away. Did you
> > manage to reproduce this on any other architecture and/or filesystem?
> >
> > Fwiw, iirc there were a bunch of fs/buffer.c changes that came in
> > through the mm/ layer between v6.7 and v6.8 that might also be
> > interesting. But really I'm poking in the dark currently.
>
> Thanks for getting back! Spent some more time on it today.
>
> It seems that the buddy allocator *can* return a page with a VA that can
> wrap (0xfffff000 -- pointed out by Nam and myself).
>
> Further, it seems like riscv32 indeed inserts a page like that to the
> buddy allocator, when the memblock is free'd:
>
>   | [] __free_one_page+0x2a4/0x3ea
>   | [] __free_pages_ok+0x158/0x3cc
>   | [] __free_pages_core+0xe8/0x12c
>   | [] memblock_free_pages+0x1a/0x22
>   | [] memblock_free_all+0x1ee/0x278
>   | [] mem_init+0x10/0xa4
>   | [] mm_core_init+0x11a/0x2da
>   | [] start_kernel+0x3c4/0x6de
>
> Here, a page with VA 0xfffff000 is added to the freelist. We were just
> lucky (unlucky?) that page was used for the page cache.
>
> A nasty patch like:
> --8<--
> diff --git a/mm/mm_init.c b/mm/mm_init.c
> index 549e76af8f82..a6a6abbe71b0 100644
> --- a/mm/mm_init.c
> +++ b/mm/mm_init.c
> @@ -2566,6 +2566,9 @@ void __init set_dma_reserve(unsigned long new_dma_reserve)
>  void __init memblock_free_pages(struct page *page, unsigned long pfn,
>  							unsigned int order)
>  {
> +	if ((long)page_address(page) == 0xfffff000L) {
> +		return; // leak it
> +	}
>  
>  	if (IS_ENABLED(CONFIG_DEFERRED_STRUCT_PAGE_INIT)) {
>  		int nid = early_pfn_to_nid(pfn);
> --8<--
>
> ...and it's gone.
>
> I need to think more about what a proper fix is. Regardless; Christian,
> Al, and Ted can all relax. ;-)
>
>
> Björn

On Tue, Apr 16, 2024 at 08:44:17AM +0200, Nam Cao wrote:
> On 2024-04-15 Björn Töpel wrote:
> > Thanks for getting back! Spent some more time on it today.
> >
> > It seems that the buddy allocator *can* return a page with a VA that can
> > wrap (0xfffff000 -- pointed out by Nam and myself).
> >
> > Further, it seems like riscv32 indeed inserts a page like that to the
> > buddy allocator, when the memblock is free'd:
> >
> >   | [] __free_one_page+0x2a4/0x3ea
> >   | [] __free_pages_ok+0x158/0x3cc
> >   | [] __free_pages_core+0xe8/0x12c
> >   | [] memblock_free_pages+0x1a/0x22
> >   | [] memblock_free_all+0x1ee/0x278
> >   | [] mem_init+0x10/0xa4
> >   | [] mm_core_init+0x11a/0x2da
> >   | [] start_kernel+0x3c4/0x6de
> >
> > Here, a page with VA 0xfffff000 is added to the freelist. We were just
> > lucky (unlucky?) that page was used for the page cache.
>
> I just educated myself about memory mapping last night, so the below
> may be complete nonsense. Take it with a grain of salt.
>
> In riscv's setup_bootmem(), we have this line:
> 	max_low_pfn = max_pfn = PFN_DOWN(phys_ram_end);
>
> I think this is the root cause: max_low_pfn indicates the last page
> to be mapped. Problem is: nothing prevents PFN_DOWN(phys_ram_end) from
> getting mapped to the last page (0xfffff000). If max_low_pfn is mapped
> to the last page, we get the reported problem.
>
> There seems to be some code to make sure the last page is not used
> (the call to memblock_set_current_limit() right above this line). It is
> unclear to me why this still lets the problem slip through.
>
> The fix is simple: never let max_low_pfn get mapped to the last page.
> The below patch fixes the problem for me. But I am not entirely sure if
> this is the correct fix, further investigation needed.
>
> Best regards,
> Nam
>
> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> index fa34cf55037b..17cab0a52726 100644
> --- a/arch/riscv/mm/init.c
> +++ b/arch/riscv/mm/init.c
> @@ -251,7 +251,8 @@ static void __init setup_bootmem(void)
>  	}
>  
>  	min_low_pfn = PFN_UP(phys_ram_base);
> -	max_low_pfn = max_pfn = PFN_DOWN(phys_ram_end);
> +	max_low_pfn = PFN_DOWN(memblock_get_current_limit());
> +	max_pfn = PFN_DOWN(phys_ram_end);
>  	high_memory = (void *)(__va(PFN_PHYS(max_low_pfn)));
>  
>  	dma32_phys_limit = min(4UL * SZ_1G, (unsigned long)PFN_PHYS(max_low_pfn));
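
To make the arithmetic in this thread concrete, here is a minimal userspace
sketch of why a page at VA 0xfffff000 is trouble on a 32-bit box. It is not
kernel code. All names in it (PAGE_SIZE_32, MAX_ERRNO_32, is_err_value32(),
page_end32(), page_overlaps_err_range32(), linear_va32(), check_model()) are
made up for illustration. The memory layout (1 GiB of RAM at 0x80000000,
linear map at 0xc0000000) is an assumption and is not taken from the reports
above. The only kernel facts it relies on are that MAX_ERRNO is 4095 and that
IS_ERR_VALUE() is an unsigned compare against -MAX_ERRNO.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative constants; the names are invented for this sketch. */
#define PAGE_SIZE_32 0x1000u
#define MAX_ERRNO_32 4095u /* same value as the kernel's MAX_ERRNO */

/* Model of IS_ERR_VALUE() with a 32-bit unsigned long. */
static int is_err_value32(uint32_t addr)
{
	return addr >= (uint32_t)(0u - MAX_ERRNO_32);
}

/* One-past-the-end address of the page at 'base'; wraps modulo 2^32. */
static uint32_t page_end32(uint32_t base)
{
	return base + PAGE_SIZE_32;
}

/* Does any byte of the page at 'base' alias an ERR_PTR() value? */
static int page_overlaps_err_range32(uint32_t base)
{
	return is_err_value32(base + (PAGE_SIZE_32 - 1u));
}

/* Linear-map model: VA = page_offset + (pa - phys_base), modulo 2^32. */
static uint32_t linear_va32(uint32_t pa, uint32_t phys_base,
			    uint32_t page_offset)
{
	return page_offset + (pa - phys_base);
}

/* Self-check of the values discussed in the thread. */
void check_model(void)
{
	/* Assumed layout: 1 GiB of RAM at 0x80000000, linear map at 0xc0000000. */
	uint32_t phys_base = 0x80000000u, page_offset = 0xc0000000u;
	uint32_t last_page_pa = phys_base + 0x40000000u - PAGE_SIZE_32;
	uint32_t va = linear_va32(last_page_pa, phys_base, page_offset);

	assert(va == 0xfffff000u);              /* last RAM page lands on the last VA page */
	assert(page_end32(va) == 0u);           /* buffer end wraps to zero */
	assert(page_overlaps_err_range32(va));  /* page aliases ERR_PTR() values */
	assert(!page_overlaps_err_range32(va - PAGE_SIZE_32)); /* the page below is fine */
}
```

Calling check_model() from a small main() should pass silently. Under the
assumed layout, the last page of RAM lands exactly on the VA that Björn saw,
a "base + size" limit computed on it wraps to zero, and all but its first
byte overlaps the ERR_PTR() range.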