Date: Wed, 5 Apr 2023 16:18:58 +0200
From: Vlastimil Babka
To: Mel Gorman
Cc: Andrew Morton, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
 patches@lists.linux.dev, Kees Cook, linux-hardening@vger.kernel.org,
 Alexander Halbuer
Subject: Re: [PATCH] mm, page_alloc: reduce page alloc/free sanity checks
In-Reply-To: <20230405124519.ir7y54aunmyg3tcn@techsingularity.net>
References: <20230216095131.17336-1-vbabka@suse.cz>
 <20230405124519.ir7y54aunmyg3tcn@techsingularity.net>

On 4/5/23 14:45, Mel Gorman wrote:
> On Thu, Feb 16, 2023 at 10:51:31AM +0100, Vlastimil Babka wrote:
>> Historically, we have performed sanity checks on all struct pages being
>> allocated or freed, making sure they have no unexpected page flags or
>> certain field values. This can detect insufficient cleanup and some
>> cases of use-after-free, although on its own it can't always identify
>> the culprit. The result is a warning and the "bad page" being leaked.
>>
>> The checks do need some cpu cycles, so in 4.7 with commits 479f854a207c
>> ("mm, page_alloc: defer debugging checks of pages allocated from the
>> PCP") and 4db7548ccbd9 ("mm, page_alloc: defer debugging checks of freed
>> pages until a PCP drain") they were no longer performed in the hot paths
>> when allocating and freeing from pcplists, but only when pcplists are
>> bypassed, refilled or drained. For debugging purposes, with
>> CONFIG_DEBUG_VM enabled the checks were instead still done in the
>> hot paths and not when refilling or draining pcplists.
>>
>> With 4462b32c9285 ("mm, page_alloc: more extensive free page checking
>> with debug_pagealloc"), enabling debug_pagealloc also moved the sanity
>> checks back to hot paths. When both debug_pagealloc and CONFIG_DEBUG_VM
>> are enabled, the checks are done both in hot paths and pcplist
>> refill/drain.
>>
>> Even though the non-debug default today might seem to be a sensible
>> tradeoff between overhead and ability to detect bad pages, on closer
>> look it's arguably not. As most allocations go through the pcplists,
>> catching any bad pages when refilling or draining pcplists has only a
>> small chance, insufficient for debugging or serious hardening purposes.
>> On the other hand, the cost of the checks is concentrated in the already
>> expensive drain/refill batching operations, and those are done under the
>> often contended zone lock. That was recently identified as an issue for
>> page allocation, and the zone lock contention was reduced by moving the
>> checks outside of the locked section with the patch "mm: reduce lock
>> contention of pcp buffer refill", but the cost of the checks is still
>> visible compared to their removal [1]. In the pcplist draining path
>> free_pcppages_bulk() the checks are still done under zone->lock.
>>
>> Thus, remove the checks from the pcplist refill and drain paths completely.
>> Introduce a static key check_pages_enabled to control checks during page
>> allocation and freeing (whether the pcplist is used or bypassed). The static
>> key is enabled if any of the following is true:
>> - kernel is built with CONFIG_DEBUG_VM=y (debugging)
>> - debug_pagealloc or page poisoning is boot-time enabled (debugging)
>> - init_on_alloc or init_on_free is boot-time enabled (hardening)
>>
>> The resulting user visible changes:
>> - no checks when draining/refilling pcplists - less overhead, with
>> likely no practical reduction of ability to catch bad pages
>> - no checks when bypassing pcplists in default config (no
>> debugging/hardening) - less overhead etc.
>> as above
>> - on typical hardened kernels [2], checks are now performed on each page
>> allocation/free (previously only when bypassing/draining/refilling
>> pcplists) - init_on_alloc/init_on_free being enabled should be a sufficient
>> indication for preferring more costly alloc/free operations for
>> hardening purposes, and we shouldn't need to introduce another toggle
>> - code (various wrappers) removal and simplification
>>
>> [1] https://lore.kernel.org/all/68ba44d8-6899-c018-dcb3-36f3a96e6bea@sra.uni-hannover.de/
>> [2] https://lore.kernel.org/all/63ebc499.a70a0220.9ac51.29ea@mx.google.com/
>>
>> Reported-by: Alexander Halbuer
>> Reported-by: Andrew Morton
>> Signed-off-by: Vlastimil Babka
>
> Acked-by: Mel Gorman

Thanks.

> Some minor comments below
>
>> @@ -1432,9 +1448,11 @@ static __always_inline bool free_pages_prepare(struct page *page,
>>  		for (i = 1; i < (1 << order); i++) {
>>  			if (compound)
>>  				bad += free_tail_pages_check(page, page + i);
>
> free_tail_pages_check is also a function that only does something useful
> when CONFIG_DEBUG_VM is set. While it might be outside the scope of the
> patch, it might also benefit from check_pages_enabled checks?

True, will send a followup. Will also rename it to free_tail_page_prepare(),
as it in fact combines a preparation component with an optional checks
component. Will remove the unlikely()s you pointed out as well.

>> -			if (unlikely(free_page_is_bad(page + i))) {
>> -				bad++;
>> -				continue;
>> +			if (static_branch_unlikely(&check_pages_enabled)) {
>> +				if (unlikely(free_page_is_bad(page + i))) {
>> +					bad++;
>> +					continue;
>> +				}
>>  			}
>>  			(page + i)->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
>>  		}
>
> The unlikely() within a static_branch_unlikely probably adds very little
> given the block is so tiny.
>
>> @@ -2392,56 +2369,20 @@ static inline int check_new_page(struct page *page)
>>  	return 1;
>>  }
>>
>> -static bool check_new_pages(struct page *page, unsigned int order)
>> +static inline bool check_new_pages(struct page *page, unsigned int order)
>>  {
>> -	int i;
>> -	for (i = 0; i < (1 << order); i++) {
>> -		struct page *p = page + i;
>> +	if (static_branch_unlikely(&check_pages_enabled)) {
>> +		for (int i = 0; i < (1 << order); i++) {
>> +			struct page *p = page + i;
>>
>> -		if (unlikely(check_new_page(p)))
>> -			return true;
>> +			if (unlikely(check_new_page(p)))
>> +				return true;
>> +		}
>>  	}
>>
>
> unlikely() within static_branch_unlikely probably adds very little.
>
> Otherwise, looks good. I agree that, with changes over time, the ability
> of the checks to detect anything has been reduced, and it's probably at the
> point where it can only detect a very specific bit corruption instead of
> broken code. Commit 44042b449872 ("mm/page_alloc: allow high-order pages to
> be stored on the per-cpu lists") also likely reduced the ability of the
> checks to find anything.