Date: Tue, 19 Apr 2022 21:42:17 +0300
From: Mike Rapoport
To: Song Liu
Cc: "Edgecombe, Rick P", "mcgrof@kernel.org", "linux-kernel@vger.kernel.org",
 "bpf@vger.kernel.org", "hch@infradead.org", "ast@kernel.org",
 "daniel@iogearbox.net", "Torvalds, Linus", "linux-mm@kvack.org",
 "song@kernel.org", Kernel Team, "pmladek@suse.com",
 "akpm@linux-foundation.org", "hpa@zytor.com", "dborkman@redhat.com",
 "edumazet@google.com", "bp@alien8.de", "mbenes@suse.cz",
 "imbrenda@linux.ibm.com"
Subject: Re: [PATCH v4 bpf 0/4] vmalloc: bpf: introduce VM_ALLOW_HUGE_VMAP
References: <20220415164413.2727220-1-song@kernel.org>
 <4AD023F9-FBCE-4C7C-A049-9292491408AA@fb.com>
 <88eafc9220d134d72db9eb381114432e71903022.camel@intel.com>
Hi,

On Tue, Apr 19, 2022 at 05:36:45AM +0000, Song Liu wrote:
> Hi Mike, Luis, and Rick,
> 
> Thanks for sharing your work and findings in this space. I didn't
> realize we were looking at the same set of problems.
> 
> > On Apr 18, 2022, at 6:56 PM, Edgecombe, Rick P wrote:
> > 
> > On Mon, 2022-04-18 at 17:44 -0700, Luis Chamberlain wrote:
> >>> There are use-cases that require 4K pages with non-default
> >>> permissions in the direct map, and the pages should not
> >>> necessarily be executable. There were several suggestions to
> >>> implement caches of 4K pages backed by 2M pages.
> >> 
> >> Even if we just focus on the executable side of the story... there
> >> may be users who can share this too.
> >> 
> >> I've gone down memory lane now, at least down to year 2005 in
> >> kprobes, to see why the heck module_alloc() was used. At first
> >> glance there are some old comments about being within the 2 GiB
> >> kernel text range... but some old tribal knowledge is still lost.
> >> The real hints come from kprobe work since commit 9ec4b1f356b3
> >> ("[PATCH] kprobes: fix single-step out of line - take2"), namely
> >> "For the %rip-relative displacement fixups to be doable"... but
> >> this got me wondering, would other users who *do* want similar
> >> functionality benefit from a cache? If the space is limited then
> >> using a cache makes sense. Especially if architectures tend to
> >> require hacks for some of this to all work.
> > 
> > Yea, that was my understanding. X86 modules have to be linked within
> > 2GB of the kernel text, and the eBPF x86 JIT also generates code
> > that expects to be within 2GB of the kernel text.
> > 
> > I think of two types of caches we could have: caches of unmapped
> > pages on the direct map and caches of virtual memory mappings.
> > Caches of pages on the direct map reduce breakage of the large pages
> > (which is a somewhat x86-specific problem). Caches of virtual memory
> > mappings reduce shootdowns, and are also required to share huge
> > pages. I'll plug my old RFC, where I tried to work towards enabling
> > both:
> > 
> > https://lore.kernel.org/lkml/20201120202426.18009-1-rick.p.edgecombe@intel.com/
> > 
> > Since then Mike has taken the direct map cache piece a lot further.
> 
> This is really interesting work. With this landed, we won't need the
> bpf_prog_pack work at all (I think). OTOH, this looks like a long
> term project, as some of the work in bpf_prog_pack took quite some
> time to discuss/debate, and that was just a subset of the whole
> thing.

I'd say that bpf_prog_pack was a cure for the symptoms, while this
project tries to address the more general problem. But you are right,
it'll take some time and won't land in 5.19.

> I really like the two types of cache concept. But there are some
> details I cannot figure out about them:

After some discussions we decided to try moving the caching of large
pages into the page allocator and to see whether the second cache will
be needed at all. But I got distracted after posting the RFC, and that
work hasn't made real progress since then.

> 1. Is "caches of unmapped pages on direct map" (cache #1)
>    sufficient to fix all direct map fragmentation? IIUC, pages in
>    the cache may still be used by other allocations (under some
>    memory pressure). If the system runs for long enough, there
>    may be a lot of direct map fragmentation. Is this right?

If the system runs long enough, it may run out of high-order free
pages regardless of how the caches are implemented. Then we either
fail the allocation, because it is impossible to refill the cache with
large pages, or we fall back to 4k pages and fragment the direct map.

I don't see how we can avoid direct map fragmentation entirely and
still be able to allocate memory for users of the set_memory APIs.
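Just to spell out what "fragment the direct map" means here, a minimal
sketch of the usual pattern (illustrative only; the helper name is made
up and any real caller has more flags and error handling):

#include <linux/vmalloc.h>
#include <linux/set_memory.h>

/* Illustrative only: allocate one 4K page and make it RO+X. */
static void *alloc_one_rox_page(void)
{
	void *buf = vmalloc(PAGE_SIZE);

	if (!buf)
		return NULL;

	/* Restore default permissions when the area is vfree()d. */
	set_vm_flush_reset_perms(buf);

	/* W^X: drop write, then make the page executable. */
	set_memory_ro((unsigned long)buf, 1);
	set_memory_x((unsigned long)buf, 1);

	/*
	 * Changing the permissions of this single 4K page also updates
	 * its alias in the direct map, so the 2M (or 1G) entry covering
	 * it has to be split into 4K entries, and the split sticks
	 * around even after the allocation is freed.
	 */
	return buf;
}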
> 2. If we have "cache of virtual memory mappings" (cache #2), do we
>    still need cache #1? I know cache #2 alone may waste some
>    memory, but I still think 2MB is within the noise for modern
>    systems.

I presume that by cache #1 you mean the cache in the page allocator.
In that case cache #2 is probably not needed at all, because the cache
at the page allocator level will be used by vmalloc() and friends to
provide what Rick called "permissioned allocations".

> Thanks,
> Song

--
Sincerely yours,
Mike.