Date: Mon, 18 Apr 2022 17:44:19 -0700
From: Luis Chamberlain
To: Mike Rapoport
Cc: Song Liu, Linus Torvalds, Christoph Hellwig, Song Liu, bpf,
 Linux Memory Management List, open list, Alexei Starovoitov,
 Daniel Borkmann, Kernel Team, Andrew Morton, "Edgecombe, Rick P",
 Claudio Imbrenda, Borislav Petkov, Petr Mladek, Miroslav Benes,
 Eric Dumazet, Daniel Borkmann, "H. Peter Anvin"
Subject: Re: [PATCH v4 bpf 0/4] vmalloc: bpf: introduce VM_ALLOW_HUGE_VMAP
References: <20220415164413.2727220-1-song@kernel.org>
 <4AD023F9-FBCE-4C7C-A049-9292491408AA@fb.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, Apr 18, 2022 at 01:06:36PM +0300, Mike Rapoport wrote:
> Hi,
>
> On Sat, Apr 16, 2022 at 10:26:08PM +0000, Song Liu wrote:
> > > On Apr 16, 2022, at 1:30 PM, Linus Torvalds wrote:
> > >
> > > Maybe I am missing something, but I really don't think this is ready
> > > for prime-time. We should effectively disable it all, and have people
> > > think through it a lot more.
> >
> > This has been discussed on lwn.net: https://lwn.net/Articles/883454/.
> > AFAICT, the biggest concern is whether reserving minimal 2MB for BPF
> > programs is a good trade-off for memory usage. This is again my fault
> > not to state the motivation clearly: the primary gain comes from less
> > page table fragmentation and thus better iTLB efficiency.
>
> Reserving 2MB pages for BPF programs will indeed reduce the fragmentation,
> but OTOH it will reduce memory utilization. If for large systems this may
> not be an issue, on smaller machines trading off memory for iTLB
> performance may be not that obvious.

So the current optimization at best should be a kconfig option?

> > Other folks (in recent thread on this topic and offline in other
> > discussions) also showed strong interests in using similar technical
> > for text of kernel modules. So I would really like to learn your
> > opinion on this. There are many details we can optimize, but I guess
> > the general mechanism has to be something like:
> >  - allocate a huge page, make it safe, and set it as executable;
> >  - as users (BPF, kernel module, etc.) request memory for text, give
> >    a chunk of the huge page to the user.
> >  - use some mechanism to update the chunk of memory safely.
>
> There are use-cases that require 4K pages with non-default permissions in
> the direct map and the pages not necessarily should be executable. There
> were several suggestions to implement caches of 4K pages backed by 2M
> pages.

Even if we just focus on the executable side of the story... there may
be users who can share this too.

I've gone down memory lane, at least as far back as 2005 in kprobes, to
see why the heck module_alloc() was used. At first glance there are some
old comments about being within the 2 GiB kernel text range... But some
old tribal knowledge is still lost. The real hints come from kprobes work
since commit 9ec4b1f356b3 ("[PATCH] kprobes: fix single-step out of
line - take2"), namely the "For the %rip-relative displacement fixups to
be doable" rationale... but this got me wondering: would other users who
*do* want similar functionality benefit from a cache? If the space is
limited then using a cache makes sense. Especially if architectures tend
to require hacks for some of this to all work.
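To make the quoted "general mechanism" a bit more concrete, here is a
minimal userspace sketch of handing out chunks from one 2 MiB region and
taking them back. Everything in it is an assumption for illustration: the
names pack_alloc()/pack_free(), the 64-byte granule, and the static buffer
standing in for a huge page. It does not touch page permissions or safe
text updates at all, and it is not how the patches under discussion are
implemented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PACK_SIZE  (2u * 1024 * 1024)  /* one 2 MiB "huge page" (assumed) */
#define CHUNK_SIZE 64u                 /* hypothetical allocation granule */
#define NCHUNKS    (PACK_SIZE / CHUNK_SIZE)

/* Stand-in for the huge page; real code would map it executable. */
static _Alignas(4096) uint8_t pack[PACK_SIZE];
static uint8_t used[NCHUNKS];          /* 1 = chunk is handed out */

/* First fit: find enough contiguous free chunks and mark them used. */
void *pack_alloc(size_t size)
{
	size_t n = (size + CHUNK_SIZE - 1) / CHUNK_SIZE;
	size_t run = 0;

	if (size == 0 || n > NCHUNKS)
		return NULL;

	for (size_t i = 0; i < NCHUNKS; i++) {
		run = used[i] ? 0 : run + 1;
		if (run == n) {
			size_t start = i + 1 - n;

			memset(&used[start], 1, n);
			return pack + start * CHUNK_SIZE;
		}
	}
	return NULL; /* full, or too fragmented for this request */
}

/* Return a chunk range; the caller passes the size it asked for. */
void pack_free(void *p, size_t size)
{
	size_t off = (size_t)((uint8_t *)p - pack);
	size_t n = (size + CHUNK_SIZE - 1) / CHUNK_SIZE;

	assert(off % CHUNK_SIZE == 0 && off / CHUNK_SIZE + n <= NCHUNKS);
	memset(&used[off / CHUNK_SIZE], 0, n);
}
```

The point of the sketch is only the trade-off being argued about: many
small text users share one large mapping, at the cost of reserving the
whole 2 MiB even when little of it is in use.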

Then, since it seems the vmalloc area was not initialized, wouldn't that
break the old JIT spray fixes? Refer to commit 314beb9bcabfd ("x86:
bpf_jit_comp: secure bpf jit against spraying attacks"). Is that sort of
work not needed anymore? If in doubt, I at least made the old proof of
concept JIT spray stuff compile on recent kernels [0], but I haven't tried
out your patches yet. If this is not needed anymore, why not?

It would be good not to lose the tribal knowledge around these sorts of
things, and if we can share it, even better.

> I believe that "allocate huge page and split it to basic pages to hand out
> to users" concept should be implemented at page allocator level and I
> posted and RFC for this a while ago:
>
> https://lore.kernel.org/all/20220127085608.306306-1-rppt@kernel.org/

Neat, so although eBPF is a big user, are there some use cases outside
that immediately benefit?

[0] https://github.com/mcgrof/jit-spray-poc-for-ksp

  Luis