From: Andy Lutomirski
Date: Sat, 9 May 2020 12:05:29 -0700
Subject: Re: [RFC PATCH 0/7] mm: Get rid of vmalloc_sync_(un)mappings()
To: Joerg Roedel
Cc: Andy Lutomirski, Joerg Roedel, X86 ML, "H. Peter Anvin", Dave Hansen,
 Peter Zijlstra, "Rafael J. Wysocki", Arnd Bergmann, Andrew Morton,
 Steven Rostedt, Vlastimil Babka, Michal Hocko, LKML, Linux ACPI,
 linux-arch, Linux-MM
In-Reply-To: <20200509175217.GV8135@suse.de>
References: <20200508144043.13893-1-joro@8bytes.org>
 <20200508213609.GU8135@suse.de> <20200509175217.GV8135@suse.de>
X-Mailing-List: linux-kernel@vger.kernel.org

On Sat, May 9, 2020 at 10:52 AM Joerg Roedel wrote:
>
> On Fri, May 08, 2020 at 04:49:17PM -0700, Andy Lutomirski wrote:
> > On Fri, May 8, 2020 at 2:36 PM Joerg Roedel wrote:
> > >
> > > On Fri, May 08, 2020 at 02:33:19PM -0700, Andy Lutomirski wrote:
> > > > On Fri, May 8, 2020 at 7:40 AM Joerg Roedel wrote:
> > > >
> > > > What's the maximum on other system types? It might make more sense to
> > > > take the memory hit and pre-populate all the tables at boot so we
> > > > never have to sync them.
> > >
> > > Need to look it up for 5-level paging; with 4-level paging it's 64 pages
> > > to pre-populate the vmalloc area.
> > >
> > > But that would not solve the problem on x86-32, which needs to
> > > synchronize unmappings on the PMD level.
> >
> > What changes in this series with x86-32?
>
> This series sets ARCH_PAGE_TABLE_SYNC_MASK to PGTBL_PMD_MODIFIED, so
> that the synchronization happens every time PMD(s) in the vmalloc areas
> are changed. Before this series, this synchronization only happened at
> arbitrary places calling vmalloc_sync_(un)mappings().
>
> > We already do that synchronization, right? IOW, in the cases where
> > the vmalloc *fault* code does anything at all, we should have a small
> > bound for how much memory to preallocate and, if we preallocate it,
> > then there is nothing to sync and nothing to fault. And we have the
> > benefit that we never need to sync anything on 64-bit, which is kind
> > of nice.
>
> Don't really get you here, what is pre-allocated and why is there no
> need to sync and fault then?
>
> > Do we actually need PMD-level things for 32-bit? What if we just
> > outlawed huge pages in the vmalloc space on 32-bit non-PAE?
>
> Disallowing huge-pages would at least remove the need to sync
> unmappings, but we still need to sync new PMD entries. Remember that the
> size of the vmalloc area on 32 bit is dynamic and depends on the VM-split
> and the actual amount of RAM on the system.
>
> A machine with 512MB of RAM and a 1G/3G split will have around 2.5G of
> VMALLOC address space. And if we want to avoid vmalloc-faults there, we
> need to pre-allocate all PTE pages for that area (and the amount of PTE
> pages needed increases when RAM decreases).
>
> On a machine with 512M of RAM we would need ca. 1270+ PTE pages, which
> is around 5M (or 1% of total system memory).
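Just to make sure we're debating the same mechanism: my mental model of
what your series does on 32-bit is roughly the sketch below. This is my
reconstruction, not your actual patch -- I'm reusing the existing 32-bit
vmalloc_sync_one() helper for the per-pgd copy and eliding the Xen
page_table_lock dance:

#define ARCH_PAGE_TABLE_SYNC_MASK	PGTBL_PMD_MODIFIED

/* Called by core mm whenever a vmalloc PMD in [start, end] changed. */
void arch_sync_kernel_mappings(unsigned long start, unsigned long end)
{
        unsigned long addr;

        for (addr = start & PMD_MASK; addr <= (end & PMD_MASK);
             addr += PMD_SIZE) {
                struct page *page;

                spin_lock(&pgd_lock);
                /* Copy init_mm's PMD entry into every pgd in the system. */
                list_for_each_entry(page, &pgd_list, lru)
                        vmalloc_sync_one(page_address(page), addr);
                spin_unlock(&pgd_lock);
        }
}

If I have that right, then the real question is whether we need this
hook at all.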
I can never remember which P?D name goes with which level and which
machine type, but I don't think I agree with your math regardless. On
x86, there are two fundamental situations that can occur:

1. Non-PAE. There is a single 4k top-level page table per mm, and this
table contains either 512 or 1024 entries total. Of those entries, some
fraction (half or less) control the kernel address space, and some
fraction of *that* is for vmalloc space. Those entries are the *only*
thing that needs syncing -- all mms will either have null (not present)
in those slots or will have pointers to the *same* next-level-down
directories.

2. PAE. Depending on your perspective, there could be a grand total of
four top-level paging pointers, of which one (IIRC) is for the kernel.
That points to the same place for all mms. Or, if you look at it the
other way, PAE is just like #1 except that the top-level table has only
four entries and only one points to vmalloc space.

So, unless I'm missing something here, there is an absolute maximum of
512 top-level entries that ever need to be synchronized.

Now, there's an additional complication. On x86_64, we have a rule:
those entries that need to be synced start out null and may, during the
lifetime of the system, change *once*. They are never unmapped or
modified after being allocated. This means that those entries can only
ever point to a page *table* and not to a ginormous page. So, even if
the hardware were to support ginormous pages (which, IIRC, it doesn't),
we would be limited to merely immense and not ginormous pages in the
vmalloc range.

On x86_32, I don't think we have this rule right now, which means it's
possible for one of these entries to be unmapped or modified. So my
suggestion is to just apply the x86_64 rule to x86_32 as well. The
practical effect would be that 2-level-paging systems could not use
huge pages in the vmalloc range, since the rule would be that the
vmalloc-relevant entries in the top-level table must point to page
*tables* instead of huge pages.

On top of this, if we preallocate these entries, then the maximum
amount of memory we can possibly waste is 4k * (entries pointing to
vmalloc space - entries actually used for vmalloc space). I don't know
what this number typically is, but I don't think it's very large.
Preallocating means that vmalloc faults *and* synchronization go away
entirely. All of the page tables used for vmalloc will be entirely
shared by all mms, so all that's needed to modify vmalloc mappings is
to update init_mm and, if needed, flush TLBs. No other page tables will
need modification at all. (A sketch of what I mean by preallocation is
at the end of this mail.)

On x86_64, the only real advantage is that the handful of corner cases
that make vmalloc faults unpleasant (mostly relating to vmap stacks)
go away. On x86_32, a bunch of mind-bending stuff (everything your
series deletes, but also almost everything your series *adds*) goes
away. There may be a genuine tiny performance hit on 2-level systems
due to the loss of huge pages in vmalloc space, but I'm not sure I
care, or that we even use them on these systems. And PeterZ can stop
even thinking about RCU.

Am I making sense?

(Aside: I *hate* the PMD, etc. terminology. Even the kernel's C types
can't keep track of whether pmd_t * points to an entire paging
directory or to a single entry. Similarly, everyone knows that a pte_t
is a "page table entry", except that pte_t * might instead be a pointer
to an array of 512 or 1024 page table entries.)
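PS: To make "preallocate these entries" concrete, here's the kind of
boot-time loop I have in mind for the 4/5-level case. Rough and
untested, the function name is made up, and the 32-bit version would
populate PMD pages instead:

static void __init preallocate_vmalloc_pagetables(void)
{
        unsigned long addr;

        /*
         * Populate init_mm's top-level entries covering the vmalloc
         * range once, so they never change again. After this there is
         * nothing to sync and nothing to vmalloc-fault on.
         */
        for (addr = VMALLOC_START; addr <= VMALLOC_END;
             addr = ALIGN(addr + 1, PGDIR_SIZE)) {
                pgd_t *pgd = pgd_offset_k(addr);
                p4d_t *p4d;

                /* p4d_alloc()/pud_alloc() allocate only if still empty. */
                p4d = p4d_alloc(&init_mm, pgd, addr);
                if (!p4d)
                        panic("Failed to pre-allocate p4d pages for vmalloc area\n");

                /* With 5-level paging, the p4d page fixes the pgd entry. */
                if (!pgtable_l5_enabled() && !pud_alloc(&init_mm, p4d, addr))
                        panic("Failed to pre-allocate pud pages for vmalloc area\n");
        }
}

With 4-level paging, that's the 64 entries you counted above: 64 4k
page tables, allocated once at boot, after which vmalloc_fault() and
all the sync machinery could just be deleted.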