Received: by 2002:a05:6a10:22f:0:0:0:0 with SMTP id 15csp2607803pxk; Sun, 20 Sep 2020 10:06:52 -0700 (PDT) X-Google-Smtp-Source: ABdhPJwUohFfwLMl7vBkxdLf1Og7i5eMr2dxL1771rOVMvFsB3kvh5wtGfrKwiXDTx4uJnVMmgea X-Received: by 2002:a17:906:f897:: with SMTP id lg23mr48956899ejb.89.1600621612730; Sun, 20 Sep 2020 10:06:52 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1600621612; cv=none; d=google.com; s=arc-20160816; b=TPa3HyJ6VwPjwYGoclNLrBM/3ffoAGdRLoF6Wt0JbWBBLeVKGV0eoNqa95zLrLrfxV C8vxgP/2u2c5QvQZG8mG/QfL7a2HaKzO4Z4aBz0ZG5R0xdN7xeoc4INeC38htF9nUVNB sAJ3o9vibILTxBHNwxXz08PL+U+A0MPobeGZJFfV9r5IlkCHWe7PZLxzLWO2BsU9eCJn vLaZj/FiqtdizZIXGNOo6hVVIZVe0CUbqYoQ4lAr2zN8Ra62W/VIQsI1/XcGIHdcppio PPB2rC1LXqBnbiso96YymkThFUK/Jb3HmsxPmBT0vAF0taWM3cInJ1pN89dCU0bgYEeh Owjg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:cc:to:subject:message-id:date:from:in-reply-to :references:mime-version:dkim-signature; bh=Govkq2c7yVsAEyCudSuE8qhnS6+zy4O8TiKkmkv2nIk=; b=lNY019ijbKOBmb3A7HVGe8H4ZoGRLJ8T8JUrmDLVVATjYO97FOID5WudtT/PTlspIl Dn1U6F8XfItioOQnVtQNHPmyNbIJrxUPKnKsGmMO7KC0p2NPUcTPK0bGQhqPeaqIgRI8 ifQdyjDNofYzEUbJEYIvzXMZj3k53hfYYcycUQ4Dqm7KGLOYCvm1XonRS6IVasgydH8q 6INCr8tlJEqo1pFj3bjpAIglrfWC/VuuR9gYhQmTsLdK7sKrXgKNcbXtCkVEVh5x/iz9 gUhcebjk2nluNgfEil0WLfAcCa1oNlrgdT1NRQdsxTK6Tln0xKopEMJvKOpcCAU5pH8z sKgQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@linux-foundation.org header.s=google header.b=O7YDY5EZ; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id l19si6415788edj.152.2020.09.20.10.06.29; Sun, 20 Sep 2020 10:06:52 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; dkim=pass header.i=@linux-foundation.org header.s=google header.b=O7YDY5EZ; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726333AbgITRDZ (ORCPT + 99 others); Sun, 20 Sep 2020 13:03:25 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:41628 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726267AbgITRDY (ORCPT ); Sun, 20 Sep 2020 13:03:24 -0400 Received: from mail-ej1-x644.google.com (mail-ej1-x644.google.com [IPv6:2a00:1450:4864:20::644]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 54F29C061755 for ; Sun, 20 Sep 2020 10:03:24 -0700 (PDT) Received: by mail-ej1-x644.google.com with SMTP id gr14so14622165ejb.1 for ; Sun, 20 Sep 2020 10:03:24 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux-foundation.org; s=google; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=Govkq2c7yVsAEyCudSuE8qhnS6+zy4O8TiKkmkv2nIk=; b=O7YDY5EZgHd0ux+0JI623n4+xCNSLKkbNn75HK0PfnOR4wlB9CfHJM9R8MpDNxC4Bt StinnqnJ77DBvptD2dta7OTELHl1G548xZdBUVNIayCFE6KgnVNuKWIqgtvMj+8UMN/e 7pSxwfkCe8prPQ3KSbQLetFJedeblKFxQgFMg= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=Govkq2c7yVsAEyCudSuE8qhnS6+zy4O8TiKkmkv2nIk=; b=W4HUdF5y8IL46li/aLWtqF1XJO6kSUNyNg6lQQlq/lHa09QYfyRuIdGDfLmuGoashc 83+sSg8r4ohTOPsGgHOTDX6qHwsouAY/v6fpdZCCgYQJkl9j9MXkxTiUMxbLIJvpzCCu 42aLHOyMwVjBWHWNqUM6rhcX9wijwR1WFrrsPR++y+L1mW//WOYh/0HdhLuSkDVsydnB L6zl7lUqy2UyzOKbv1xrByqblI8VXR5bRkA88pP3c6bQKqbSB7M8eJ4fRC3Tn+cnylaH HWdYPYqoqyYFn0Q+t40G3z/KbXwGjH0SOD0X4qhCB47AoB/9A4e37EY6QdcBbtUWNm+T HiWQ== X-Gm-Message-State: AOAM5320R4eEFICC6JlHBJizV7idZPF0jgBhuIzRGIwNmSsF1+/H91bn pUqw42ZbhY33Xu+4YUfrY2J30dXcISIdbA== X-Received: by 2002:a17:906:2cd2:: with SMTP id r18mr48010495ejr.371.1600621402484; Sun, 20 Sep 2020 10:03:22 -0700 (PDT) Received: from mail-ej1-f41.google.com (mail-ej1-f41.google.com. [209.85.218.41]) by smtp.gmail.com with ESMTPSA id lo25sm6938175ejb.53.2020.09.20.10.03.22 for (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Sun, 20 Sep 2020 10:03:22 -0700 (PDT) Received: by mail-ej1-f41.google.com with SMTP id j11so14628391ejk.0 for ; Sun, 20 Sep 2020 10:03:22 -0700 (PDT) X-Received: by 2002:ac2:5594:: with SMTP id v20mr15170982lfg.344.1600621076988; Sun, 20 Sep 2020 09:57:56 -0700 (PDT) MIME-Version: 1.0 References: <20200919091751.011116649@linutronix.de> <87mu1lc5mp.fsf@nanos.tec.linutronix.de> <87k0wode9a.fsf@nanos.tec.linutronix.de> In-Reply-To: <87k0wode9a.fsf@nanos.tec.linutronix.de> From: Linus Torvalds Date: Sun, 20 Sep 2020 09:57:40 -0700 X-Gmail-Original-Message-ID: Message-ID: Subject: Re: [patch RFC 00/15] mm/highmem: Provide a preemptible variant of kmap_atomic & friends To: Thomas Gleixner Cc: LKML , linux-arch , Paul McKenney , "the arch/x86 maintainers" , Sebastian Andrzej Siewior , Peter Zijlstra , Juri Lelli , Vincent Guittot , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Daniel Bristot de Oliveira , Will Deacon , Andrew Morton , Linux-MM , Russell King , Linux ARM , Chris Zankel , Max Filippov , linux-xtensa@linux-xtensa.org, Jani Nikula , Joonas Lahtinen , Rodrigo Vivi , David Airlie , Daniel Vetter , intel-gfx , dri-devel , Ard Biesheuvel , Herbert Xu , Vineet Gupta , "open list:SYNOPSYS ARC ARCHITECTURE" , Arnd Bergmann , Guo Ren , linux-csky@vger.kernel.org, Michal Simek , Thomas Bogendoerfer , linux-mips@vger.kernel.org, Nick Hu , Greentime Hu , Vincent Chen , Michael Ellerman , Benjamin Herrenschmidt , Paul Mackerras , linuxppc-dev , "David S. Miller" , linux-sparc Content-Type: text/plain; charset="UTF-8" Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sun, Sep 20, 2020 at 1:49 AM Thomas Gleixner wrote: > > Actually most usage sites of kmap atomic do not need page faults to be > disabled at all. Right. I think the pagefault disabling has (almost) nothing at all to do with the kmap() itself - it comes from the "atomic" part, not the "kmap" part. I say *almost*, because there is one issue that needs some thought: the amount of kmap nesting. The kmap_atomic() interface - and your local/temporary/whatever versions of it - depends very much inherently on being strictly nesting. In fact, it depends so much on it that maybe that should be part of the new name? It's very wrong to do addr1 = kmap_atomic(); addr2 = kmap_atomic(); ..do something with addr 1.. kunmap_atomic(addr1); .. do something with addr 2.. kunmap_atomic(addr2); because the way we allocate the slots is by using a percpu-atomic inc-return (and we deallocate using dec). So it's fundamentally a stack. And that's perfectly fine for page faults - if they do any kmaps, those will obviously nest. So the only issue with page faults might be that the stack grows _larger_. And that might need some thought. We already make the kmap stack bigger for CONFIG_DEBUG_HIGHMEM, and it's possibly that if we allow page faults we need to make the kmap stack bigger still. Btw, looking at the stack code, Ithink your new implementation of it is a bit scary: static inline int kmap_atomic_idx_push(void) { - int idx = __this_cpu_inc_return(__kmap_atomic_idx) - 1; + int idx = current->kmap_ctrl.idx++; and now that 'current->kmap_ctrl.idx' is not atomic wrt (a) NMI's (this may be ok, maybe we never do kmaps in NMIs, and with nesting I think it's fine anyway - the NMI will undo whatever it did) (b) the prev/next switch And that (b) part worries me. You do the kmap_switch_temporary() to switch the entries, but you do that *separately* from actually switching 'current' to the new value. So kmap_switch_temporary() looks safe, but I don't think it actually is. Because while it first unmaps the old entries and then remaps the new ones, an interrupt can come in, and at that point it matters what is *CURRENT*. And regardless of whether 'current' is 'prev' or 'next', that kmap_switch_temporary() loop may be doing the wrong thing, depending on which one had the deeper stack. The interrupt will be using whatever "current->kmap_ctrl.idx" is, but that might overwrite entries that are in the process of being restored (if current is still 'prev', but kmap_switch_temporary() is in the "restore @next's kmaps" pgase), or it might stomp on entries that have been pte_clear()'ed by the 'prev' thing. I dunno. The latter may be one of those "it works anyway, it overwrites things we don't care about", but the former will most definitely not work. And it will be completely impossible to debug, because it will depend on an interrupt that uses kmap_local/atomic/whatever() coming in _just_ at the right point in the scheduler, and only when the scheduler has been entered with the right number of kmap entries on the prev/next stack. And no developer will ever see this with any amount of debug code enabled, because it will only hit on legacy platforms that do this kmap anyway. So honestly, that code scares me. I think it's buggy. And even if it "happens to work", it does so for all the wrong reasons, and is very fragile. So I would suggest: - continue to use an actual per-cpu kmap_atomic_idx - make the switching code save the old idx, then unmap the old entries one by one (while doing the proper "pop" action), and then map the new entries one by one (while doing the proper "push" action). which would mean that the only index that is actually ever *USED* is the percpu one, and it's always up-to-date and pushed/popped for individual entries, rather than this - imho completely bogus - optimization where you use "p->kmap_ctrl.idx" directly and very very unsafely. Alternatively, that process counter would need about a hundred lines of commentary about exactly why it's safe. Because I don't think it is. Linus