In-Reply-To: <20221003222133.20948-5-aliraza@bu.edu>
References: <20221003222133.20948-1-aliraza@bu.edu>
 <20221003222133.20948-5-aliraza@bu.edu>
Date: Tue, 04 Oct 2022 10:43:47 -0700
From: "Andy Lutomirski"
To: "Ali Raza", "Linux Kernel Mailing List"
Cc: "Jonathan Corbet", masahiroy@kernel.org, michal.lkml@markovi.net,
 "Nick Desaulniers", "Thomas Gleixner", "Ingo Molnar", "Borislav Petkov",
 "Dave Hansen", "H. Peter Anvin", "Eric W.
 Biederman", "Kees Cook", "Peter Zijlstra (Intel)", "Al Viro",
 "Arnd Bergmann", juri.lelli@redhat.com, vincent.guittot@linaro.org,
 dietmar.eggemann@arm.com, "Steven Rostedt", "Ben Segall", mgorman@suse.de,
 bristot@redhat.com, vschneid@redhat.com, "Paolo Bonzini",
 jpoimboe@kernel.org, linux-doc@vger.kernel.org,
 linux-kbuild@vger.kernel.org, linux-mm@kvack.org,
 linux-fsdevel@vger.kernel.org, linux-arch@vger.kernel.org,
 "the arch/x86 maintainers", rjones@redhat.com, munsoner@bu.edu,
 tommyu@bu.edu, drepper@redhat.com, lwoodman@redhat.com,
 mboydmcse@gmail.com, okrieg@bu.edu, rmancuso@bu.edu,
 "Daniel Bristot de Oliveira"
Subject: Re: [RFC UKL 04/10] x86/entry: Create alternate entry path for system calls
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, Oct 3, 2022, at 3:21 PM, Ali Raza wrote:
> If a UKL application makes a system call, it won't go through with the
> syscall assembly instruction. Instead, the application will use the call
> instruction to go to the kernel entry point. Instead of adding checks to
> the normal entry_SYSCALL_64 to see if we came here from a UKL task or a
> normal application task, we create a totally new entry point called
> ukl_entry_SYSCALL_64. This allows the normal entry point to be unchanged
> and simplifies the UKL specific code as well.
>
> ukl_entry_SYSCALL_64 is similar to entry_SYSCALL_64 except that it has to
> populate %rcx with return address manually (syscall instruction does that
> automatically for normal application tasks). This allows the pt_regs to be
> correct.
> Also, we have to push the flags onto the user stack, because on
> the return path, we first switch to user stack, then pop the flags and then
> return. Popping the flags would restart interrupts, so we don't want to be
> stuck on kernel stack when an interrupt hits. All this can be done with an
> iret instruction, but call/iret pair performs way slower than a call/ret
> pair.
>
> Also, on the entry path, we make sure the context flag i.e., in_user is set
> to 1 to indicate we are now in kernel context so any new interrupts don't
> have to go through kernel entry code again. This is normally done with the
> CS value on stack, but in UKL case that will always be a kernel value. On
> the way back, the in_user is switched back to 2 to indicate that now
> application context is being entered. All non-UKL tasks have the in_user
> value set to 0.
>
> The UKL application uses a slightly different value for CS, instead of
> 0x33, we use 0xC3. As most of the tests compare only the least significant
> nibble, they behave as expected. The C value in the second nibble allows us
> to distinguish between user space and UKL application code.

My intuition would be to try this the other way around. Use an actual
honest CS (specifically __KERNEL_CS) for pt_regs->cs. Translate at the
user ABI boundary instead. After all, a UKL task is essentially just a
kernel thread that happens to have a pt_regs area.

>
> Rest of the code makes sure the above mentioned in_user context tracking is
> done for all entry and exit cases i.e., for interrupts, exceptions etc. If
> it's a UKL task, if in_user value is 2, we treat it as an application task,
> and if it is 1, we treat it as coming from kernel context. We skip these
> checks if in_user is 0.

By "context tracking" are you referring to RCU? Since a UKL task is
essentially a kernel thread, what "entry" is there other than setting up
pt_regs?
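
To make the CS discussion above concrete, here is a rough userspace sketch,
not code from the patch. The selector values 0x33 and 0xC3 come from the
quoted text, and 0x10 is the usual x86-64 __KERNEL_CS. The helper names
(cs_is_user, cs_is_ukl_app, cs_for_user_abi) and the exact translation rule
are assumptions made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Selector values; UKL_CS is the value given in the patch description. */
#define KERNEL_CS 0x10u /* usual x86-64 __KERNEL_CS */
#define USER_CS   0x33u /* usual x86-64 __USER_CS */
#define UKL_CS    0xC3u /* proposed UKL application selector */

/* The patch depends on UKL_CS having the same low (RPL) bits as USER_CS. */
static_assert((UKL_CS & 3u) == (USER_CS & 3u), "UKL_CS must look like ring 3");

/* Typical "did we come from user mode?" style test on a saved CS. */
static bool cs_is_user(uint16_t cs)
{
    return (cs & 3u) != 0;
}

/* Patch approach: the distinct selector marks UKL application code. */
static bool cs_is_ukl_app(uint16_t cs)
{
    return cs == UKL_CS;
}

/*
 * Hypothetical alternative: keep an honest KERNEL_CS in pt_regs->cs and
 * translate only when the value is exposed through a user-visible ABI.
 * Which ABIs need this, and what they should see, is an assumption here.
 */
static uint16_t cs_for_user_abi(uint16_t saved_cs, bool is_ukl_task)
{
    if (is_ukl_task && saved_cs == KERNEL_CS)
        return USER_CS;
    return saved_cs;
}
```

With the first approach every low-bits test keeps working but pt_regs->cs
holds a value that is not the live selector; with the second, pt_regs stays
honest and only the boundary code has to know about UKL.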
>
> swapgs_restore_regs_and_return_to_usermode changes also make sure that
> in_user is correct and then we iret back.
>
> Double fault handling is a special case. Normally, if a user stack suffers a
> page fault, hardware switches to a kernel stack and pushes a frame onto the
> kernel stack. This switch only happens if the execution was in user
> privilege level when the page fault occurred. For UKL, execution is always
> in kernel level, so when the user stack suffers a page fault, no switch to
> a pinned kernel stack happens, and hardware tries to push state on the
> already faulting user stack. This generates a double fault. So we handle
> this case in the double fault handler by assuming any double fault is
> actually a user stack page fault. This can also be fixed by making all page
> faults go through a pinned stack using the IST mechanism. We have tried and
> tested that, but in the interest of touching as little code as possible, we
> chose this option instead.

Eww. I guess this is a real problem, but eww.
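
For reference, the in_user tracking and the double fault shortcut described
in the quoted text can be modelled roughly as below. This is a userspace
sketch, not code from the patch. The values 0, 1 and 2 are taken from the
description; every identifier is hypothetical, and restricting the double
fault shortcut to tasks in application context is an assumption (the
description says any double fault is treated as a user stack page fault).

```c
#include <assert.h>
#include <stdbool.h>

/* in_user values as described in the patch text. */
enum ukl_in_user {
    UKL_NOT_UKL   = 0, /* non-UKL task: the checks are skipped */
    UKL_IN_KERNEL = 1, /* UKL task, kernel context */
    UKL_IN_APP    = 2, /* UKL task, application context */
};

static_assert(UKL_IN_KERNEL == 1 && UKL_IN_APP == 2, "values from the patch text");

/* Hypothetical outcome of the double fault handler's first check. */
enum df_action { DF_FATAL, DF_HANDLE_AS_USER_PF };

/*
 * Entry (syscall, interrupt, exception): returns true when this entry
 * crossed from application to kernel context and so needs the full
 * kernel-entry work. Nested entries and non-UKL tasks return false.
 */
static bool ukl_enter(enum ukl_in_user *state)
{
    if (*state != UKL_IN_APP)
        return false;
    *state = UKL_IN_KERNEL;
    return true;
}

/* Exit: only the entry that crossed the boundary flips the state back. */
static void ukl_exit(enum ukl_in_user *state, bool crossed)
{
    if (crossed)
        *state = UKL_IN_APP;
}

/* Assumed refinement: only application-context faults get the shortcut. */
static enum df_action df_classify(enum ukl_in_user state)
{
    return state == UKL_IN_APP ? DF_HANDLE_AS_USER_PF : DF_FATAL;
}
```

The model makes the cost visible: every entry and exit path has to keep a
per-task flag in sync, where normal tasks get the same answer from the
saved CS.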