Date: Tue, 24 Oct 2023 10:34:26 -0400
From: Steven Rostedt
To: Thomas Gleixner
Cc: Peter Zijlstra, Ankur Arora, Linus Torvalds,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org, x86@kernel.org,
    akpm@linux-foundation.org, luto@kernel.org, bp@alien8.de,
    dave.hansen@linux.intel.com, hpa@zytor.com, mingo@redhat.com,
    juri.lelli@redhat.com, vincent.guittot@linaro.org,
    willy@infradead.org, mgorman@suse.de, jon.grimm@amd.com,
    bharata@amd.com, raghavendra.kt@amd.com,
    boris.ostrovsky@oracle.com, konrad.wilk@oracle.com,
    jgross@suse.com, andrew.cooper3@citrix.com, Joel Fernandes,
    Youssef Esmat, Vineeth Pillai, Suleiman Souhlal
Subject: Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED
Message-ID: <20231024103426.4074d319@gandalf.local.home>
In-Reply-To: <87cyyfxd4k.ffs@tglx>

On Tue, 19 Sep 2023 01:42:03 +0200
Thomas Gleixner wrote:

>    2) When the scheduler wants to set NEED_RESCHED due it sets
>       NEED_RESCHED_LAZY instead which is only evaluated in the return to
>       user space preemption points.
>
>       As NEED_RESCHED_LAZY is not folded into the preemption count the
>       preemption count won't become zero, so the task can continue until
>       it hits return to user space.
>
>       That preserves the existing behaviour.

I'm looking into extending this concept to user space and to VMs.

I'm calling this the "extended scheduler time slice" (ESTS, pronounced
"estis").

The idea is this. Have VMs/user space share a memory region with the
kernel that is per thread/vCPU.
This would be registered via a syscall or an ioctl on some defined file
or whatever. Then, when entering user space / VM, if NEED_RESCHED_LAZY
(or whatever it's eventually called) is set, the kernel checks whether
the thread has this memory region and whether a special bit in it is
set. If so, it does not schedule. It will treat it like a long kernel
system call.

The kernel will then set another bit in the shared memory region that
will tell user space / VM that the kernel wanted to schedule, but is
allowing it to finish its critical section. When user space / VM is done
with the critical section, it will check the bit that may be set by the
kernel, and if it is set, it should do a sched_yield() or VMEXIT so that
the kernel can now schedule it.

What about DoS, you say? It's no different than running a long system
call. No task can run forever. It's not a "preempt disable", it's just
"give me some more time". A "NEED_RESCHED" will always schedule, just
like it does for a kernel system call that takes a long time. The goal
is to allow user space to get out of critical sections that we know can
cause problems if they get preempted. Usually it's that a user space /
VM lock is held, or maybe a VM interrupt handler needs to wake up a task
on another vCPU.

If we are worried about abuse, we could even punish tasks that don't
call sched_yield() by the time their extended time slice is used up.
Even without that punishment, if we have EEVDF, this extension will make
the task less eligible the next time around.

The goal is to prevent a thread / vCPU from being preempted while
holding a lock or resource that other threads / vCPUs will want. That
is, prevent contention, as that's usually the biggest issue with
performance in user space and VMs.

I'm going to work on a POC, and see if I can get some benchmarks on how
much this could help tasks like databases and VMs in general.

-- Steve
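
The handshake described above can be modeled in plain user-space C. This
is only a sketch under stated assumptions, not kernel code and not the
planned POC: struct ests_region, the bit names ESTS_IN_CRITICAL and
ESTS_RESCHED_WANTED, enum resched_kind, and all three functions are
hypothetical names, since the mail defines no ABI. The "kernel side" is
an ordinary function standing in for the check made when returning to
user space / entering the VM.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bit layout; the mail does not define one. */
#define ESTS_IN_CRITICAL    (1u << 0)  /* set by user space / VM */
#define ESTS_RESCHED_WANTED (1u << 1)  /* set by the kernel */

/* Per thread / vCPU region shared with the kernel (assumed shape). */
struct ests_region {
	_Atomic unsigned int flags;
};

/* Model of the pending reschedule request seen at return to user. */
enum resched_kind { RESCHED_NONE, RESCHED_LAZY, RESCHED_NOW };

/*
 * Kernel-side model. Returns true if the task should be scheduled out.
 * A lazy request is deferred only while the critical bit is set; a
 * full NEED_RESCHED always schedules.
 */
static bool ests_should_schedule(struct ests_region *r,
				 enum resched_kind kind)
{
	if (kind == RESCHED_NONE)
		return false;
	if (kind == RESCHED_NOW)
		return true;
	if (r != NULL && (atomic_load(&r->flags) & ESTS_IN_CRITICAL)) {
		/* Tell user space we wanted to schedule, then let it run. */
		atomic_fetch_or(&r->flags, ESTS_RESCHED_WANTED);
		return false;
	}
	return true;
}

/* User side: mark the start of a critical section. */
static void ests_enter(struct ests_region *r)
{
	atomic_fetch_or(&r->flags, ESTS_IN_CRITICAL);
}

/*
 * User side: leave the critical section. Clears both bits in one
 * atomic step and returns true if the kernel asked for the CPU, in
 * which case the caller should sched_yield() (or VMEXIT).
 */
static bool ests_exit(struct ests_region *r)
{
	unsigned int old = atomic_exchange(&r->flags, 0);

	return (old & ESTS_RESCHED_WANTED) != 0;
}
```

A lock holder would call ests_enter() before taking the lock and, after
dropping it, call sched_yield() if ests_exit() returns true. Clearing
both bits with a single atomic exchange avoids losing a request that
the kernel side sets just as the critical section ends.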