Date: Wed, 24 Mar 2021 15:52:24 +0000
From: Mel Gorman
To: Peter Zijlstra
Cc: Josh Don, Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
    Steven Rostedt, Ben Segall, Daniel Bristot de Oliveira, Luis Chamberlain,
    Kees Cook, Iurii Zaikin, linux-kernel@vger.kernel.org,
    linux-fsdevel@vger.kernel.org, David Rientjes, Oleg Rombakh,
    linux-doc@vger.kernel.org, Paul Turner
Subject: Re: [PATCH v2] sched: Warn on long periods of pending need_resched
Message-ID: <20210324155224.GR15768@suse.de>
References: <20210323035706.572953-1-joshdon@google.com>
    <20210324114224.GP15768@suse.de> <20210324133916.GQ15768@suse.de>

On Wed, Mar 24, 2021 at 03:36:14PM +0100, Peter Zijlstra wrote:
> On Wed, Mar 24, 2021 at 01:39:16PM +0000, Mel Gorman wrote:
>
> > > Yeah, lets say I was pleasantly surprised to find it there :-)
> > >
> >
> > Minimally, lets move that out before it gets kicked out. Patch below.
>
> OK, stuck that in front.
>

Thanks.

> > > > Moving something like sched_min_granularity_ns will break a number of
> > > > tuning guides as well as the "tuned" tool which ships by default with
> > > > some distros and I believe some of the default profiles used for tuned
> > > > tweak kernel.sched_min_granularity_ns
> > >
> > > Yeah, can't say I care. I suppose some people with PREEMPT=n kernels
> > > increase that to make their server workloads 'go fast'. But I'll
> > > absolutely suck rock on anything desktop.
> > >
> >
> > Broadly speaking yes and despite the lack of documentation, enough people
> > think of that parameter when tuning for throughput vs latency depending on
> > the expected use of the machine. kernel.sched_wakeup_granularity_ns might
> > get tuned if preemption is causing overscheduling. Same potentially with
> > kernel.sched_min_granularity_ns and kernel.sched_latency_ns. That said, I'm
> > struggling to think of an instance where I've seen tuning recommendations
> > properly quantified other than the impact on microbenchmarks but I
> > think there will be complaining if they disappear. I suspect that some
> > recommended tuning is based on "I tried a number of different values and
> > this seemed to work reasonably well".
>
> Right, except that due to that scaling thing, you'd have to re-evaluate
> when you change machine.
>

Yes although in practice I've rarely seen that happen. What I have seen
is tuning parameters being copied across machines or kernel versions that
turned out to be the source of the "regression" because something changed
in the scheduler that invalidated the tuning.

> Also, do you have any inclination on the perf difference we're talking
> about? (I should probably ask Google and not you...)
>

I don't have good data on hand and I don't trust Google for performance
data.
However, I know for certain that there are "Enterprise Applications" whose
tuning relies on modifying kernel.sched_min_granularity_ns and
kernel.sched_wakeup_granularity_ns at the very least (might be others,
I'd have to check). The issue was severe enough to fail acceptance testing
for OS upgrades and it generated bugs. I did not see the raw data but even
if I had, it would have been based on a battery of tests across multiple
platforms and generations so at best I would have a vague range.

For the vendors in question, it is unlikely they would release detailed
information because it can be seen as commercially sensitive. I don't
really agree that this is useful behaviour but it is the reality so don't
shoot the messenger :(

The last I checked, hackbench figures could be changed in the 10-15% range
in either direction depending on group counts but in itself, that is not
useful.

> > kernel.sched_schedstats probably should not depend in SCHED_DEBUG because
> > it has value for workload analysis which is not necessarily about debugging
> > per-se. It might simply be informing whether another variable should be
> > tuned or useful for debugging applications rather than the kernel.
>
> Dubious, if you're that far down the rabit hole, you're dang near
> debugging.
>

Yes, but not necessarily the kernel. For example, the workload analysis
might be to see if the maximum number of threads in a worker pool should
be tuned (either up or down).

> > As an aside, I wonder how often SCHED_DEBUG has been enabled simply
> > because LATENCYTOP selects it -- no idea offhand why LATENCYTOP even
> > needs SCHED_DEBUG.
>
> Perhaps schedstats used to rely on debug? I can't remember. I don't
> think I've used latencytop in at least 10 years. ftrace and perf sorta
> killed the need for it.
>

I don't think schedstats used to rely on SCHED_DEBUG. LATENCYTOP appears
to build even if SCHED_DEBUG is disabled so it was either an accident or
it's no longer necessary.
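
As a side note on the scaling point raised earlier in this mail: tuned
values need re-evaluating on a different machine because the defaults
themselves are scaled by the online CPU count. The following is only a
sketch of that arithmetic, based on a reading of get_update_sysctl_factor()
in kernel/sched/fair.c from roughly this era. The Python helper names
(scaling_factor, effective_defaults) are made up for illustration, and the
normalized values are assumptions that should be checked against the tree
in question.

```python
# Sketch only: mirrors a reading of get_update_sysctl_factor() in
# kernel/sched/fair.c (CFS, circa v5.11). Helper names are hypothetical.

# Assumed normalized (single-CPU) defaults in nanoseconds.
NORMALIZED_NS = {
    "sched_latency_ns": 6_000_000,
    "sched_min_granularity_ns": 750_000,
    "sched_wakeup_granularity_ns": 1_000_000,
}


def scaling_factor(online_cpus: int, mode: str = "log") -> int:
    """Return the multiplier applied to the normalized defaults."""
    if online_cpus < 1:
        raise ValueError("need at least one online CPU")
    cpus = min(online_cpus, 8)  # the CPU count is capped at 8
    if mode == "none":
        return 1
    if mode == "linear":
        return cpus
    # Default "log" mode: 1 + ilog2(cpus); ilog2(n) == n.bit_length() - 1
    return 1 + (cpus.bit_length() - 1)


def effective_defaults(online_cpus: int, mode: str = "log") -> dict:
    """Scale every normalized default by the factor for this machine."""
    factor = scaling_factor(online_cpus, mode)
    return {name: value * factor for name, value in NORMALIZED_NS.items()}


if __name__ == "__main__":
    for cpus in (1, 2, 4, 8, 64):
        print(cpus, effective_defaults(cpus))
```

Under those assumptions, a machine with 8 or more online CPUs ends up with
four times the normalized values, so a number copied from a smaller
machine, or from a kernel with different defaults, no longer means the
same thing.
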

> > > These knobs really shouldn't have been as widely available as they are.
> > >
> >
> > Probably not. Worse, some of the tuning is probably based on "this worked
> > for workload X 10 years ago so I'll just keep doing that"
>
> That sounds like an excellent reason to disrupt ;-)
>

The same logic applies for all tuning unfortunately :P

> > > > Whether there are legimiate reasons to modify those values or not,
> > > > removing them may generate fun bug reports.
> > >
> > > Which I'll close with -EDONTCARE, userspace has to cope with
> > > SCHED_DEBUG=n in any case.
> >
> > True but removing the throughput vs latency parameters is likely to
> > generate a lot of noise even if the reasons for tuning are bad ones.
> > Some definitely should not be depending on SCHED_DEBUG, others may
> > need to be moved to debugfs one patch at a time so they can be reverted
> > individually if complaining is excessive and there is a legiminate reason
> > why it should be tuned. It's possible that complaining will be based on
> > a workload regression that really depended on tuned changing parameters.
>
> The way I've done it, you can simply re-instate the systl table entry
> and it'll work again, except for the entries that had a custom handler.
>

True.

> I'm ready to disrupt :-)

I'm not going to NAK because I do not have hard data that shows they must
exist. However, I won't ACK either because I bet a lot of tasty beverages
the next time we meet that the following parameters will generate reports
if removed.

kernel.sched_latency_ns
kernel.sched_migration_cost_ns
kernel.sched_min_granularity_ns
kernel.sched_wakeup_granularity_ns

I know they are altered by tuned for different profiles and some people do
go to the effort to create custom profiles for specific applications. They
also show up in "Official Benchmarking" such as SPEC CPU 2017 and some
vendors put a *lot* of effort into SPEC CPU results for bragging rights.
They show up in technical books and best practice guides for applications.
Finally they show up in Google when searching for "tuning sched_foo". I'm
not saying that any of these are even accurate or a good idea, just that
they show up near the top of the results and they are sufficiently popular
that they might as well be an ABI.

kernel.sched_latency_ns
https://www.scylladb.com/2016/06/10/read-latency-and-scylla-jmx-process/
https://github.com/tikv/tikv/issues/2473

kernel.sched_migration_cost_ns
https://developer.ibm.com/technologies/systems/tutorials/postgresql-experiences-tuning-recomendations-linux-on-ibm-z/
https://hunleyd.github.io/posts/tuned-PG-and-you/
https://www.postgresql.org/message-id/50E4AAB1.9040902@optionshouse.com

kernel.sched_min_granularity_ns
https://community.mellanox.com/s/article/rivermax-linux-performance-tuning-guide--1-x

kernel.sched_wakeup_granularity_ns
https://www.droidviews.com/boost-performance-on-android-kernels-task-scheduler-part-1/

-- 
Mel Gorman
SUSE Labs