Subject: Re: [RFC 1/8] sched: Add nice value change notifier
From: Tvrtko Ursulin
Organization: Intel Corporation UK Plc
To: Barry Song <21cnbao@gmail.com>
Cc: "Wanghui (John)", Intel-gfx@lists.freedesktop.org,
    dri-devel@lists.freedesktop.org, LKML, Tvrtko Ursulin, Ingo Molnar,
    Peter Zijlstra, Juri Lelli, Vincent Guittot
Date: Thu, 7 Oct 2021 09:50:27 +0100
Message-ID: <382a4bd5-bb74-5928-be67-afbdc7aa3663@linux.intel.com>
References: <20211004143650.699120-1-tvrtko.ursulin@linux.intel.com>
 <20211004143650.699120-2-tvrtko.ursulin@linux.intel.com>
 <562d45e1-4a27-3252-f615-3ab1ef531f2b@huawei.com>
 <8381e87d-ef7f-4759-569b-f6dabeb02939@linux.intel.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On 06/10/2021 21:21, Barry Song wrote:
> On Thu, Oct 7, 2021 at 2:44 AM Tvrtko Ursulin wrote:
>>
>> Hi,
>>
>> On 06/10/2021 08:58, Barry Song wrote:
>>> On Wed, Oct 6, 2021 at 5:15 PM Wanghui (John) wrote:
>>>>
>>>> Hi Tvrtko
>>>>
>>>> On 2021/10/4 22:36, Tvrtko Ursulin wrote:
>>>>>   void set_user_nice(struct task_struct *p, long nice)
>>>>>   {
>>>>>        bool queued, running;
>>>>> -      int old_prio;
>>>>> +      int old_prio, ret;
>>>>>        struct rq_flags rf;
>>>>>        struct rq *rq;
>>>>>
>>>>> @@ -6915,6 +6947,9 @@ void set_user_nice(struct task_struct *p, long nice)
>>>>>
>>>>>   out_unlock:
>>>>>        task_rq_unlock(rq, p, &rf);
>>>>> +
>>>>> +      ret = atomic_notifier_call_chain(&user_nice_notifier_list, nice, p);
>>>>> +      WARN_ON_ONCE(ret != NOTIFY_DONE);
>>>>>   }
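For context, the consumer side of this is just a standard atomic notifier:
a driver registers a notifier_block and receives the new nice value plus the
task, and remaps that onto its own priority scale. A rough sketch only; the
registration helper and the i915-side function below are illustrative names,
not the actual symbols from the series:

static int i915_user_nice_notify(struct notifier_block *nb,
                                 unsigned long nice, void *data)
{
        struct task_struct *p = data;

        /*
         * Remap nice (-20..19) onto the driver priority range and adjust
         * any already submitted work owned by p. Illustrative helper.
         */
        i915_adjust_context_priority(p, (long)nice);

        return NOTIFY_DONE;
}

static struct notifier_block i915_nice_nb = {
        .notifier_call = i915_user_nice_notify,
};

/* At driver init, using whatever registration helper the series exports: */
user_nice_register_notifier(&i915_nice_nb);

The task pointer is the important part there, since the consumer needs it to
find the GPU work belonging to that process.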
>>>> How about adding a new "io_nice" to task_struct, and moving the call
>>>> chain to sched_setattr/getattr? There are two benefits:
>>>
>>> We already have an ionice for the block io scheduler; hardly can this new
>>> io_nice be generic to all I/O. It seems the patchset is trying to link the
>>> process' nice with the GPU's scheduler. To some extent that makes more
>>> sense than having a common ionice, because we have a lot of I/O devices in
>>> the system and we don't know which I/O the ionice of task_struct should be
>>> applied to.
>>>
>>> Maybe we could have an ionice dedicated to the GPU, just like the ionice
>>> for CFQ of the bio/request scheduler.
>>
>> Thought crossed my mind but I couldn't see the practicality of a 3rd nice
>> concept. I mean, even to start with I struggle a bit with the usefulness of
>> the existing ionice vs nice, like coming up with practical examples of use
>> cases where it makes sense to decouple the two priorities.
>>
>> From a different angle I did think inheriting CPU nice makes sense for GPU
>> workloads. This is because today, and more so in the future, computations
>> on the same data set do flow from one to the other.
>>
>> Like maybe a simple example of batch image processing where the CPU
>> decodes, the GPU does a transform and then the CPU encodes. Or a different
>> mix, it doesn't really matter, since the main point is that it is one
>> computing pipeline from the user's point of view.
>>
>
> I am on it. But I am also seeing two problems here:
>
> 1. nice is not global in Linux. For example, if you have two cgroups and
> cgroup A has more quota than cgroup B, tasks in B won't win even if they
> have a lower nice. cgroups will run proportional-weight, time-based
> division of the CPU.
>
> 2. Historically we had dynamic nice, which was adjusted based on the
> average sleep/running time; right now we don't have dynamic nice, but
> virtual time still makes tasks which sleep more preempt other tasks with
> the same nice or even a lower nice:
>     virtual time += physical time / weight by nice
> so a static nice number doesn't always make sense for deciding preemption.
>
> So it seems your patch only works in some simple situations, for example
> no cgroups and tasks with similar sleep/running time.

Yes, I broadly agree with your assessment. Although there are plans for
adding cgroup support to i915 scheduling, I doubt control as fine-grained
and with the exact semantics of the CPU side will happen.

Mostly because the drive seems to be towards more micro-controller managed
scheduling, which adds further challenges in connecting the two sides
together.

But when you say it is a problem, I would characterize it more as a weakness
in terms of being only a subset of possible control. It is still richer
(better?) than what currently exists and, as demonstrated with the
benchmarks in my cover letter, it can deliver improvements in user
experience. If in the mid-term future we can extend it with cgroup support
then the concept should still apply and get closer to how you described nice
works in the CPU world.

Main question in my mind is whether the idea of adding the
sched_attr/priority notifier to the kernel can be justified. Because, as
mentioned before, everything apart from adjusting currently running GPU jobs
could be done purely in userspace. Stack changes would be quite extensive
and all, but that is not usually a good enough reason to put something in
the kernel. That's why it is an RFC, an invitation to discuss.

Even ionice inherits from nice (see task_nice_ioprio()), so I think an
argument can be made for drivers as well.

Regards,

Tvrtko

>> In this example perhaps everything could be handled in userspace so that's
>> another argument to be had. Userspace could query the current scheduling
>> attributes before submitting work to the processing pipeline and adjust
>> using the respective uapi.
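(To make the userspace-only option above concrete, it could be as little as
the sketch below, run once per context before the pipeline is submitted.
Untested and illustrative only: the nice to GPU priority remap is invented
here, while the context param ioctl is the existing i915 uapi.)

#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/resource.h>
#include <i915_drm.h>   /* via libdrm's include path */

/*
 * Fold the calling process' nice into a GPU context priority before the
 * pipeline is submitted.  The -20..19 to -1023..1023 mapping is made up.
 */
static int inherit_nice_into_gpu_ctx(int drm_fd, uint32_t ctx_id)
{
        int nice = getpriority(PRIO_PROCESS, 0); /* clear/check errno for real use */
        int64_t prio = -(int64_t)nice * 1023 / 20; /* lower nice => higher GPU prio */
        struct drm_i915_gem_context_param arg = {
                .ctx_id = ctx_id,
                .param  = I915_CONTEXT_PARAM_PRIORITY,
                .value  = (uint64_t)prio,
        };

        return ioctl(drm_fd, DRM_IOCTL_I915_GEM_CONTEXT_SETPARAM, &arg);
}

Versus the notifier approach this of course only captures the priority at
submission time, which is exactly the limitation discussed in the next
paragraph.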
>>
>> Downside would be the inability to react to changes after the work is
>> already running, which may not be too serious a limitation outside the
>> world of multi-minute compute workloads. And the latter are probably
>> special case enough that they would be configured explicitly.
>>
>>>>
>>>> 1. Decoupled from the fair scheduler. In our use case, high priority
>>>> tasks often use the rt scheduler.
>>>
>>> Is it possible to tell the GPU about RT as we are telling it about CFS
>>> nice?
>>
>> Yes of course. We could create a common notification "data packet" which
>> would be sent from both entry points and provide more data than just the
>> nice value. Consumers (of the notifier chain) could then decide for
>> themselves what they want to do with the data.
>
> RT should have the same problem as CFS once we have cgroups.
>
>>
>> Regards,
>>
>> Tvrtko
>>
>>>
>>>> 2. The range of values doesn't need to be bound to -20~19 or 0~139.
>>>>
>>>
>>> We could build a mapping between the priorities of the process and the
>>> GPU. It seems not a big deal.
>>>
>>> Thanks
>>> barry
>>>
>
> Thanks
> barry
>
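P.S. For completeness, the common "data packet" mentioned above could be as
minimal as something like the below (a sketch only, name and fields wide
open), emitted from both set_user_nice() and the
sched_setattr()/sched_setscheduler() path, so that a single driver callback
can cover both fair and RT tasks:

struct user_prio_notify_data {
        struct task_struct *task;
        unsigned int policy;    /* SCHED_NORMAL, SCHED_FIFO, SCHED_RR, ... */
        int nice;               /* meaningful for the fair class */
        int rt_priority;        /* meaningful for the realtime classes */
};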