From: Odin Ugedal <odin@uged.al>
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
    Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
    Daniel Bristot de Oliveira
Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Odin Ugedal
Subject: [PATCH 1/1] sched/fair: Fix unfairness caused by missing load decay
Date: Sun, 25 Apr 2021 10:09:02 +0200
Message-Id: <20210425080902.11854-2-odin@uged.al>
In-Reply-To: <20210425080902.11854-1-odin@uged.al>
References: <20210425080902.11854-1-odin@uged.al>

This fixes an issue where old load on a cfs_rq is not properly decayed,
resulting in strange behavior where fairness can decrease drastically.
Real workloads with equally weighted control groups have ended up
getting 99% and 1%(!!) of cpu time respectively.

When an idle task is attached to a cfs_rq by attaching a pid to a
cgroup, the old load of the task is attached to the new cfs_rq and
sched_entity by attach_entity_cfs_rq(). If the task is then moved to
another cpu (and therefore cfs_rq) before being enqueued/woken up, the
load will be moved from the sched_entity to cfs_rq->removed. Such a
move happens when a cpuset is enforced on the task (e.g. via a cgroup)
that forces it to move. The load is, however, not removed from the
task_group itself, making it look like there is a constant load on that
cfs_rq. This causes the vruntime of tasks on other sibling cfs_rqs to
increase faster than they are supposed to, causing severe fairness
issues. If no other task is started on the given cfs_rq (which, due to
the cpuset, will not happen), this load is never properly decayed away.

With this patch, the load is properly removed inside
update_blocked_averages(). The same applies to tasks moved to the fair
scheduling class and then moved to another cpu; this patch fixes that
path as well. For fork, the entity is queued right away, so this
problem does not affect fork.

For a simple cgroup hierarchy (as seen below) with two equally weighted
groups that in theory should each get 50% of cpu time, this often leads
to a 60/40 or 70/30 split instead.

parent/
  cg-1/
    cpu.weight: 100
    cpuset.cpus: 1
  cg-2/
    cpu.weight: 100
    cpuset.cpus: 1
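For reference, here is a minimal sketch (not part of the patch) of how
the hierarchy above can be built programmatically via the cgroup v2
filesystem. The mount point /sys/fs/cgroup and the "parent" group name
are assumptions, and it must run as root:

/*
 * Sketch: build the two-group hierarchy from the changelog on
 * cgroup v2. Error handling is kept minimal.
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>

static void write_file(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f || fputs(val, f) == EOF) {
		perror(path);
		exit(1);
	}
	fclose(f);
}

int main(void)
{
	const char *groups[] = { "cg-1", "cg-2" };
	char path[256];
	int i;

	mkdir("/sys/fs/cgroup/parent", 0755);
	/* Let the child groups use the cpu and cpuset controllers. */
	write_file("/sys/fs/cgroup/cgroup.subtree_control", "+cpu +cpuset");
	write_file("/sys/fs/cgroup/parent/cgroup.subtree_control", "+cpu +cpuset");

	for (i = 0; i < 2; i++) {
		snprintf(path, sizeof(path), "/sys/fs/cgroup/parent/%s", groups[i]);
		mkdir(path, 0755);
		snprintf(path, sizeof(path), "/sys/fs/cgroup/parent/%s/cpu.weight", groups[i]);
		write_file(path, "100");
		snprintf(path, sizeof(path), "/sys/fs/cgroup/parent/%s/cpuset.cpus", groups[i]);
		write_file(path, "1");
	}
	return 0;
}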
If the hierarchy is deeper (as seen below), while keeping cg-1 and cg-2
equally weighted, they should still get a 50/50 balance of cpu time.
This, however, sometimes results in a balance of 10/90 or 1/99(!!)
between the task groups:

$ ps u -C stress
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root     18568  1.1  0.0   3684   100 pts/12   R+   13:36   0:00 stress --cpu 1
root     18580 99.3  0.0   3684   100 pts/12   R+   13:36   0:09 stress --cpu 1

parent/
  cg-1/
    cpu.weight: 100
    sub-group/
      cpu.weight: 1
      cpuset.cpus: 1
  cg-2/
    cpu.weight: 100
    sub-group/
      cpu.weight: 10000
      cpuset.cpus: 1

This can be reproduced by attaching an idle process to a cgroup and
moving it to a given cpuset before it wakes up. The issue is evident in
many (if not most) container runtimes, and has been reproduced with
both crun and runc (and therefore docker and all its "derivatives"),
and with both cgroup v1 and v2.

Fixes: 3d30544f0212 ("sched/fair: Apply more PELT fixes")
Signed-off-by: Odin Ugedal <odin@uged.al>
---
 kernel/sched/fair.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 794c2cb945f8..ad7556f99b4a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10916,6 +10916,19 @@ static void attach_task_cfs_rq(struct task_struct *p)
 
 	if (!vruntime_normalized(p))
 		se->vruntime += cfs_rq->min_vruntime;
+
+	/*
+	 * Make sure the attached load will decay properly
+	 * in case the task is moved to another cpu before
+	 * being queued.
+	 */
+	if (!task_on_rq_queued(p)) {
+		for_each_sched_entity(se) {
+			if (se->on_rq)
+				break;
+			list_add_leaf_cfs_rq(cfs_rq_of(se));
+		}
+	}
 }
 
 static void switched_from_fair(struct rq *rq, struct task_struct *p)
-- 
2.31.1
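Postscript (not part of the patch): a hedged sketch of the reproduction
described in the changelog. A child is forked and blocks, it is
attached to the cpuset-constrained group while dequeued, and is then
woken so the cpuset forces a migration, leaving the stale load behind
on the old cfs_rq. The cgroup path follows the example hierarchy above
and is an assumption.

/* Sketch of the reproduction: attach a sleeping task, then wake it. */
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

static void burn(int sig)
{
	(void)sig;
	for (;;)		/* spin, like "stress --cpu 1" */
		;
}

int main(void)
{
	char buf[64];
	FILE *f;
	pid_t pid = fork();

	if (pid == 0) {
		signal(SIGUSR1, burn);
		pause();	/* stay dequeued until woken */
		return 0;
	}

	sleep(1);		/* give the child time to block */

	/* Attach the sleeping child; its old load is attached here. */
	f = fopen("/sys/fs/cgroup/parent/cg-1/cgroup.procs", "w");
	if (!f) {
		perror("cgroup.procs");
		return 1;
	}
	snprintf(buf, sizeof(buf), "%d\n", pid);
	fputs(buf, f);
	fclose(f);

	/* Wake it up: the cpuset now forces a migration to cpu 1. */
	kill(pid, SIGUSR1);
	return 0;
}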