Date: Thu, 2 Jul 2020 22:36:27 +0100
From: Mel Gorman <mgorman@techsingularity.net>
To: Dave Jones, Linux Kernel, peterz@infradead.org, mingo@kernel.org, Linus Torvalds
Subject: Re: weird loadavg on idle machine post 5.7
Message-ID: <20200702213627.GF3183@techsingularity.net>
References: <20200702171548.GA11813@codemonkey.org.uk>
In-Reply-To: <20200702171548.GA11813@codemonkey.org.uk>

On Thu, Jul 02, 2020 at 01:15:48PM -0400, Dave Jones wrote:
> When I upgraded my firewall to 5.7-rc2 I noticed that on a mostly
> idle machine (that usually sees loadavg hover in the 0.xx range)
> it was consistently above 1.00 even when there was nothing running.
> All that perf showed was the kernel spending time in the idle loop
> (and running perf).
>
> For the first hour or so after boot, everything seems fine, but over
> time loadavg creeps up, and once it has established a new baseline, it
> never seems to drop below that again.
>
> One morning I woke up to find loadavg at '7.xx', after almost as many
> hours of uptime, which makes me wonder if perhaps this is triggered
> by something in cron. I have a bunch of scripts that fire off
> every hour and involve thousands of short-lived runs of iptables/ipset,
> but running them manually didn't seem to trigger the bug.
>
> Given it took a few hours of runtime to confirm good/bad, bisecting this
> took the last two weeks. I did it four different times, the first
> producing bogus results from an over-eager 'good', but the last two runs
> both implicated this commit:
>
> commit c6e7bd7afaeb3af55ffac122828035f1c01d1d7b (refs/bisect/bad)
> Author: Peter Zijlstra
> Date:   Sun May 24 21:29:55 2020 +0100
>
>     sched/core: Optimize ttwu() spinning on p->on_cpu
>
>     Both Rik and Mel reported seeing ttwu() spend significant time on:
>
>       smp_cond_load_acquire(&p->on_cpu, !VAL);
>
>     Attempt to avoid this by queueing the wakeup on the CPU that owns the
>     p->on_cpu value. This will then allow the ttwu() to complete without
>     further waiting.
>
>     Since we run schedule() with interrupts disabled, the IPI is
>     guaranteed to happen after p->on_cpu is cleared, this is what makes it
>     safe to queue early.
>
>     Signed-off-by: Peter Zijlstra (Intel)
>     Signed-off-by: Mel Gorman
>     Signed-off-by: Ingo Molnar

Peter, I'm not supremely confident about this, but could it be because
"p->sched_contributes_to_load = !!task_contributes_to_load(p)" potentially
happens while a task is still being dequeued? In the final stages of a
task switch we have

	prev_state = prev->state;
	vtime_task_switch(prev);
	perf_event_task_sched_in(prev, current);
	finish_task(prev);

finish_task() is where p->on_cpu is cleared, after the state is updated.
With the patch, we potentially update sched_contributes_to_load while
p->state is transient, so if the check below is true and
ttwu_queue_wakelist() is used, then sched_contributes_to_load was based
on a transient value and is potentially wrong:

	if (smp_load_acquire(&p->on_cpu) &&
	    ttwu_queue_wakelist(p, task_cpu(p), wake_flags | WF_ON_CPU))
		goto unlock;

sched_contributes_to_load determines the value of rq->nr_uninterruptible,
which feeds into the load value, so it's at least a partial fit. The race
would not happen very often on a normal desktop, which would explain why
it takes a long time for the value to get screwed up, so that appears to
fit as well.

I'm thinking that the !!task_contributes_to_load(p) should happen after
smp_cond_load_acquire() instead, when on_cpu is stable and pi_lock is
held to stabilise p->state against a parallel wakeup or an update of the
task's rq. I do not see any hazards with respect to smp_rmb() and the
value of p->state in this particular path, but I've confused myself
enough in the various scheduler and wakeup paths that I don't want to
bet money on it late in the evening.

The patch below builds but is not booted; it's for discussion, but maybe
Dave is feeling brave!
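As an aside, the shape of the symptom matches this kind of accounting
error: loadavg is an exponentially decaying average of
nr_running + nr_uninterruptible, so if the decrement of
rq->nr_uninterruptible at wakeup is skipped for a task that was counted
when it went to sleep, the counter is left permanently one too high and
the baseline never comes back down. The following is only a user-space
sketch of that arithmetic, not kernel code; the constants and calc_load()
are written from memory of include/linux/sched/loadavg.h and
kernel/sched/loadavg.c, and the leak is modelled as nr_uninterruptible
stuck at 1 on an otherwise idle machine:

/*
 * Not kernel code: a user-space sketch of the avenrun sampling showing
 * why one missed nr_uninterruptible decrement never decays out of the
 * reported load average.
 */
#include <stdio.h>

#define FSHIFT		11			/* bits of fixed-point precision */
#define FIXED_1		(1 << FSHIFT)		/* 1.0 in fixed-point */
#define EXP_1		1884			/* 1/exp(5sec/1min) in fixed-point */

#define LOAD_INT(x)	((x) >> FSHIFT)
#define LOAD_FRAC(x)	LOAD_INT(((x) & (FIXED_1 - 1)) * 100)

static unsigned long calc_load(unsigned long load, unsigned long exp,
			       unsigned long active)
{
	unsigned long newload = load * exp + active * (FIXED_1 - exp);

	if (active >= load)
		newload += FIXED_1 - 1;
	return newload / FIXED_1;
}

int main(void)
{
	unsigned long avenrun_1min = 0;
	long nr_running = 0;		/* nothing runnable */
	long nr_uninterruptible = 1;	/* the skipped decrement */
	int i;

	/* one sample per LOAD_FREQ (~5s); simulate roughly ten minutes */
	for (i = 0; i < 120; i++) {
		long active = nr_running + nr_uninterruptible;

		active = active > 0 ? active * FIXED_1 : 0;
		avenrun_1min = calc_load(avenrun_1min, EXP_1, active);
	}

	printf("1-minute loadavg on an idle machine: %lu.%02lu\n",
	       LOAD_INT(avenrun_1min), LOAD_FRAC(avenrun_1min));
	return 0;
}

Run for ten simulated minutes of samples, the 1-minute figure settles at
1.00 with nothing runnable, which is roughly what Dave reported.

Anyway, the patch itself: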
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ca5db40392d4..52c73598b18a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2592,9 +2592,6 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	}
 
 #ifdef CONFIG_SMP
-	p->sched_contributes_to_load = !!task_contributes_to_load(p);
-	p->state = TASK_WAKING;
-
 	/*
 	 * Ensure we load p->on_cpu _after_ p->on_rq, otherwise it would be
 	 * possible to, falsely, observe p->on_cpu == 0.
@@ -2650,6 +2647,13 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	 */
 	smp_cond_load_acquire(&p->on_cpu, !VAL);
 
+	/*
+	 * p is off the cpu and pi_lock is held so p->state is stable
+	 * for calculating whether it contributes to load.
+	 */
+	p->sched_contributes_to_load = !!task_contributes_to_load(p);
+	p->state = TASK_WAKING;
+
 	cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
 	if (task_cpu(p) != cpu) {
 		wake_flags |= WF_MIGRATED;

-- 
Mel Gorman
SUSE Labs