Date: Thu, 2 Jul 2020 13:15:48 -0400
From: Dave Jones
To: Linux Kernel
Cc: peterz@infradead.org, mgorman@techsingularity.net, mingo@kernel.org,
	Linus Torvalds
Subject: weird loadavg on idle machine post 5.7
Message-ID: <20200702171548.GA11813@codemonkey.org.uk>

When I upgraded my firewall to 5.7-rc2, I noticed that on a mostly idle
machine (one that usually sees loadavg hover in the 0.xx range), loadavg
was consistently above 1.00 even when nothing was running. All that perf
showed was the kernel spending time in the idle loop (and running perf).

For the first hour or so after boot everything seems fine, but over time
loadavg creeps up, and once it has established a new baseline, it never
seems to drop below that again.

One morning I woke up to find loadavg at 7.xx after almost as many hours
of uptime, which makes me wonder whether this is triggered by something
in cron. I have a bunch of scripts that fire off every hour and involve
thousands of short-lived runs of iptables/ipset, but running them
manually didn't seem to trigger the bug.

Given that it took a few hours of runtime to confirm each good/bad,
bisecting this took the last two weeks. I did it four separate times,
the first producing bogus results from an over-eager 'good', but the
last two runs both implicated this commit:

commit c6e7bd7afaeb3af55ffac122828035f1c01d1d7b (refs/bisect/bad)
Author: Peter Zijlstra
Date:   Sun May 24 21:29:55 2020 +0100

    sched/core: Optimize ttwu() spinning on p->on_cpu

    Both Rik and Mel reported seeing ttwu() spend significant time on:

      smp_cond_load_acquire(&p->on_cpu, !VAL);

    Attempt to avoid this by queueing the wakeup on the CPU that owns the
    p->on_cpu value. This will then allow the ttwu() to complete without
    further waiting.

    Since we run schedule() with interrupts disabled, the IPI is
    guaranteed to happen after p->on_cpu is cleared, this is what makes it
    safe to queue early.

    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Mel Gorman
    Signed-off-by: Ingo Molnar
    Cc: Jirka Hladky
    Cc: Vincent Guittot
    Cc: valentin.schneider@arm.com
    Cc: Hillf Danton
    Cc: Rik van Riel
    Link: https://lore.kernel.org/r/20200524202956.27665-2-mgorman@techsingularity.net
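For those not staring at kernel/sched/core.c, the shape of the change is
roughly the following. This is a from-memory sketch of the 5.7-era wakeup
path, not a verbatim quote of the patch, so treat the names and details
(ttwu_queue_remote() in particular) as approximate:

    /*
     * Old behaviour in try_to_wake_up(): the waker spun here until
     * the remote CPU finished schedule() and cleared p->on_cpu.
     */
    smp_cond_load_acquire(&p->on_cpu, !VAL);

    /*
     * New behaviour: if the task is still being descheduled on a
     * remote CPU that doesn't share a cache with us, put it on that
     * CPU's wake list and kick it with an IPI instead of waiting;
     * the remote CPU then finishes the wakeup itself.
     */
    if (smp_load_acquire(&p->on_cpu) &&
        ttwu_queue_remote(p, cpu, wake_flags))
            goto unlock;

The practical difference is that the tail end of the wakeup (activating
the task and adjusting run-queue accounting) can now happen on a
different CPU, asynchronously to the waker.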
Unfortunately it doesn't revert cleanly on top of rc3, so I haven't
confirmed 100% that it's the cause yet, but the two separate bisects
seem promising. I don't see any obvious correlation between what's
changing there and the symptoms (other than "scheduler magic"), but
maybe those closer to this have ideas about what could be going awry?

	Dave
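P.S. For anyone wondering how an idle box can hold loadavg above 1.00 at
all: the kernel folds a sample of nr_running + nr_uninterruptible into an
exponentially decaying average every ~5 seconds (kernel/sched/loadavg.c).
A sketch of the core calculation; the fixed-point constants should be the
real ones, but the per-CPU folding bookkeeping is elided:

    #define FSHIFT   11                /* bits of fixed-point precision */
    #define FIXED_1  (1 << FSHIFT)     /* 1.0 in fixed-point */
    #define EXP_1    1884              /* 1/exp(5sec/1min) in fixed-point */

    /*
     * Called every LOAD_FREQ (~5 sec) with 'active' = sampled count of
     * runnable + uninterruptible tasks; the result feeds avenrun[0].
     */
    static unsigned long
    calc_load(unsigned long load, unsigned long exp, unsigned long active)
    {
            unsigned long newload;

            newload = load * exp + active * (FIXED_1 - exp);
            if (active >= load)
                    newload += FIXED_1 - 1;

            return newload / FIXED_1;
    }

The average only ever decays toward whatever 'active' count gets sampled,
so if a task were counted into nr_uninterruptible and the matching
decrement lost somewhere, the baseline would stay inflated forever, which
would at least be consistent with the "never drops below the new
baseline" behaviour here.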