Date: Mon, 16 Nov 2020 13:11:03 +0000
From: Will Deacon
To: Mel Gorman
Cc: Peter Zijlstra, Davidlohr Bueso, linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org
Subject: Re: Loadavg accounting error on arm64
Message-ID: <20201116131102.GA29992@willie-the-truck>
References: <20201116091054.GL3371@techsingularity.net>
In-Reply-To: <20201116091054.GL3371@techsingularity.net>

On Mon, Nov 16, 2020 at 09:10:54AM +0000, Mel Gorman wrote:
> I got cc'd an internal bug report filed against 5.8 and 5.9 kernels:
> loadavg was "exploding" on arm64 on machines acting as build servers.
> It happened on at least two different arm64 variants. That setup is
> complex to replicate but fortunately the problem can be reproduced by
> running hackbench-process-pipes while heavily overcommitting a machine
> with 96 logical CPUs and then checking if loadavg drops afterwards.
> With an MMTests clone, I reproduced it as follows
>
> ./run-mmtests.sh --config configs/config-workload-hackbench-process-pipes --no-monitor testrun; \
> for i in `seq 1 60`; do cat /proc/loadavg; sleep 60; done
>
> Load should drop to 10 after about 10 minutes and it does on x86-64 but
> remained at around 200+ on arm64.

Do you think you could use this to bisect the problem? Also, are you able
to reproduce the issue on any other arm64 machines, or just this one?

> The reproduction case simply hammers the case where a task can be
> descheduling while also being woken by another task at the same time.
> It takes a long time to run but it makes the problem very obvious. The
> expectation is that loadavg drops back down after hackbench has been
> running and saturating the machine for a long time.
>
> Commit dbfb089d360b ("sched: Fix loadavg accounting race") fixed a
> loadavg accounting race in the generic case. Later, it was documented
> why the ordering of when p->sched_contributes_to_load is read/updated
> relative to p->on_cpu matters. This is critical when a task is
> descheduling at the same time it is being activated on another CPU.
> While the loads/stores happen under the RQ lock, the RQ lock on its own
> does not give any guarantees on the task state.
>
> Over the weekend I convinced myself that it must be because the
> implementations of smp_load_acquire and smp_store_release do not appear
> to provide acquire/release semantics, because I didn't find anything
> arm64-specific that was playing with p->state behind the scheduler's
> back (I could have missed it if it was in an assembly portion as I
> can't reliably read arm assembler). Similarly, it's not clear why the
> arm64 implementation does not call smp_acquire__after_ctrl_dep in the
> smp_load_acquire implementation. Even when it was introduced, the arm64
> implementation differed significantly from the arm implementation in
> terms of what barriers it used, for non-obvious reasons.

Why would you expect smp_acquire__after_ctrl_dep() to be called as part
of the smp_load_acquire() implementation?

FWIW, arm64 has special instructions for acquire and release (and they
actually provide more order than is strictly needed by Linux), so we
just map acquire/release to those instructions directly. Since these
instructions are not available on most 32-bit cores, the arm
implementation just uses the fence-based implementation.

Anyway, setting all that aside, I do agree with you that the bitfield
usage in task_struct looks pretty suspicious. For example, in
__schedule() we have:

	rq_lock(rq, &rf);
	smp_mb__after_spinlock();
	...
	prev_state = prev->state;
	if (!preempt && prev_state) {
		if (signal_pending_state(prev_state, prev)) {
			prev->state = TASK_RUNNING;
		} else {
			prev->sched_contributes_to_load =
				(prev_state & TASK_UNINTERRUPTIBLE) &&
				!(prev_state & TASK_NOLOAD) &&
				!(prev->flags & PF_FROZEN);
			...
			deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);

where deactivate_task() updates p->on_rq directly:

	p->on_rq = (flags & DEQUEUE_SLEEP) ? 0 : TASK_ON_RQ_MIGRATING;

so this is _not_ ordered wrt sched_contributes_to_load.
But then over in __ttwu_queue_wakelist() we have:

	p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED);

which can be invoked on the try_to_wake_up() path if p->on_rq is first
read as zero and then p->on_cpu is read as 1. Perhaps these non-atomic
bitfield updates can race and cause the flags to be corrupted?

Then again, I went through the list of observed KCSAN splats and don't
see this race showing up in there, so perhaps it's serialised by
something I haven't spotted.

Will