Date: Thu, 5 Jul 2018 14:27:26 +0100
From: Matt Fleming
To: Valentin Schneider
Cc: Peter Zijlstra, linux-kernel@vger.kernel.org, Ingo Molnar, Mike Galbraith
Subject: Re: [PATCH] sched/fair: Avoid divide by zero when rebalancing domains
Message-ID: <20180705132726.GB3864@codeblueprint.co.uk>
References: <20180704142455.16035-1-matt@codeblueprint.co.uk> <55afee27-4143-e08c-b254-0d68a05d5ee6@arm.com>
In-Reply-To: <55afee27-4143-e08c-b254-0d68a05d5ee6@arm.com>
User-Agent: Mutt/1.5.24 (2015-08-30)
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, 05 Jul, at 11:10:42AM, Valentin Schneider wrote:
> Hi,
>
> On 04/07/18 15:24, Matt Fleming wrote:
> > It's possible that the CPU doing nohz idle balance hasn't had its own
> > load updated for many seconds. This can lead to huge deltas between
> > rq->avg_stamp and rq->clock when rebalancing, and has been seen to
> > cause the following crash:
> >
> >  divide error: 0000 [#1] SMP
> >  Call Trace:
> >   [] update_sd_lb_stats+0xe8/0x560
> >   [] find_busiest_group+0x2d/0x4b0
> >   [] load_balance+0x170/0x950
> >   [] rebalance_domains+0x13f/0x290
> >   [] __do_softirq+0xec/0x300
> >   [] irq_exit+0xfa/0x110
> >   [] reschedule_interrupt+0xc9/0xd0
> >
>
> Do you have some sort of reproducer for that crash? If not I guess I can cook
> something up with a quiet userspace & rt-app, though I've never seen that one
> on arm64.

Unfortunately no, I don't have a reproduction case. Would love to have
one to verify the patch though.

> > Make sure we update the rq clock and load before balancing.
> >
> > Cc: Ingo Molnar
> > Cc: Mike Galbraith
> > Cc: Peter Zijlstra
> > Signed-off-by: Matt Fleming
> > ---
> >  kernel/sched/fair.c | 10 ++++++++++
> >  1 file changed, 10 insertions(+)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 2f0a0be4d344..2c81662c858a 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -9597,6 +9597,16 @@ static bool _nohz_idle_balance(struct rq *this_rq, unsigned int flags,
> >  	 */
> >  	smp_mb();
> >
> > +	/*
> > +	 * Ensure this_rq's clock and load are up-to-date before we
> > +	 * rebalance since it's possible that they haven't been
> > +	 * updated for multiple schedule periods, i.e. many seconds.
> > +	 */
> > +	raw_spin_lock_irq(&this_rq->lock);
> > +	update_rq_clock(this_rq);
> > +	cpu_load_update_idle(this_rq);
> > +	raw_spin_unlock_irq(&this_rq->lock);
> > +
>
> I'm failing to understand why the updates further down below are seemingly
> not enough. After we've potentially done
>
>   update_rq_clock(rq);
>   cpu_load_update_idle(rq);
>
> for all nohz cpus != this_cpu, we still end up doing:
>
>   if (idle != CPU_NEWLY_IDLE) {
>   	update_blocked_averages(this_cpu);
>   	has_blocked_load |= this_rq->has_blocked_load;
>   }
>
> which should properly update this_rq's clock and load before we attempt to do
> any balancing on it.

But cpu_load_update_idle() and update_blocked_averages() are not the same
thing.
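
To make the distinction in the last sentence concrete, here is a minimal user-space sketch. It is a toy model and not kernel code: struct toy_rq, its fields, the toy_* helpers and the "halve once per tick" decay rule are all hypothetical and chosen only for illustration. The one point it tries to show is that two update paths keeping separate state and separate timestamps do not refresh each other, so running one can leave the other arbitrarily stale.

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-in for a runqueue; NOT the kernel's struct rq. */
struct toy_rq {
	uint64_t clock;         /* current time, in ticks */
	uint64_t load_stamp;    /* when cpu_load was last refreshed */
	uint64_t blocked_stamp; /* when blocked_avg was last refreshed */
	uint64_t cpu_load;
	uint64_t blocked_avg;
};

/*
 * Hypothetical analogue of cpu_load_update_idle(): decay cpu_load by
 * halving it once per elapsed tick (an arbitrary toy rule).
 */
static void toy_update_cpu_load(struct toy_rq *rq)
{
	uint64_t delta = rq->clock - rq->load_stamp;

	/* Shifting a 64-bit value by 64 or more is undefined in C, so clamp. */
	rq->cpu_load = delta >= 64 ? 0 : rq->cpu_load >> delta;
	rq->load_stamp = rq->clock;
}

/*
 * Hypothetical analogue of update_blocked_averages(): it only touches
 * the blocked-load state and leaves cpu_load and load_stamp alone.
 */
static void toy_update_blocked(struct toy_rq *rq)
{
	uint64_t delta = rq->clock - rq->blocked_stamp;

	rq->blocked_avg = delta >= 64 ? 0 : rq->blocked_avg >> delta;
	rq->blocked_stamp = rq->clock;
}

/* Number of ticks since cpu_load was last refreshed. */
static uint64_t toy_load_staleness(const struct toy_rq *rq)
{
	assert(rq->clock >= rq->load_stamp);
	return rq->clock - rq->load_stamp;
}
```

In this model, calling toy_update_blocked() on a queue whose load_stamp is far behind clock leaves toy_load_staleness() unchanged; only toy_update_cpu_load() brings it back to zero. Whether the real functions behave analogously on a given kernel version should be checked against kernel/sched/fair.c.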