Subject: Re: [PATCH] sched/fair: Avoid divide by zero when rebalancing domains
From: Valentin Schneider
To: Matt Fleming
Cc: Peter Zijlstra, linux-kernel@vger.kernel.org, Ingo Molnar, Mike Galbraith
Date: Fri, 17 Aug 2018 13:58:53 +0100

Hi,

On 17/08/18 11:27, Matt Fleming wrote:
> On Thu, 05 Jul, at 05:54:02PM, Valentin Schneider wrote:
>> On 05/07/18 14:27, Matt Fleming wrote:
>>> On Thu, 05 Jul, at 11:10:42AM, Valentin Schneider wrote:
>>>> Hi,
>>>>
>>>> On 04/07/18 15:24, Matt Fleming wrote:
>>>>> It's possible that the CPU doing nohz idle balance hasn't had its own
>>>>> load updated for many seconds. This can lead to huge deltas between
>>>>> rq->avg_stamp and rq->clock when rebalancing, and has been seen to
>>>>> cause the following crash:
>>>>>
>>>>> divide error: 0000 [#1] SMP
>>>>> Call Trace:
>>>>> [] update_sd_lb_stats+0xe8/0x560
>>
>> My confusion comes from not seeing where that crash happens. Would you mind
>> sharing the associated line number? I can feel the "how did I not see this"
>> from there but it can't be helped :(
>
> The divide by zero comes from scale_rt_capacity() where 'total' is a
> u64 but gets truncated when passed to div_u64() since the divisor
> parameter is u32.
>

Ah, nasty one. Interestingly enough that bit has been changed quite recently,
so I don't think you can get a div by 0 in there anymore - see
523e979d3164 ("sched/core: Use PELT for scale_rt_capacity()") and subsequent
cleanups.

> Sure, you could use div64_u64() instead, but the real issue is that
> the load hasn't been updated for a very long time and that we're
> trying to balance the domains with stale data.
>

Yeah I agree with that. However, the problem is with cpu_load - blocked load
on nohz CPUs will be periodically updated until entirely decayed. And if we
end up getting rid of cpu_load (depends on how [1] goes), then there's
nothing left to do. But we're not there yet...

[1]: https://lore.kernel.org/lkml/20180809135753.21077-1-dietmar.eggemann@arm.com/
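
For anyone following along, here is a rough userspace sketch of the truncation
described above. It is not the kernel code: div_u64_sketch() and
div64_u64_sketch() are hypothetical stand-ins that only mirror the parameter
widths of div_u64() (u32 divisor) and div64_u64() (u64 divisor), and they
report a zero divisor through a return value instead of trapping. The values
of 'total' used in the note after the block are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for div_u64(): the divisor parameter is only
 * 32 bits wide, so a 64-bit value passed by the caller loses its upper
 * 32 bits before the division happens.
 * Returns false where a real division would raise a divide error.
 */
bool div_u64_sketch(uint64_t dividend, uint32_t divisor, uint64_t *quotient)
{
	if (divisor == 0)
		return false;	/* this is the "divide error" case */

	*quotient = dividend / divisor;
	return true;
}

/*
 * Hypothetical stand-in for div64_u64(): the divisor keeps all 64 bits,
 * so a large non-zero 'total' stays non-zero.
 */
bool div64_u64_sketch(uint64_t dividend, uint64_t divisor, uint64_t *quotient)
{
	if (divisor == 0)
		return false;

	*quotient = dividend / divisor;
	return true;
}
```

With total = 1ULL << 32, the truncated divisor (uint32_t)total is 0, so
div_u64_sketch() reports failure, while div64_u64_sketch() divides by the full
value. A total that is not an exact multiple of 2^32 does not crash but
quietly divides by the wrong number, which is arguably worse.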