Date: Wed, 6 Mar 2019 11:23:13 -0500
From: Phil Auld
To: bsegall@google.com
Cc: mingo@redhat.com, peterz@infradead.org, linux-kernel@vger.kernel.org
Subject: Re: [RFC] sched/fair: hard lockup in sched_cfs_period_timer
Message-ID: <20190306162313.GB8786@pauld.bos.csb>
References: <20190301145209.GA9304@pauld.bos.csb> <20190304190510.GB5366@lorien.usersys.redhat.com> <20190305200554.GA8786@pauld.bos.csb>

On Tue, Mar 05, 2019 at 12:45:34PM -0800 bsegall@google.com wrote:
> Phil Auld writes:
>
> > Interestingly, if I limit the number of child cgroups to the number of
> > them I'm actually putting processes into (16 down from 2500) the problem
> > does not reproduce.
>
> That is indeed interesting, and definitely not something we'd want to
> matter. (Particularly if it's not root->a->b->c...->throttled_cgroup or
> root->throttled->a->...->thread vs root->throttled_cgroup, which is what
> I was originally thinking of)
>

The locking may be a red herring.

The setup is root->throttled->a where a is 1-2500. There are 4 threads in
each of the first 16 a groups. The parent, throttled, is where the
cfs_period/quota_us are set.

I wonder if the problem is the walk_tg_tree_from() call in
unthrottle_cfs_rq().

distribute_cfs_runtime() looks to be O(n * m), where n is the number of
throttled cfs_rqs and m is the number of child cgroups. But I'm not
completely clear on how the hierarchical cgroups play together here.

I'll pull on this thread some.

Thanks for your input.

Cheers,
Phil

--
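To make the O(n * m) suspicion above concrete, here is a toy model of the cost, not kernel code. It assumes that each throttled cfs_rq triggers one walk over the throttled group's entire subtree, which is how the walk_tg_tree_from() call in unthrottle_cfs_rq() is read above. The names TaskGroup, walk_subtree, distribute_visits and build_hierarchy are hypothetical, and the CPU count is an assumption, since the thread does not state it.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class TaskGroup:
    """Toy stand-in for a task group; not the kernel's struct task_group."""
    name: str
    children: list[TaskGroup] = field(default_factory=list)


def walk_subtree(tg: TaskGroup) -> int:
    """Visit tg and every descendant once; return how many groups were visited."""
    visited = 0
    stack = [tg]
    while stack:
        node = stack.pop()
        visited += 1
        stack.extend(node.children)
    return visited


def distribute_visits(throttled: TaskGroup, throttled_rqs: int) -> int:
    """Model one distribution pass: one full subtree walk per throttled runqueue."""
    return sum(walk_subtree(throttled) for _ in range(throttled_rqs))


def build_hierarchy(num_children: int) -> TaskGroup:
    """Build throttled->a1..aN, mirroring the root->throttled->a setup."""
    kids = [TaskGroup(f"a{i}") for i in range(1, num_children + 1)]
    return TaskGroup("throttled", kids)


if __name__ == "__main__":
    cpus = 16  # assumption: one throttled cfs_rq per CPU; the count is made up
    for m in (16, 2500):
        visits = distribute_visits(build_hierarchy(m), cpus)
        print(f"children={m:5d} throttled_rqs={cpus} group visits={visits}")
```

In this model the work per pass is n * (m + 1), so going from 16 to 2500 children multiplies it by roughly 150 even though only 16 of the groups hold any threads. Whether the real walk behaves this way is exactly the open question in the mail.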