Date: Wed, 24 Oct 2018 14:03:35 +0200
From: Juri Lelli
To: Peter Zijlstra
Cc: luca abeni, Thomas Gleixner, Juri Lelli, syzbot, Borislav Petkov,
    "H. Peter Anvin", LKML, mingo@redhat.com, nstange@suse.de,
    syzkaller-bugs@googlegroups.com, henrik@austad.us, Tommaso Cucinotta,
    Claudio Scordino, Daniel Bristot de Oliveira
Subject: Re: INFO: rcu detected stall in do_idle
Message-ID: <20181024120335.GE29272@localhost.localdomain>
References: <20181016140322.GB3121@hirez.programming.kicks-ass.net>
    <20181016144045.GF9130@localhost.localdomain>
    <20181016153608.GH9130@localhost.localdomain>
    <20181018082838.GA21611@localhost.localdomain>
    <20181018122331.50ed3212@luca64>
    <20181018104713.GC21611@localhost.localdomain>
    <20181018130811.61337932@luca64>
    <20181019113942.GH3121@hirez.programming.kicks-ass.net>
    <20181019225005.61707c64@nowhere>
In-Reply-To: <20181019225005.61707c64@nowhere>
X-Mailing-List: linux-kernel@vger.kernel.org

On 19/10/18 22:50, luca abeni wrote:
> On Fri, 19 Oct 2018 13:39:42 +0200
> Peter Zijlstra wrote:
>
> > On Thu, Oct 18, 2018 at 01:08:11PM +0200, luca abeni wrote:
> > > Ok, I see the issue now: the problem is that the "while
> > > (dl_se->runtime <= 0)" loop is executed at replenishment time, but
> > > the deadline should be postponed at enforcement time.
> > >
> > > I mean: in update_curr_dl() we do:
> > >     dl_se->runtime -= scaled_delta_exec;
> > >     if (dl_runtime_exceeded(dl_se) || dl_se->dl_yielded) {
> > >         ...
> > >         enqueue replenishment timer at dl_next_period(dl_se)
> > > But dl_next_period() is based on a "wrong" deadline!
> > >
> > >
> > > I think that inserting a
> > >     while (dl_se->runtime <= -pi_se->dl_runtime) {
> > >         dl_se->deadline += pi_se->dl_period;
> > >         dl_se->runtime += pi_se->dl_runtime;
> > >     }
> > > immediately after "dl_se->runtime -= scaled_delta_exec;" would fix
> > > the problem, no?
> >
> > That certainly makes sense to me.
>
> Good; I'll try to work on this idea in the weekend.

So, we (Luca and I) managed to spend some more time on this and found a
few more things worth sharing. I'll try to summarize what we have so far
(including what was already discussed), because my impression is that
each point deserves a fix, or at least some consideration (just amazing
how a simple random fuzzer can highlight all that :).

Apologies for the long email.

The reproducer runs on a CONFIG_HZ=100, CONFIG_IRQ_TIME_ACCOUNTING kernel
and does something like this (only the bits that seem to matter here):

  int main(void)
  {
    [...]
    [setup stuff at 0x2001d000]
    syscall(__NR_perf_event_open, 0x2001d000, 0, -1, -1, 0);
    *(uint32_t*)0x20000000 = 0;
    *(uint32_t*)0x20000004 = 6;
    *(uint64_t*)0x20000008 = 0;
    *(uint32_t*)0x20000010 = 0;
    *(uint32_t*)0x20000014 = 0;
    *(uint64_t*)0x20000018 = 0x9917; <-- ~40us
    *(uint64_t*)0x20000020 = 0xffff; <-- ~65us (~60% bandwidth)
    *(uint64_t*)0x20000028 = 0;
    syscall(__NR_sched_setattr, 0, 0x20000000, 0);
    [busy loop]
    return 0;
  }

And this causes problems because the task is actually never throttled.

Pain points:

 1. Granularity of enforcement (at each tick) is huge compared with the
    task runtime. As a result, starting the replenishment timer when the
    runtime is depleted always fails (because the old deadline is way in
    the past). So, the task is fully replenished and put back to run.

    - Luca's proposal should help here, since the deadline is postponed
      at throttling time, and the replenishment timer is set to that
      (and it should be in the future)

 1.1 Even if we fix 1.
     in a configuration like this, the task would still be able to run
     for ~10ms (worst case) and potentially starve other tasks. That
     might not seem too big an interval, but there might be other very
     short activities that miss an occasion to run "quickly".

    - Might be fixed by imposing (via sysctl) reasonable defaults for
      the minimum runtime (w.r.t. HZ, like HZ/2) and the maximum period
      (as even a very small bandwidth task can have a big runtime if its
      period is big as well)

 (1.2) When runtime becomes very negative (because delta_exec was big)
       we seem to spend a lot of time inside the replenishment loop.

    - Not sure it's such a big problem, might need more profiling. My
      feeling is that once the other points are addressed this won't
      matter anymore

 2. This is related to the perf_event_open syscall the reproducer does
    before becoming DEADLINE and entering the busy loop. Enabling perf
    swevents generates a lot of hrtimer load that happens in the
    reproducer task context. Now, DEADLINE uses rq_clock() for setting
    deadlines, but rq_clock_task() for doing runtime enforcement. In a
    situation like this it seems that the amount of irq pressure becomes
    pretty big (I'm seeing this on kvm, real hw should maybe do better,
    but the pain point remains I guess), so rq_clock() and
    rq_clock_task() might become more and more skewed w.r.t. each other.
    Since rq_clock() is only used when setting absolute deadlines for
    the first time (or when resetting them in certain cases), after a
    bit the replenishment code will start to see postponed deadlines
    always in the past w.r.t. rq_clock(). And this brings us back to the
    fact that the task is never stopped, since it can't keep up with
    rq_clock().

    - Not sure yet how we want to address this [1]. We could use
      rq_clock() everywhere, but tasks might be penalized by irq
      pressure (theoretically this would mandate that irqs are
      explicitly accounted for I guess).
      I tried to use the skew between the two clocks to "fix"
      deadlines, but that puts us at risk of de-synchronizing the
      userspace and kernel views of deadlines.

 3. HRTICK is not started for new entities.

    - Already got a patch for it.

This should be it, I hope. Luca (thanks a lot for your help), please add
to this or correct me if I got something wrong.

Thoughts?

Best,

- Juri

1 - https://elixir.bootlin.com/linux/latest/source/kernel/sched/deadline.c#L1162
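
To make the arithmetic behind pain point 1 concrete, here is a minimal
userspace sketch (not kernel code) of Luca's postponement loop. The
struct and both helper names are made up for illustration. It assumes an
implicit deadline (period == deadline, so the next period starts at the
absolute deadline), a single clock for both deadlines and enforcement
(so it ignores pain point 2), and that a whole 10ms tick is charged at
once.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for sched_dl_entity; all values in ns. */
struct dl_entity {
    int64_t  runtime;    /* remaining budget, may go negative */
    uint64_t deadline;   /* absolute deadline */
    int64_t  dl_runtime; /* budget granted per period */
    uint64_t dl_period;  /* period length (== relative deadline here) */
};

/*
 * Charge delta_exec and postpone the deadline at enforcement time, as in
 * the loop proposed in the thread. Returns how many periods were skipped.
 */
static unsigned int charge_and_postpone(struct dl_entity *se, int64_t delta_exec)
{
    unsigned int postponed = 0;

    assert(se->dl_runtime > 0 && se->dl_period > 0);

    se->runtime -= delta_exec;
    while (se->runtime <= -se->dl_runtime) {
        se->deadline += se->dl_period;
        se->runtime += se->dl_runtime;
        postponed++;
    }
    return postponed;
}

/*
 * Assumption: the replenishment timer is armed at the absolute deadline
 * (implicit-deadline case), which only works if that instant is still
 * ahead of "now".
 */
static int replenish_timer_in_future(const struct dl_entity *se, uint64_t now)
{
    return se->deadline > now;
}
```

With the reproducer's parameters (runtime 0x9917 = 39191ns, period
0xffff = 65535ns, first deadline at 65535ns) and one 10ms tick charged at
t = 10000000ns, the unpostponed deadline is far in the past, so the timer
cannot be armed. After the loop the deadline has moved 254 periods
forward to 16711425ns, which is in the future, and the leftover runtime
is -6295ns.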