Date: Sat, 16 May 2015 00:17:30 +0200 (CEST)
From: Thomas Gleixner <tglx@linutronix.de>
To: Chris Metcalf <cmetcalf@ezchip.com>
cc: Gilad Ben Yossef <giladb@ezchip.com>, Steven Rostedt <rostedt@goodmis.org>,
        Ingo Molnar <mingo@kernel.org>, Peter Zijlstra <peterz@infradead.org>,
        Andrew Morton <akpm@linux-foundation.org>,
        Rik van Riel <riel@redhat.com>, Tejun Heo <tj@kernel.org>,
        Frederic Weisbecker <fweisbec@gmail.com>,
        "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
        Christoph Lameter <cl@linux.com>,
        Viresh Kumar <viresh.kumar@linaro.org>, linux-doc@vger.kernel.org,
        linux-api@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v2 1/5] nohz_full: add support for "cpu_isolated" mode
In-Reply-To: <1431725251-20943-1-git-send-email-cmetcalf@ezchip.com>
Message-ID: <alpine.DEB.2.11.1505152337170.4225@nanos>
References: <1431725178-20876-1-git-send-email-cmetcalf@ezchip.com> <1431725251-20943-1-git-send-email-cmetcalf@ezchip.com>
User-Agent: Alpine 2.11 (DEB 23 2013-08-11)
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 3059
Lines: 91

On Fri, 15 May 2015, Chris Metcalf wrote:
> +/*
> + * We normally return immediately to userspace.
> + *
> + * In "cpu_isolated" mode we wait until no more interrupts are
> + * pending.  Otherwise we nap with interrupts enabled and wait for the
> + * next interrupt to fire, then loop back and retry.
> + *
> + * Note that if you schedule two "cpu_isolated" processes on the same
> + * core, neither will ever leave the kernel, and one will have to be
> + * killed manually.

And why are we not preventing that situation in the first place? The
scheduler should be able to figure that out easily..

> +  Otherwise in situations where another process is
> + * in the runqueue on this cpu, this task will just wait for that
> + * other task to go idle before returning to user space.
> + */
> +void tick_nohz_cpu_isolated_enter(void)
> +{
> +	struct clock_event_device *dev =
> +		__this_cpu_read(tick_cpu_device.evtdev);
> +	struct task_struct *task = current;
> +	unsigned long start = jiffies;
> +	bool warned = false;
> +
> +	/* Drain the pagevecs to avoid unnecessary IPI flushes later. */
> +	lru_add_drain();
> +
> +	while (ACCESS_ONCE(dev->next_event.tv64) != KTIME_MAX) {

What's the ACCESS_ONCE for?

> +		if (!warned && (jiffies - start) >= (5 * HZ)) {
> +			pr_warn("%s/%d: cpu %d: cpu_isolated task blocked for %ld jiffies\n",
> +				task->comm, task->pid, smp_processor_id(),
> +				(jiffies - start));

What additional value has the jiffies delta over a plain human
readable '5sec' ?

> +			warned = true;
> +		}
> +		if (should_resched())
> +			schedule();
> +		if (test_thread_flag(TIF_SIGPENDING))
> +			break;
> +
> +		/* Idle with interrupts enabled and wait for the tick. */
> +		set_current_state(TASK_INTERRUPTIBLE);
> +		arch_cpu_idle();

Oh NO! Not another variant of fake idle task. The idle implementations
can call into code which rightfully expects that the CPU is actually
IDLE.

I wasted enough time already debugging the resulting wreckage. Feel
free to use it for experimental purposes, but this is not going
anywhere near to a mainline kernel.

I completely understand WHY you want to do that, but we need proper
mechanisms for that and not some duct tape engineering band aids which
will create hard to debug side effects.

Hint: It's a scheduler job to make sure that the machine has quiesced
      _BEFORE_ letting the magic task off to user land.

> +		set_current_state(TASK_RUNNING);
> +	}
> +	if (warned) {
> +		pr_warn("%s/%d: cpu %d: cpu_isolated task unblocked after %ld jiffies\n",
> +			task->comm, task->pid, smp_processor_id(),
> +			(jiffies - start));
> +		dump_stack();

And that dump_stack() tells us which important information?

    	 tick_nohz_cpu_isolated_enter
	 context_tracking_enter
	 context_tracking_user_enter
	 arch_return_to_user_code

Thanks,

	tglx
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/