Date: Sat, 9 Apr 2016 17:19:00 -0700 (PDT)
From: Thomas Gleixner <tglx@linutronix.de>
To: Waiman Long <Waiman.Long@hpe.com>
cc: Ingo Molnar <mingo@redhat.com>, "H. Peter Anvin" <hpa@zytor.com>,
        Jonathan Corbet <corbet@lwn.net>, LKML <linux-kernel@vger.kernel.org>,
        linux-doc@kernel.org, x86@kernel.org, Borislav Petkov <bp@suse.de>,
        Andy Lutomirski <luto@kernel.org>,
        Scott J Norton <scott.norton@hpe.com>,
        Douglas Hatch <doug.hatch@hpe.com>, Randy Wright <rwright@hpe.com>
Subject: Re: [PATCH v2] x86/hpet: Reduce HPET counter read contention
In-Reply-To: <1460146305-14666-1-git-send-email-Waiman.Long@hpe.com>
Message-ID: <alpine.DEB.2.11.1604091552540.3786@nanos>
References: <1460146305-14666-1-git-send-email-Waiman.Long@hpe.com>
User-Agent: Alpine 2.11 (DEB 23 2013-08-11)
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 3476
Lines: 114

On Fri, 8 Apr 2016, Waiman Long wrote:
> This patch attempts to reduce HPET read contention by using the fact
> that if more than one task are trying to access HPET at the same time,
> it will be more efficient if one task in the group reads the HPET
> counter and shares it with the rest of the group instead of each
> group member reads the HPET counter individually.

That has nothing to do with tasks. clocksource reads can happen from almost
any context. The problem is concurrent access on multiple cpus.

> This optimization is enabled on systems with more than 32 CPUs. It can
> also be explicitly enabled or disabled by using the new opt_read_hpet
> kernel parameter.

Please not. What's wrong with enabling it unconditionally?
 
> +/*
>   * Clock source related code
>   */
>  static cycle_t read_hpet(struct clocksource *cs)
>  {
> -	return (cycle_t)hpet_readl(HPET_COUNTER);
> +	int seq, cnt = 0;
> +	u32 time;
> +
> +	if (opt_read_hpet <= 0)
> +		return (cycle_t)hpet_readl(HPET_COUNTER);

This wants to be conditional on CONFIG_SMP. No point in having all that muck
around for an UP kernel.

> +	seq = READ_ONCE(hpet_save.seq);
> +	if (!HPET_SEQ_LOCKED(seq)) {
> +		int old, new = seq + 1;
> +		unsigned long flags;
> +
> +		local_irq_save(flags);
> +		/*
> +		 * Set the lock bit (lsb) to get the right to read HPET
> +		 * counter directly. If successful, read the counter, save
> +		 * its value, and increment the sequence number. Otherwise,
> +		 * increment the sequnce number to the expected locked value
> +		 * for comparison later on.
> +		 */
> +		old = cmpxchg(&hpet_save.seq, seq, new);
> +		if (old == seq) {
> +			time = hpet_readl(HPET_COUNTER);
> +			WRITE_ONCE(hpet_save.hpet, time);
> +
> +			/* Unlock */
> +			smp_store_release(&hpet_save.seq, new + 1);
> +			local_irq_restore(flags);
> +			return (cycle_t)time;
> +		}
> +		local_irq_restore(flags);
> +		seq = new;
> +	}
> +
> +	/*
> +	 * Wait until the locked sequence number changes which indicates
> +	 * that the saved HPET value is up-to-date.
> +	 */
> +	while (READ_ONCE(hpet_save.seq) == seq) {
> +		/*
> +		 * Since reading the HPET is much slower than a single
> +		 * cpu_relax() instruction, we use two here in an attempt
> +		 * to reduce the amount of cacheline contention in the
> +		 * hpet_save.seq cacheline.
> +		 */
> +		cpu_relax();
> +		cpu_relax();
> +
> +		if (likely(++cnt <= HPET_RESET_THRESHOLD))
> +			continue;
> +
> +		/*
> +		 * In the unlikely event that it takes too long for the lock
> +		 * holder to read the HPET, we do it ourselves and try to
> +		 * reset the lock. This will also break a deadlock if it
> +		 * happens, for example, when the process context lock holder
> +		 * gets killed in the middle of reading the HPET counter.
> +		 */
> +		time = hpet_readl(HPET_COUNTER);
> +		WRITE_ONCE(hpet_save.hpet, time);
> +		if (READ_ONCE(hpet_save.seq) == seq) {
> +			if (cmpxchg(&hpet_save.seq, seq, seq + 1) == seq)
> +				pr_warn("read_hpet: reset hpet seq to 0x%x\n",
> +					seq + 1);

This is voodoo programming and actively dangerous.

CPU0 	        CPU1	       		CPU2
lock_hpet()
T1=read_hpet()	wait_for_unlock()	
store_hpet(T1)	
		....			
		T2 = read_hpet()
unlock_hpet()				
					lock_hpet()
					T3 = read_hpet()
					store_hpet(T3)
					unlock_hpet()
					return T3
lock_hpet()
T4 = read_hpet()			wait_for_unlock()
store_hpet(T4)	
		store_hpet(T2)			
unlock_hpet()				return T2

CPU2 will observe time going backwards.

Thanks,

	tglx