Date: Thu, 28 Sep 2017 14:37:58 +0200
From: Peter Zijlstra <peterz@infradead.org>
To: Rik van Riel <riel@redhat.com>
Cc: Eric Farman <farman@linux.vnet.ibm.com>,
        jinpuwang@gmail.com, LKML <linux-kernel@vger.kernel.org>,
        Ingo Molnar <mingo@redhat.com>,
        Christian Borntraeger <borntraeger@de.ibm.com>,
        "KVM-ML (kvm@vger.kernel.org)" <kvm@vger.kernel.org>,
        vcaputo@pengaru.com, Matthew Rosato <mjrosato@linux.vnet.ibm.com>
Subject: Re: sysbench throughput degradation in 4.13+
Message-ID: <20170928123758.robe5ggsjf4voj7h@hirez.programming.kicks-ass.net>
References: <95edafb1-5e9d-8461-db73-bcb002b7ebef@linux.vnet.ibm.com>
 <CAD9gYJJ9nSAbznEn80hfY3=+YjA8cKw6RztpgW6iDm7rQ0EsFg@mail.gmail.com>
 <50a279d3-84eb-3403-f2f0-854934778037@linux.vnet.ibm.com>
 <20170922155348.zujigkn3o5eylctn@hirez.programming.kicks-ass.net>
 <754f5a9f-5332-148d-2631-918fc7a7cfe9@linux.vnet.ibm.com>
 <20170927093530.s3sgdz2vamc5ka4w@hirez.programming.kicks-ass.net>
 <20170927135820.61cd077f@cuia.usersys.redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20170927135820.61cd077f@cuia.usersys.redhat.com>
User-Agent: NeoMutt/20170609 (1.8.3)
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 1693
Lines: 44

On Wed, Sep 27, 2017 at 01:58:20PM -0400, Rik van Riel wrote:
> @@ -5359,10 +5378,14 @@ wake_affine_llc(struct sched_domain *sd, struct task_struct *p,
>  		unsigned long current_load = task_h_load(current);
>  
>  		/* in this case load hits 0 and this LLC is considered 'idle' */
> -		if (current_load > this_stats.load)
> +		if (current_load > this_stats.max_load)
> +			return true;
> +
> +		/* allow if the CPU would go idle, regardless of LLC load */
> +		if (current_load >= target_load(this_cpu, sd->wake_idx))
>  			return true;
>  
> -		this_stats.load -= current_load;
> +		this_stats.max_load -= current_load;
>  	}
>  
>  	/*
> @@ -5375,10 +5398,6 @@ wake_affine_llc(struct sched_domain *sd, struct task_struct *p,
>  	if (prev_stats.has_capacity && prev_stats.nr_running < this_stats.nr_running+1)
>  		return false;
>  
> -	/* if this cache has capacity, come here */
> -	if (this_stats.has_capacity && this_stats.nr_running+1 < prev_stats.nr_running)
> -		return true;
> -
>  	/*
>  	 * Check to see if we can move the load without causing too much
>  	 * imbalance.
> @@ -5391,8 +5410,8 @@ wake_affine_llc(struct sched_domain *sd, struct task_struct *p,
>  	prev_eff_load = 100 + (sd->imbalance_pct - 100) / 2;
>  	prev_eff_load *= this_stats.capacity;
>  
> -	this_eff_load *= this_stats.load + task_load;
> -	prev_eff_load *= prev_stats.load - task_load;
> +	this_eff_load *= this_stats.max_load + task_load;
> +	prev_eff_load *= prev_stats.min_load - task_load;
>  
>  	return this_eff_load <= prev_eff_load;
>  }

So I would really like a workload that needs this LLC/NUMA stuff,
because I much prefer the simpler 'on which of these two CPUs can I run
soonest' approach.
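
To make 'run soonest' concrete, here is a rough userspace sketch, not
kernel code. It assumes the expected wait on a CPU can be approximated
as queued load divided by capacity. The struct, its fields and both
function names are hypothetical and exist only for this illustration.

```c
#include <stdbool.h>

/* Hypothetical per-CPU snapshot for illustration; not a kernel structure. */
struct cpu_snapshot {
	int id;
	unsigned long queued_load;	/* runnable load already on this CPU */
	unsigned long capacity;		/* relative compute capacity, must be > 0 */
};

/*
 * Assumption: wait time ~ queued_load / capacity.
 * Returns true if 'a' would get to a newly woken task strictly sooner
 * than 'b'. The two ratios are compared by cross-multiplying, so no
 * division is needed.
 */
static bool runs_sooner(const struct cpu_snapshot *a,
			const struct cpu_snapshot *b)
{
	return a->queued_load * b->capacity < b->queued_load * a->capacity;
}

/*
 * Pick between the waking CPU and the task's previous CPU.
 * On a tie the task stays on prev_cpu, on the assumption that its
 * cache is more likely to still be warm there.
 */
static int pick_wake_cpu(const struct cpu_snapshot *this_cpu,
			 const struct cpu_snapshot *prev_cpu)
{
	return runs_sooner(this_cpu, prev_cpu) ? this_cpu->id : prev_cpu->id;
}
```

With an idle waking CPU and a loaded previous CPU this picks the waking
CPU. With equal estimated waits it leaves the task where it was. Nothing
here looks at LLC or NUMA statistics, which is the whole point of the
simpler approach.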