Date: Wed, 15 May 2019 17:26:20 -0300
From: Marcelo Tosatti
To: Wanpeng Li
Cc: Konrad Rzeszutek Wilk, kvm-devel, LKML, Thomas Gleixner, Ingo Molnar, Andrea Arcangeli, Bandan Das, Paolo Bonzini
Subject: Re: [PATCH] sched: introduce configurable delay before entering idle
Message-ID: <20190515202618.GA31128@amt.cnet>
References: <20190507185647.GA29409@amt.cnet>
    <20190514135022.GD4392@amt.cnet>
    <20190514152015.GM20906@char.us.oracle.com>
    <20190514174235.GA12269@amt.cnet>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, May 15, 2019 at 09:42:48AM +0800, Wanpeng Li wrote:
> On Wed, 15 May 2019 at 02:20, Marcelo Tosatti wrote:
> >
> > On Tue, May 14, 2019 at 11:20:15AM -0400, Konrad Rzeszutek Wilk wrote:
> > > On Tue, May 14, 2019 at 10:50:23AM -0300, Marcelo Tosatti wrote:
> > > > On Mon, May 13, 2019 at 05:20:37PM +0800, Wanpeng Li wrote:
> > > > > On Wed, 8 May 2019 at 02:57, Marcelo Tosatti wrote:
> > > > > >
> > > > > >
> > > > > > Certain workloads perform poorly on KVM compared to baremetal
> > > > > > due to baremetal's ability to perform mwait on NEED_RESCHED
> > > > > > bit of task flags (therefore skipping the IPI).
> > > > >
> > > > > KVM supports expose mwait to the guest, if it can solve this?
> > > > >
> > > > > Regards,
> > > > > Wanpeng Li
> > > >
> > > > Unfortunately mwait in guest is not feasible (uncompatible with multiple
> > > > guests). Checking whether a paravirt solution is possible.
> > >
> > > There is the obvious problem with that the guest can be malicious and
> > > provide via the paravirt solution bogus data. That is it expose 0% CPU
> > > usage but in reality be mining and using 100%.
> >
> > The idea is to have a hypercall for the guest to perform the
> > need_resched=1 bit set. It can only hurt itself.
>
> This lets me recall the patchset from aliyun
> https://lkml.org/lkml/2017/6/22/296

Thanks for the pointer.
"The background is that we(Alibaba Cloud) do get more and more complaints
from our customers in both KVM and Xen compare to bare-mental. After
investigations, the root cause is known to us: big cost in message passing
workload(David show it in KVM forum 2015)

A typical message workload like below:
vcpu 0                          vcpu 1
1. send ipi                     2.  doing hlt
3. go into idle                 4.  receive ipi and wake up from hlt
5. write APIC time twice        6.  write APIC time twice to
   to stop sched timer              reprogram sched timer
7. doing hlt                    8.  handle task and send ipi to
                                    vcpu 0
9. same to 4.                   10. same to 3"

This is very similar to the client/server example pair included in the
first message.

> They poll after
> __current_set_polling() in do_idle() so avoid this hypercall I think.

Yes, I was thinking about a variant without poll.

> Btw, do you get SAP HANA by 5-10% bonus even if adaptive halt-polling
> is enabled?

host = 31.18
halt_poll_ns set to 200000 = 38.55 (80%)
halt_poll_ns set to 300000 = 33.28 (93%)
idle_spin set to 220000 = 32.22 (96%)

So avoiding the IPI VM-exits is faster.

300000 is the optimal value for this workload.

Haven't checked adaptive halt-polling.