From: Andy Lutomirski
Date: Wed, 6 Apr 2016 21:58:07 -0700
Subject: Re: [PATCH] x86/hpet: Reduce HPET counter read contention
To: Waiman Long
Cc: Thomas Gleixner, Ingo Molnar, "H. Peter Anvin",
    "linux-kernel@vger.kernel.org", X86 ML, Jiang Liu, Borislav Petkov,
    Andy Lutomirski, Scott J Norton, Douglas Hatch, Randy Wright
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Apr 6, 2016 at 7:02 AM, Waiman Long wrote:
> On a large system with many CPUs, using HPET as the clock source can
> have a significant impact on overall system performance for two
> reasons:
> 1) There is a single HPET counter shared by all the CPUs.
> 2) Reading the HPET counter is a very slow operation.
>
> HPET may end up as the default clock source when, for example, the TSC
> clock calibration exceeds the allowable tolerance. Sometimes the
> performance slowdown can be so severe that the system may crash
> because of an NMI watchdog soft lockup, for example.
>
> This patch attempts to reduce HPET read contention by exploiting the
> fact that if more than one task is trying to access the HPET at the
> same time, it is more efficient for one task in the group to read the
> HPET counter and share it with the rest of the group than for each
> group member to read the counter individually.
>
> This is done with a combination word containing a sequence number and
> a bit lock. The task that acquires the bit lock is responsible for
> reading the HPET counter and updating the sequence number. The others
> monitor the change in the sequence number and pick up the shared HPET
> value accordingly.
>
> On a 4-socket Haswell-EX box with 72 cores (HT off), running the
> AIM7 compute workload (1500 users) on a 4.6-rc1 kernel (HZ=1000)
> with and without the patch produced the following performance numbers
> (with HPET or TSC as the clock source):
>
> TSC             =  646515 jobs/min
> HPET w/o patch  =  566708 jobs/min
> HPET with patch =  638791 jobs/min
>
> The perf profile showed a reduction of the %CPU time consumed by
> read_hpet from 4.99% without the patch to 1.41% with it.
>
> On a 16-socket IvyBridge-EX system with 240 cores (HT on), on the
> other hand, the performance numbers for the same benchmark were:
>
> TSC             = 3145329 jobs/min
> HPET w/o patch  = 1108537 jobs/min
> HPET with patch = 3019934 jobs/min
>
> The corresponding perf profile showed a drop in CPU consumption of
> the read_hpet function from more than 34% to just 2.96%.
>
> Signed-off-by: Waiman Long
> ---
>  arch/x86/kernel/hpet.c | 110 +++++++++++++++++++++++++++++++++++++++++++++++-
>  1 files changed, 109 insertions(+), 1 deletions(-)
>
> diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
> index a1f0e4a..9e3de73 100644
> --- a/arch/x86/kernel/hpet.c
> +++ b/arch/x86/kernel/hpet.c
> @@ -759,11 +759,112 @@ static int hpet_cpuhp_notify(struct notifier_block *n,
>  #endif
>
>  /*
> + * Reading the HPET counter is a very slow operation. If a large number of
> + * CPUs are trying to access the HPET counter simultaneously, it can cause
> + * massive delay and slow down system performance dramatically. This may
> + * happen when HPET is the default clock source instead of TSC.
> + * For a really large system with hundreds of CPUs, the slowdown may
> + * be so severe that it can actually crash the system because of an
> + * NMI watchdog soft lockup, for example.
> + *
> + * If multiple CPUs are trying to access the HPET counter at the same time,
> + * we don't actually need to read the counter multiple times. Instead, the
> + * other CPUs can use the counter value read by the first CPU in the group.
> + *
> + * A sequence number whose lsb is a lock bit is used to control which CPU
> + * has the right to read the HPET counter directly and which CPUs are going
> + * to get the indirect value read by the lock holder. For the latter group,
> + * if the sequence number differs from the expected locked value, they
> + * can assume that the saved HPET value is up-to-date and return it.
> + *
> + * This mechanism is only activated on systems with a large number of CPUs.
> + * Currently, it is enabled when nr_cpus > 64.
> + */

Reading the HPET is so slow that all the atomic ops in the world won't
make a dent.  Why not just turn this optimization on unconditionally?

--Andy