On a large system with many CPUs, using HPET as the clock source can
have a significant impact on the overall system performance because
of the following reasons:
1) There is a single HPET counter shared by all the CPUs.
2) HPET counter reading is a very slow operation.
Using HPET as the default clock source may happen when, for example,
the TSC clock calibration exceeds the allowable tolerance. In that
case the performance slowdown can be so severe that the system may
crash because of an NMI watchdog soft lockup, for example.
This patch attempts to reduce HPET read contention by exploiting the
fact that if more than one CPU is trying to access the HPET at the
same time, it is more efficient for one CPU in the group to read the
HPET counter and share it with the rest of the group than for each
group member to read the HPET counter individually.
This is done by using a combination word with a sequence number and
a bit lock. The CPU that acquires the bit lock is responsible for
reading the HPET counter and updating the sequence number. The others
monitor the sequence number for a change and pick up the saved HPET
counter value accordingly. This change is enabled only on SMP
configurations.
On a 4-socket Haswell-EX box with 72 cores (HT off), running the
AIM7 compute workload (1500 users) on a 4.6-rc1 kernel (HZ=1000)
with and without the patch yields the following performance numbers
(with HPET or TSC as the clock source):
TSC = 646515 jobs/min
HPET w/o patch = 566708 jobs/min
HPET with patch = 638791 jobs/min
The perf profile showed a reduction of the %CPU time consumed by
read_hpet from 4.99% without patch to 1.41% with patch.
On a 16-socket IvyBridge-EX system with 240 cores (HT on), on the
other hand, the performance numbers of the same benchmark were:
TSC = 3145329 jobs/min
HPET w/o patch = 1108537 jobs/min
HPET with patch = 3019934 jobs/min
The corresponding perf profile showed a drop in the CPU consumption
of the read_hpet function from more than 34% to just 2.96%.
Signed-off-by: Waiman Long <[email protected]>
---
v2->v3:
- Make the hpet optimization the default for SMP configuration. So
no documentation change is needed.
- Remove threshold checking code as it should not be necessary and
can be potentially unsafe.
v1->v2:
- Reduce the CPU threshold to 32.
- Add a kernel parameter to explicitly enable or disable hpet
optimization.
- Change hpet_save.hpet type to u32 to make sure that read & write
is atomic on i386.
arch/x86/kernel/hpet.c | 83 ++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 83 insertions(+), 0 deletions(-)
diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
index a1f0e4a..ff1f140 100644
--- a/arch/x86/kernel/hpet.c
+++ b/arch/x86/kernel/hpet.c
@@ -759,12 +759,95 @@ static int hpet_cpuhp_notify(struct notifier_block *n,
#endif
/*
+ * Reading the HPET counter is a very slow operation. If a large number of
+ * CPUs are trying to access the HPET counter simultaneously, it can cause
+ * massive delay and slow down system performance dramatically. This may
+ * happen when HPET is the default clock source instead of TSC. For a
+ * really large system with hundreds of CPUs, the slowdown may be so
+ * severe that it may actually crash the system because of a NMI watchdog
+ * soft lockup, for example.
+ *
+ * If multiple CPUs are trying to access the HPET counter at the same time,
+ * we don't actually need to read the counter multiple times. Instead, the
+ * other CPUs can use the counter value read by the first CPU in the group.
+ *
+ * A sequence number whose lsb is a lock bit is used to control which CPU
+ * has the right to read the HPET counter directly and which CPUs are going
+ * to get the indirect value read by the lock holder. For the latter group,
+ * if the sequence number differs from the expected locked value, they
+ * can assume that the saved HPET value is up-to-date and return it.
+ */
+static struct {
+ /* Sequence number + bit lock */
+ int seq ____cacheline_aligned_in_smp;
+
+ /* Current HPET value */
+ u32 hpet ____cacheline_aligned_in_smp;
+} hpet_save;
+#define HPET_SEQ_LOCKED(seq) ((seq) & 1) /* Odd == locked */
+
+/*
* Clock source related code
*/
+#ifdef CONFIG_SMP
+static cycle_t read_hpet(struct clocksource *cs)
+{
+ int seq;
+
+ seq = READ_ONCE(hpet_save.seq);
+ if (!HPET_SEQ_LOCKED(seq)) {
+ int old, new = seq + 1;
+ unsigned long flags;
+
+ local_irq_save(flags);
+ /*
+ * Set the lock bit (lsb) to get the right to read HPET
+ * counter directly. If successful, read the counter, save
+ * its value, and increment the sequence number. Otherwise,
+ * increment the sequence number to the expected locked value
+ * for comparison later on.
+ */
+ old = cmpxchg(&hpet_save.seq, seq, new);
+ if (old == seq) {
+ u32 time = hpet_readl(HPET_COUNTER);
+
+ WRITE_ONCE(hpet_save.hpet, time);
+
+ /* Unlock */
+ smp_store_release(&hpet_save.seq, new + 1);
+ local_irq_restore(flags);
+ return (cycle_t)time;
+ }
+ local_irq_restore(flags);
+ seq = new;
+ }
+
+ /*
+ * Wait until the locked sequence number changes which indicates
+ * that the saved HPET value is up-to-date.
+ */
+ while (READ_ONCE(hpet_save.seq) == seq) {
+ /*
+ * Since reading the HPET is much slower than a single
+ * cpu_relax() instruction, we use two here in an attempt
+ * to reduce the amount of cacheline contention in the
+ * hpet_save.seq cacheline.
+ */
+ cpu_relax();
+ cpu_relax();
+ }
+
+ return (cycle_t)READ_ONCE(hpet_save.hpet);
+}
+#else /* CONFIG_SMP */
+/*
+ * For UP
+ */
static cycle_t read_hpet(struct clocksource *cs)
{
return (cycle_t)hpet_readl(HPET_COUNTER);
}
+#endif /* CONFIG_SMP */
static struct clocksource clocksource_hpet = {
.name = "hpet",
--
1.7.1
On Mon, Apr 11, 2016 at 1:09 PM, Waiman Long <[email protected]> wrote:
> [ patch description and diff preamble quoted in full above; trimmed ]
> +#ifdef CONFIG_SMP
> +static cycle_t read_hpet(struct clocksource *cs)
> +{
> + int seq;
> +
> + seq = READ_ONCE(hpet_save.seq);
> + if (!HPET_SEQ_LOCKED(seq)) {
> + int old, new = seq + 1;
> + unsigned long flags;
> +
> + local_irq_save(flags);
> + /*
> + * Set the lock bit (lsb) to get the right to read HPET
> + * counter directly. If successful, read the counter, save
> + * its value, and increment the sequence number. Otherwise,
> + * increment the sequence number to the expected locked value
> + * for comparison later on.
> + */
> + old = cmpxchg(&hpet_save.seq, seq, new);
> + if (old == seq) {
> + u32 time = hpet_readl(HPET_COUNTER);
> +
> + WRITE_ONCE(hpet_save.hpet, time);
I think a plain store is okay due to the smp_store_release below.
> +
> + /* Unlock */
> + smp_store_release(&hpet_save.seq, new + 1);
> + local_irq_restore(flags);
> + return (cycle_t)time;
> + }
> + local_irq_restore(flags);
> + seq = new;
> + }
> +
> + /*
> + * Wait until the locked sequence number changes which indicates
> + * that the saved HPET value is up-to-date.
> + */
> + while (READ_ONCE(hpet_save.seq) == seq) {
> + /*
> + * Since reading the HPET is much slower than a single
> + * cpu_relax() instruction, we use two here in an attempt
> + * to reduce the amount of cacheline contention in the
> + * hpet_save.seq cacheline.
> + */
> + cpu_relax();
> + cpu_relax();
> + }
> +
> + return (cycle_t)READ_ONCE(hpet_save.hpet);
> +}
I wonder if this could be simplified. Pseudocode:
u32 time;
unsigned long flags;
local_irq_save(flags);
if (spin_trylock(&hpet_lock)) {
time = hpet_readl(HPET_COUNTER);
WRITE_ONCE(last_hpet_counter, time);
} else {
spin_unlock_wait(&hpet_lock);
/* When this function started, hpet_lock was locked. Now it's
unlocked, which means that time is at least as new as whatever the
lock holder returned. */
time = READ_ONCE(last_hpet_counter);
}
local_irq_restore(flags);
return time;
Should be faster under heavy contention, too: spinlocks are very nicely
optimized.
I *think* this is sufficiently atomic.
--Andy
Hi Waiman,
[auto build test WARNING on tip/auto-latest]
[also build test WARNING on v4.6-rc3 next-20160411]
[if your patch is applied to the wrong git tree, please drop us a note to help improving the system]
url: https://github.com/0day-ci/linux/commits/Waiman-Long/x86-hpet-Reduce-HPET-counter-read-contention/20160412-041402
config: x86_64-randconfig-x011-201615 (attached as .config)
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64
All warnings (new ones prefixed by >>):
>> arch/x86/kernel/hpet.c:786:3: warning: 'hpet_save' defined but not used [-Wunused-variable]
} hpet_save;
^
vim +/hpet_save +786 arch/x86/kernel/hpet.c
770 * If multiple CPUs are trying to access the HPET counter at the same time,
771 * we don't actually need to read the counter multiple times. Instead, the
772 * other CPUs can use the counter value read by the first CPU in the group.
773 *
774 * A sequence number whose lsb is a lock bit is used to control which CPU
775 * has the right to read the HPET counter directly and which CPUs are going
776 * to get the indirect value read by the lock holder. For the later group,
777 * if the sequence number differs from the expected locked value, they
778 * can assume that the saved HPET value is up-to-date and return it.
779 */
780 static struct {
781 /* Sequence number + bit lock */
782 int seq ____cacheline_aligned_in_smp;
783
784 /* Current HPET value */
785 u32 hpet ____cacheline_aligned_in_smp;
> 786 } hpet_save;
787 #define HPET_SEQ_LOCKED(seq) ((seq) & 1) /* Odd == locked */
788
789 /*
790 * Clock source related code
791 */
792 #ifdef CONFIG_SMP
793 static cycle_t read_hpet(struct clocksource *cs)
794 {
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
On 04/11/2016 04:21 PM, Andy Lutomirski wrote:
> On Mon, Apr 11, 2016 at 1:09 PM, Waiman Long<[email protected]> wrote:
>> [ patch description and code quoted in full above; trimmed ]
>> +
>> + WRITE_ONCE(hpet_save.hpet, time);
> I think a plain store is okay due to the smp_store_release below.
You were right. A plain store should be good enough.
>
>> +
>> + /* Unlock */
>> + smp_store_release(&hpet_save.seq, new + 1);
>> + local_irq_restore(flags);
>> + return (cycle_t)time;
>> + }
>> + local_irq_restore(flags);
>> + seq = new;
>> + }
>> +
>> + /*
>> + * Wait until the locked sequence number changes which indicates
>> + * that the saved HPET value is up-to-date.
>> + */
>> + while (READ_ONCE(hpet_save.seq) == seq) {
>> + /*
>> + * Since reading the HPET is much slower than a single
>> + * cpu_relax() instruction, we use two here in an attempt
>> + * to reduce the amount of cacheline contention in the
>> + * hpet_save.seq cacheline.
>> + */
>> + cpu_relax();
>> + cpu_relax();
>> + }
>> +
>> + return (cycle_t)READ_ONCE(hpet_save.hpet);
>> +}
> I wonder if this could be simplified. Pseudocode:
>
> u32 time;
> unsigned long flags;
>
> local_irq_save(flags);
>
> if (spin_trylock(&hpet_lock)) {
> time = hpet_readl(HPET_COUNTER);
> WRITE_ONCE(last_hpet_counter, time);
You will need a spin_unlock(&hpet_lock) here.
> } else {
> spin_unlock_wait(&hpet_lock);
> /* When this function started, hpet_lock was locked. Now it's
> unlocked, which means that time is at least as new as whatever the
> lock holder returned. */
> time = READ_ONCE(last_hpet_counter);
> }
>
> local_irq_restore(flags);
> return time;
>
> Should be faster under heavy contention, too: spinlocks are very nicely
> optimized.
I don't think it will be faster. The current spinlock code isn't more
optimized than what you can do with a cmpxchg and smp_store_release. In
fact, that is what the spinlock code is actually doing. Other differences
include:
1) A CPU will not do local_irq_save/local_irq_restore when the lock is
not free.
2) My patch also uses a change in the sequence number to indicate that an
updated time stamp is available. So there will be cases where CPUs running
your code have to wait while those running my code can grab the time
stamp and return immediately.
> I *think* this is sufficiently atomic.
>
> --Andy
Cheers,
Longman
On 04/12/2016 12:37 PM, Waiman Long wrote:
> On 04/11/2016 04:21 PM, Andy Lutomirski wrote:
>>
>>> [ read_hpet() code quoted in full above; trimmed ]
>> I wonder if this could be simplified. Pseudocode:
>>
>> u32 time;
>> unsigned long flags;
>>
>> local_irq_save(flags);
>>
>> if (spin_trylock(&hpet_lock)) {
>> time = hpet_readl(HPET_COUNTER);
>> WRITE_ONCE(last_hpet_counter, time);
>
> You will need a spin_unlock(&hpet_lock) here.
>
>> } else {
>> spin_unlock_wait(&hpet_lock);
>> /* When this function started, hpet_lock was locked. Now it's
>> unlocked, which means that time is at least as new as whatever the
>> lock holder returned. */
>> time = READ_ONCE(last_hpet_counter);
>> }
>>
>> local_irq_restore(flags);
>> return time;
>>
>> Should be faster under heavy contention, too: spinlocks are very nicely
>> optimized.
>
> I don't think it will be faster. The current spinlock code isn't more
> optimized than what you can do with a cmpxchg and smp_store_release.
> In fact, that is what the spinlock code is actually doing. Other
> differences include:
>
> 1) A CPU will not do local_irq_save/local_irq_restore when the lock is
> not free.
> 2) My patch also uses a change in the sequence number to indicate that
> an updated time stamp is available. So there will be cases where CPUs
> running your code have to wait while those running my code can grab
> the time stamp and return immediately.
>
Moreover, if the timing is such that right after one CPU releases the
lock the next one grabs it immediately, and there is a continuous stream
of incoming CPUs, the ones waiting for the lock to become free may see
it held for an indefinite period of time. This is certainly not something
we want to have, and it is what the sequence number is for.
Cheers,
Longman