Message-ID: <6b6bd9fb-32dd-343f-1791-c6637b52caf9@huaweicloud.com>
Date:   Mon, 6 Mar 2023 08:58:26 +0100
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
 Thunderbird/102.7.0
Subject: Re: lock_torture results for different patches:
Content-Language: en-US
To:     Antonio Paolillo <antonio.paolillo@huawei.com>, longman@redhat.com
Cc:     David.Laight@ACULAB.COM, akpm@osdl.org, arjan@linux.intel.com,
        boqun.feng@gmail.com, diogo.behrens@huawei.com,
        hernanl.leon@huawei.com, joel@joelfernandes.org,
        jonas.oberhauser@huawei.com, jonas.oberhauser@huaweicloud.com,
        linux-kernel@vger.kernel.org, mingo@redhat.com, paulmck@kernel.org,
        peterz@infradead.org, stable@vger.kernel.org,
        stern@rowland.harvard.edu, tglx@linutronix.de, will@kernel.org
References: <4122ef0d-1508-8ce2-df80-874565a612ce@redhat.com>
 <20230301163214.17530-1-antonio.paolillo@huawei.com>
From:   Hernan Ponce de Leon <hernan.poncedeleon@huaweicloud.com>
In-Reply-To: <20230301163214.17530-1-antonio.paolillo@huawei.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Precedence: bulk

Thanks Antonio for the performance results. Taking both correctness and 
performance into consideration, this is my current understanding of the 
situation.

Removing the acquire barrier does not improve performance and while it 
might be unnecessary, it seems better to play it safe and keep it. That 
being said, a comment about this should be added to the code (Paul 
suggested something like this at some point).

Neither of the proposed fixes to the data race significantly affect 
performance. Using READ_ONCE shows better performance in the AMD 
machine, but using atomic_long_or shows the opposite in the HiSilicon 
one. However, the difference is more prominent using READ_ONCE.

I am not sure I got David's comment. Are you proposing to change the 
definition of rt_mutex_base to

struct rt_mutex_base {
	raw_spinlock_t		wait_lock;
	struct rb_root_cached   waiters;
	/* This the waiting task_struct | RT_MUTEX_HAS_WAITERS. */
	atomic_long_t 		*owner;
};

I don't fully understand the consequences of this. At minimum this would 
require changes in several methods such as the one below, right?

static __always_inline bool rt_mutex_cmpxchg_release(struct 
rt_mutex_base *lock,
						     atomic_long_t  *old,
						     atomic_long_t  *new)

Is there anything else that I am missing?

Hernan

On 3/1/2023 5:32 PM, Antonio Paolillo wrote:
> Dear all,
> 
> I want to provide some support to Hernan regarding performance claims.
> 
> I used lock_torture to evaluate the different proposed patches on two
> different server machines:
> - a Huawei TaiShan 200 (Model 2280) rack server that has 128 GB of RAM
>    and 2x Kunpeng 920-4826 processors, a HiSilicon chip with 48 ARMv8.2
>    64-bit cores totaling 96 cores (no SMT) [1, 2],
>    denoted as taishan200-96c;
> - a GIGABYTE R182-Z91-00 rack server that has 128 GB of RAM and 2x
>    EPYC 7352 processors, an AMD chip with 24 x86_64 cores, totaling 48
>    cores (96 CPUs when counting hyperthreading) [3, 4],
>    denoted as gigabyte-96c.
> 
> I ran the evaluation on a Ubuntu 22.04 distro, with custom kernels based
> on v6.2-rc6 (6d796c50f84ca79f1722bb131799e5a5710c4700).
> The different kernels are combination of patches:
> - (0) Stock kernel;
> - (1) With relaxed set owner barrier (as discussed in [5] and questioned
>    by Peter, the barrier seems not to be needed);
> - (2) With READ_ONCE(), as originally proposed in this thread;
> - (3) With atomic_long_or() as proposed by Peter;
> - (4) With relaxed set owner barrier and READ_ONCE();
> - (5) With relaxed set owner barrier and atomic_long_or().
> 
> I ran lock_torture several times, exploring the following parameter
> space:
> - torture_type="rtmutex_lock",
> - nwriters_stress=[1, 2, 3, 4, 8, 16, 32, 64, 95],
> - stat_interval=4,
> - stutter=0,
> - shuffle_interval=0.
> For each value of "nwriters_stress", I ran the configuration 5 times.
> 
> By feeding the lock_torture kthread pids to "taskset -p", I overruled
> the scheduling such that the distribution of kthreads to CPUs is fixed.
> I also disabled "irq balance" and "numa balance" daemons, fixed the
> frequency to 1.5GHz using the "userspace" cpufreq governor and isolated
> all the cores used (using isolcpus=1-95 at boot-time) to avoid any
> source of interference.
> 
> As a warm-up phase, I ignored the first reported results and only
> considered the latest 60 seconds of execution (after all kthreads
> migrated to their final CPU).
> The reported throughput is computed by dividing the reported number of
> operations by the duration of the measurement for each dot (60 seconds),
> so higher is better.
> 
> Here follows the results on taishan200-96c (the 'rel' column is the mean
> relative to the mean of the stock kernel, in percent, and each mean is
> the average over 5 independent runs):
> 
> Kernel:             k0-stock-6.2.0-rc6       k1-rmacq             k2-readonce             k3-alongor             k4-rmacq+readonce             k5-rmacq+alongor
> Statistic (kops/s):               mean   std     mean   std   rel        mean   std   rel       mean   std   rel              mean   std   rel             mean   std   rel
> nwriters_stress:
> 1                               899.91 24.95   880.10 29.62   -2%      871.57 44.27   -3%     888.65 37.90   -1%            898.63 29.82   -0%           889.83 25.64   -1%
> 2                               359.30 25.92   416.83 32.77  +16%      360.65 28.32   +0%     404.79 42.64  +13%            380.65 21.29   +6%           404.37 23.27  +13%
> 3                               314.97 24.32   308.41  9.68   -2%      315.00  9.97   +0%     313.86 13.47   -0%            313.47  4.01   -0%           322.77 20.82   +2%
> 4                               328.02 15.09   330.65 29.33   +1%      314.83 24.28   -4%     305.71 12.72   -7%            322.95 10.39   -2%           343.32 13.73   +5%
> 8                               292.16 22.03   288.85 10.50   -1%      288.28 18.84   -1%     285.42 24.58   -2%            310.23 26.08   +6%           285.67 20.03   -2%
> 16                              297.03 26.89   281.89 29.22   -5%      265.19 33.73  -11%     279.02 22.43   -6%            284.40 36.21   -4%           285.21 36.33   -4%
> 32                              187.36 28.59   175.71 19.77   -6%      186.44 48.15   -0%     206.59 14.11  +10%            174.08 24.30   -7%           185.80 45.12   -1%
> 64                              148.13 48.65   172.48 34.29  +16%      154.59 47.05   +4%     164.22 29.81  +11%            142.13 47.40   -4%           136.39 29.95   -8%
> 95                              174.35 57.89   148.59 38.03  -15%      156.85 43.64  -10%     132.92 32.35  -24%            126.44 28.24  -27%           146.82 60.04  -16%
> 
> And the results on gigabyte-96c:
> 
> Kernel:             k0-stock-6.2.0-rc6       k1-rmacq               k2-readonce             k3-alongor             k4-rmacq+readonce             k5-rmacq+alongor
> Statistic (kops/s):               mean   std     mean    std    rel        mean   std   rel       mean   std   rel              mean   std   rel             mean    std  rel
> nwriters_stress:
> 1                               713.72 25.68   707.32  17.73    -1%      718.81 12.63   +1%     712.80 13.57   -0%            709.17 14.10   -1%           730.33   9.14  +2%
> 2                               376.25  8.19   400.09  16.24    +6%      396.71 26.09   +5%     412.61 17.80  +10%            396.48  7.02   +5%           409.90  14.61  +9%
> 3                               415.07 16.83   410.19  19.82    -1%      423.39  9.68   +2%     417.28 10.23   +1%            424.94 17.48   +2%           422.92  11.75  +2%
> 4                               286.77 26.63   285.13   6.80    -1%      297.33 23.62   +4%     296.49 16.60   +3%            303.99 30.38   +6%           296.93   9.90  +4%
> 8                               296.56 20.45   308.97  12.53    +4%      305.49 19.91   +3%     294.24 17.24   -1%            294.71 24.03   -1%           294.09  25.20  -1%
> 16                              257.34 33.94   266.03  29.60    +3%      270.72 35.22   +5%     252.28 50.16   -2%            263.83 45.84   +3%           247.42  41.01  -4%
> 32                              278.78 51.45   215.35  68.40   -23%      259.77 87.44   -7%     217.26 79.67  -22%            201.23 70.46  -28%           282.47 116.65  +1%
> 64                               75.82 64.87   194.52 137.19  +157%       35.57 12.14  -53%      74.24 72.04   -2%             71.29 45.55   -6%            77.93  43.57  +3%
> 95                               60.37 68.13   198.38 116.93  +229%       43.12 17.60  -29%      58.80 36.47   -3%             57.78 63.00   -4%            61.33  71.18  +2%
> 
> We can safely conclude that the patches do not significatively affect
> the throughput of the lock_torture benchmark for rtmutex_lock.
> The values for nwriters_stress>=64 can safely be ignored as they are too
> spread.
> 
> Please notice that I pushed a landing page [6] with results in HTML that
> may be more convenient to browse together with interactive charts.
> 
> Cheers,
> 
> Antonio
> 
> [1] https://e.huawei.com/uk/products/servers/taishan-server/taishan-2280-v2
> [2] https://en.wikichip.org/wiki/hisilicon/kunpeng/920-4826
> [3] https://www.gigabyte.com/Rack-Server/R182-Z91-rev-100
> [4] https://www.amd.com/en/products/cpu/amd-epyc-7352
> [5] https://lkml.org/lkml/2023/1/22/160
> [6] https://antonio.paolillo.be/public/rtlocks-locktorture-patches.html