Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20;
Message-ID: <77632c46-edd1-36bc-1f2b-d3c82b0d5ebc@amd.com>
Date:   Mon, 24 Jul 2023 13:11:51 +0530
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
 Thunderbird/91.13.1
Subject: Re: [RFC PATCH V3 1/1] sched/numa: Fix disjoint set vma scan
 regression
Content-Language: en-US
To:     Mel Gorman <mgorman@suse.de>
Cc:     linux-kernel@vger.kernel.org, linux-mm@kvack.org,
        Ingo Molnar <mingo@redhat.com>,
        Peter Zijlstra <peterz@infradead.org>,
        Andrew Morton <akpm@linux-foundation.org>,
        David Hildenbrand <david@redhat.com>, rppt@kernel.org,
        Juri Lelli <juri.lelli@redhat.com>,
        Vincent Guittot <vincent.guittot@linaro.org>,
        Bharata B Rao <bharata@amd.com>,
        Aithal Srikanth <sraithal@amd.com>,
        kernel test robot <oliver.sang@intel.com>
References: <cover.1685506205.git.raghavendra.kt@amd.com>
 <8581ca937d4064b3cd138845d5bd418580d177da.1685506205.git.raghavendra.kt@amd.com>
 <20230721151834.yaymi6emrppl3mzl@suse.de>
From:   Raghavendra K T <raghavendra.kt@amd.com>
In-Reply-To: <20230721151834.yaymi6emrppl3mzl@suse.de>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
MIME-Version: 1.0
X-MS-Exchange-AntiSpam-MessageData-ChunkCount: 1
X-MS-Exchange-AntiSpam-MessageData-0: =?utf-8?B?M1JnNFFmcFZtM0NTdFJDRUtmbWdhQWtKWWc5YkNha2xhU2dvcUYveE9uVVZQ?=
 =?utf-8?B?S0tzT2k5MmhrajlVNldXUmN3cU4wVlQvOFNsUXRyZGdGZXl2bFVscUVJSkZW?=
 =?utf-8?B?Ym12dWowZVdXeVk3Q2QzMjQ2aERReW1ZOWo1NTZ6ZW10UHUvek9raUx0TlZD?=
 =?utf-8?B?K0lQVFRmT3cwV0plSVpWZHFqS24ycFYzT3dIWFlJalFEZXFDZU5QamlFbkps?=
 =?utf-8?B?UjV1aktGNExCSHl5dDlhSVVyclJrRXg3eUhrekZ6SWZQS2l5QmptWE02U1d5?=
 =?utf-8?B?UjRoQXR5SU9VL2dzbENuais5SThYOE8wMDRSYUZTcTBjbkkvQlVXRzVlVENL?=
 =?utf-8?B?eVR3TWluSVJReitjajJoYzVNbjJhSEN6ZFdGVTJmMk5rTEhFK1lrNzBjZnVL?=
 =?utf-8?B?eFIwUkpMR1JwbDdlYzAxTkhFUy9pb3lUaCsweFA5dUZRT1hxTHFxVVA1SjdM?=
 =?utf-8?B?RUpYc1ovUURGOXVEVm9PcFAwTnordFZjOEYxOEhwdkFveHhqcjlkMEcvVjNW?=
 =?utf-8?B?aXRUQ2FrdWtFdkc3bWFINWFJZGFVRTNUeDE1ZjhEaTRQbHZldHV4ekpPTW9S?=
 =?utf-8?B?T3l2eGdrRUljMGR1VkFsZ2M5UEcva25CVjN6QVgwc3NRUzV0MDkxenpPM1ha?=
 =?utf-8?B?VjMrYmdkQ2EwVk82MWZzbGdaajloQnlMbCtSU0tZZmk2djJBamxkNVBpOXVE?=
 =?utf-8?B?Ym5yODdPeklTYUdJeG1oOUN6VWhHSU4zK2hIV1JWbitnbmdaVUJxc1lsL2Fr?=
 =?utf-8?B?QmtkaXRWR1I3SEE1eTJtRHJ6OGFHakNLdEZvQmNST3ExRko3cFpsVmVEWnp0?=
 =?utf-8?B?UUNBN1Vvb2RZREFtcThLTi9QMVB4dzdKc0t5R0U5QnUweGlZbUNZeFJZeFpq?=
 =?utf-8?B?OVQrR25hZkJIb09oL1IxVEVBanhlcHQ0MEZIWVZOcmtjbDUwbTBFNFoxa3Mx?=
 =?utf-8?B?bXRndzlmTjZyUzBVZ1ZvZlk3MEtZZ2J3TXFsdzZPbzBWaVFWRGhwS3hCZkRB?=
 =?utf-8?B?bFpSa011TExvM3c4dndrcXZzS29jdWpnNkhnbmluUlo5OFVyS1NQV2ROV002?=
 =?utf-8?B?RmdKOWZWU0JEczRZQlc2WVcwWjJ0UjRqUUI1UjVlOFhWMTFPcmgvcHBCam4w?=
 =?utf-8?B?T3h6eDhHSE1qTVdlbGZ5STZyNklab0VUYWJQWGZ3UzVDVy9wclBmKzVuZFh6?=
 =?utf-8?B?a1dxS3Q1OUJtczVSWHhNTnltZkl5cU1CQ09NRHhkdGxXTXJVdC9CTytYUEdV?=
 =?utf-8?B?aU1SSnhQWXNIeWx1L1ltVzFCNklxbTB5dzVwdVUwa0VIS3IzYWh0Nm8zeFpD?=
 =?utf-8?B?elBsenkzNVlrMmRWQi9WL05FQlk4VHB6dXg2UU01ZmhwR0VDSGdKdE5yeGM3?=
 =?utf-8?B?NlpSNEFLd2wyOE1DVkhZN05lNnc4YnROeTlrTWZyaUY2UnBqY09rTXRSMkQr?=
 =?utf-8?B?d1hONVdueDIzMFl0Q1ZvVldjOGg4WU9KdzQvZDNwWnBIOTVkeThQbFIzRWty?=
 =?utf-8?B?ZmlzRmkvdTVlTVVUU0w1dlVTUURYeS9QSStrUEdaWTJTNlVpbVBRUitUZHNL?=
 =?utf-8?B?MDdBQStDWW9QL3pSRjhBTVJYRE5jazNzbmU4V1U1RFM0U21SZGJUUUFPVTdF?=
 =?utf-8?B?MXRUVkphVTlOTUZiUStTc1JKdThsbG1WbStGQVhBTXJPdjJjYVNaT0ZYL29v?=
 =?utf-8?B?Rjl4Z0ptc2FSQ2FrR05HUzJYMmxrVnVLa3YyaUJCd1lWTERLT3pFVkZZTXZH?=
 =?utf-8?B?aU5PcVRKOGdCcmtsS1ZKU214VVdOd21UQ1RMK1NjWFpvMDhZZmNLT040QXZy?=
 =?utf-8?B?TDhwaGNnK252WWZWS1QrOG14YlhZek1uNVR2VUUwblFZdkErd2JKZGVFY3A3?=
 =?utf-8?B?WEFXeHQvTHJXK2JESHFEbzIyZFNtZkhnaElZSDI3L214MDlUMVNueUhaVUtq?=
 =?utf-8?B?MGhzQ29CMG1ucjJqU0N1QlB4amFCS0VlTXlOV1F5cmlzRzhqU0VXczJPbkJn?=
 =?utf-8?B?K2ZvSjNhS0pzYndib3gwcktQd2NPMldnazVCQk9rQlVZN254bSt5QURza21D?=
 =?utf-8?B?RlZrY3VHY2FEazBaUDNRTFQvS0NGSHBuTEI4elN0bm5ncjUrREtVYVRURnIv?=
 =?utf-8?Q?vzH1cTKLH6oGmHzDprSwVOTQe?=
X-OriginatorOrg: amd.com
X-MS-Exchange-CrossTenant-Network-Message-Id: ecf63d03-9042-4f3c-0a8d-08db8c197b33
X-MS-Exchange-CrossTenant-AuthSource: MN2PR12MB3008.namprd12.prod.outlook.com
X-MS-Exchange-CrossTenant-AuthAs: Internal
X-MS-Exchange-CrossTenant-OriginalArrivalTime: 24 Jul 2023 07:42:07.5681
 (UTC)
X-MS-Exchange-CrossTenant-FromEntityHeader: Hosted
X-MS-Exchange-CrossTenant-Id: 3dd8961f-e488-4e60-8e11-a82d994e183d
X-MS-Exchange-CrossTenant-MailboxType: HOSTED
X-MS-Exchange-CrossTenant-UserPrincipalName: uBwa+64I1LfwWWp8CiBr9Goqha/d0Vgtxz41LYJ8aMKHLONM86G7eFqgQ0lj22ZbHijGY/NMUoA4RFQCQtz8hA==
X-MS-Exchange-Transport-CrossTenantHeadersStamped: PH7PR12MB5949
X-Spam-Status: No, score=-2.2 required=5.0 tests=BAYES_00,DKIM_SIGNED,
        DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,FORGED_SPF_HELO,NICE_REPLY_A,
        RCVD_IN_DNSWL_NONE,RCVD_IN_MSPIKE_H2,SPF_HELO_PASS,SPF_NONE,
        T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on
        lindbergh.monkeyblade.net
Precedence: bulk
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On 7/21/2023 8:48 PM, Mel Gorman wrote:
> On Wed, May 31, 2023 at 09:55:26AM +0530, Raghavendra K T wrote:
>>   With the numa scan enhancements [1], only the threads which had previously
>> accessed vma are allowed to scan.
>>
> 
> I know it's very late to be reading this properly for the first time.  It's a
> brief review as it's late but I'm not 100% convinced by the patch as-is.
> 

Thank you Mel for the review.  Expert opinion / intuition from you,
PeterZ helped, helps alot to improve to solve the problem in better way 
finally.

> To start off -- reference the specific commit e.g. Since commit fc137c0ddab2
> ("sched/numa: enhance vma scanning logic")...
> 

Okay.

>> While this had improved significant system time overhead, there were corner
>> cases, which genuinely need some relaxation. For e.g.,
>>
>> 1) Concern raised by PeterZ, where if there are N partition sets of vmas
>> belonging to tasks, then unfairness in allowing these threads to scan could
>> potentially amplify the side effect of some of the vmas being left
>> unscanned.
>>
> 
> This needs a much better description as there is too much guesswork
> required to figure out what problem is being solved. 

Agree I needed to be more descriptive.

You may be referring
> to an issue where a VMA not accessed for a long time may be skipped for
> scanning indefinitely. 
I feel this is mostly a favorable case because we don't incur
overhead of scanning unneeded VMA. so not this, but below ones

You could also be referring to an issue where a
> highly active thread always traps the NUMA hinting fault first and other
> tasks never pass the vma_is_accessed test. It also could be the case that
> due to a malloc implementation behaviour that the thread using a particular
> VMA changes and it's not detected. It's simply not clear how the VMAs are
> partitioned or what exactly is unfair. I certainly can guess a number of
> reasons why a patch like this is necessary but in a years time, it'll be
> hard to be certain about the original intent.

Agree, to summarize
here there are multiple layers of issues

1) PeterZ pointed this

(Thread1/Set1)---->access(vma1) <== gets  65% opportunity

(Thread2/Set2)---->access(vma2) <== gets 35% opprtunity

Suppose Thread1 (or threads from Set1) are actively getting more 
scanning opportunity, we may miss or scan less of vma2.

I did some experiment and found there was indeed a bit of unfairness in
the way set1/set2 gets opportunity to scan.

Link: 
https://lore.kernel.org/lkml/c730dee0-a711-8a8e-3eb1-1bfdd21e6add@amd.com/

2) Suppose for the above case if we have
vma1, vma2 represents sum of vmas accessed by set1 and set2 respectively 
in disjoint way.

and vma1 size is far less than vma2 size, effect of unfairness in 
scanning gets amplified. e.g.,

(Thread1/Set1)---->access(vma1 size = 1MB)  <== actively scanning

(Thread2/Set2)---->access(vma2 size = 10MB)


3) malloc behavior you pointed.

I was trying to minimize the side effect of unfairness in scanning
overall by introducing a bit of unconditional scans.

(But the unfairness amongst the threads, which thread (or set of thread) 
gets opportunity to scan remains)

I may be missing some more cases here

> 
>> 2) Below reports of LKP numa01 benchmark regression.
>>
>> Currently this is handled by allowing first two scanning unconditional
>> as indicated by mm->numa_scan_seq. This is imprecise since for some
>> benchmark vma scanning might itself start at numa_scan_seq > 2.
>>
> 
> Well, it's also not useful in many cases. There is nothing special about
> the first two scans other than they happen early in the lifetime of the
> process. A major change in phase behaviour that occurs after 2 scans
> could miss the new access pattern.

Fully agree. That is why that condition is removed in the patch now.
What I also saw was some times when new vma gets created
mm->numa_scan_seq could be already more than 2.
(To handle that correctly I would have to have per vma start_numa_seq
kind of things etc which was messy and more space consuming)

Note:
Somewhat related, In my experiments what I saw is normal short
running benchmarks used to finish by numa_scan_seq < 16 whereas some 
benchmark  like NAS used to go till 100-150. (to set context for any 
long running  benchmark optimization in future)

> 
>> Solution:
>> Allow unconditional scanning of vmas of tasks depending on vma size. This
>> is achieved by maintaining a per vma scan counter, where
>>
>> f(allowed_to_scan) = f(scan_counter <  vma_size / scan_size)
>>
> 
> This is a vague description as well and does not mention that the
> scan_counter resets and the code does not appear to match the equation.
> It's also a bit arbitrary that unconditional scanning occurs for every task
> after the scan_count passes a threshold. The timing of when this occurs may
> vary and all threads sharing the address space may conduct the scanning
> which may be overkill. It should only be necessary for at least one task
> to unconditionally mark the VMA for hinting, no? I'm generally a bit wary
> that the time it takes to detect a VMA needs unconditonal scanning depends
> on the size as due to VMA-merging,

I do agree that changing of vma size due to merge/split can affect
scanning. But I was assuming that it is Okay because this calculation is
done every time we hit task_numa_work which may be wrong (and imprecise).

  it may be non-deterministic when a VMA
> gets scanned and there may be run-to-run variance to the timing of threads
> allocating address space >

Agree

>> Result:
>> numa01_THREAD_ALLOC result on 6.4.0-rc2 (that has numascan enhancement)
>>                  	base-numascan	base		base+fix
>> real    		1m1.507s	1m23.259s	1m2.632s
>> user    		213m51.336s	251m46.363s	220m35.528s
>> sys     		3m3.397s	0m12.492s	2m41.393s
>>
>> numa_hit 		5615517		4560123		4963875
>> numa_local 		5615505		4560024		4963700
>> numa_other 		12		99		175
>> numa_pte_updates 	1822797		493		1559111
>> numa_hint_faults 	1307113		523		1469031
>> numa_hint_faults_local 	612617		488		884829
>> numa_pages_migrated 	694370		35		584202
>>
> 
> It's not clear what these kernels are. Is base-numascan a 6.4-rc2 kernel
> with at least commit fc137c0ddab2 ("sched/numa: enhance vma scanning
> logic") reverted?

This is correct. To compare the system time overhead before commit
fc137c0ddab2 ("sched/numa: enhance vma scanning logic")

> The numa_[hint|local|other] stats are not helpful. They are updated on
> the allocation side and not related to actual behaviour.

will trim this next time.

> 
> As I'm unsure what the kernels are, I'm not sure how interesting it is
> that the pte update stats and number of faults trapped are very different
> for base to base+fix. 

Number of faults trapped are excessively refined for base (6.4.0-rc2 
vanilla) for this particular case, because of malloc behaviour you 
pointed above.

Sure enough, the overall runtime is lower so it's
> *probably* good but not sure as it may be also the case that disabling NUMA
> balancing would be faster again given the nature of this particular test :P
> 
> The testing is a little thin and I'd worry slightly that this patch is
> very specific to this particular workload.

I do agree that it is mostly trying to address LKP numa01_THREAD_ALLOC
case.
But regarding the testing though I reported only above result,
I do run
1) kernbench, autonuma from mmtest,
2) some microbenchmarks and also  NAS/hashjoin etc long running
benchmarks to ensure there are no surprises.

> 
>> Summary: Regression in base is recovered by allowing scanning as required.
>>
>> [1] https://lore.kernel.org/lkml/cover.1677672277.git.raghavendra.kt@amd.com/T/#t
>>
>> Fixes: fc137c0ddab2 ("sched/numa: enhance vma scanning logic")
>> regression.
> 
> Fixes may be a bit overkill as this patch is more of an enhancement than
> something that justifies a backport to -stable but I don't feel strongly
> about it.

Sure.

> 
>> Reported-by: Aithal Srikanth <sraithal@amd.com>
>> Reported-by: kernel test robot <oliver.sang@intel.com>
>> Closes: https://lore.kernel.org/lkml/db995c11-08ba-9abf-812f-01407f70a5d4@amd.com/T/
>> Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
>> ---
>>   include/linux/mm_types.h |  1 +
>>   kernel/sched/fair.c      | 31 ++++++++++++++++++++++++-------
>>   2 files changed, 25 insertions(+), 7 deletions(-)
>>
>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>> index 306a3d1a0fa6..992e460a713e 100644
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -479,6 +479,7 @@ struct vma_numab_state {
>>   	unsigned long next_scan;
>>   	unsigned long next_pid_reset;
>>   	unsigned long access_pids[2];
>> +	unsigned int scan_counter;
>>   };
> 
> Vague name as it's not counting and gets reset. It might have been clearer
> to name is something like scan_selective with an initial value related to
> the VMA size that decrements. When it hits 0, the scan is forced *once*
> for the unlucky task. The suggested name is also bad, I'm terrible at
> naming but "counter" gives no hints either.
> 

Agree that naming is bad, will use scan_selective / or try if I get more
idea on naming.
So you are suggesting to try other-way where we force scan once we hit
zero. Looks fair to me. will try the idea and see how it goes.

>>   
>>   /*
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 373ff5f55884..4e71fb58085b 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -2931,17 +2931,30 @@ static void reset_ptenuma_scan(struct task_struct *p)
>>   static bool vma_is_accessed(struct vm_area_struct *vma)
>>   {
>>   	unsigned long pids;
>> +	unsigned int vma_size;
>> +	unsigned int scan_threshold;
>> +	unsigned int scan_size;
>> +
>> +	pids = vma->numab_state->access_pids[0] | vma->numab_state->access_pids[1];
>> +
>> +	if (test_bit(hash_32(current->pid, ilog2(BITS_PER_LONG)), &pids))
>> +		return true;
>> +
>> +	scan_size = READ_ONCE(sysctl_numa_balancing_scan_size);
>> +	/* vma size in MB */
>> +	vma_size = (vma->vm_end - vma->vm_start) >> 20;
>> +
>> +	/* Total scans needed to cover VMA */
>> +	scan_threshold = vma_size / scan_size;
>> +
>>   	/*
>> -	 * Allow unconditional access first two times, so that all the (pages)
>> -	 * of VMAs get prot_none fault introduced irrespective of accesses.
>> +	 * Allow the scanning of half of disjoint set's VMA to induce
>> +	 * prot_none fault irrespective of accesses.
>>   	 * This is also done to avoid any side effect of task scanning
>>   	 * amplifying the unfairness of disjoint set of VMAs' access.
>>   	 */
> 
> Much better description is required. It's not stated how the VMAs are
> disjoint. For example, they all belong to the same address space so from that
> perspective they are not disjoint. Clarify that it's related to accesses
> at the very least or be clear on how a set of VMAs can be disjoint. Also,
> it's not half the total VMAs or some other critieria either, it's related
> to scanning events. There is a lot of guesswork involved here.

Sure. will add more details.
ALso I was not clear here where does that half came from. Idea was
allowing unconditional scanning = vma_size for a unrelated task that has
not accessed vma is too much.. since we are  letting unconditional scans 
let it be reduced to atleast half of that. I will experiment a bit here
to get optimal f(vma_size/scan_size)

> 
>> -	if (READ_ONCE(current->mm->numa_scan_seq) < 2)
>> -		return true;
>> -
>> -	pids = vma->numab_state->access_pids[0] | vma->numab_state->access_pids[1];
>> -	return test_bit(hash_32(current->pid, ilog2(BITS_PER_LONG)), &pids);
>> +	scan_threshold = 1 + (scan_threshold >> 1);
>> +	return (vma->numab_state->scan_counter < scan_threshold);
>>   }
>>   
>>   #define VMA_PID_RESET_PERIOD (4 * sysctl_numa_balancing_scan_delay)
>> @@ -3058,6 +3071,8 @@ static void task_numa_work(struct callback_head *work)
>>   			/* Reset happens after 4 times scan delay of scan start */
>>   			vma->numab_state->next_pid_reset =  vma->numab_state->next_scan +
>>   				msecs_to_jiffies(VMA_PID_RESET_PERIOD);
>> +
>> +			vma->numab_state->scan_counter = 0;
>>   		}
>>   
>>   		/*
>> @@ -3084,6 +3099,8 @@ static void task_numa_work(struct callback_head *work)
>>   			vma->numab_state->access_pids[1] = 0;
>>   		}
>>   
>> +		vma->numab_state->scan_counter++;
>> +
>>   		do {
>>   			start = max(start, vma->vm_start);
>>   			end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
> 
> This has a potential timing issue as only tasks that accessed the VMA
> update the counter. 

I think there is a confusion because of reverted logic. we do allow
unconditionally to let the scan happen initially till scan_counter 
reaches scan_threshold. So even though task did not access vma, we get
to scan initially.

So idea is update scan_counter only if we are able to successfully come 
till here because we are sure that there is not condition that returns 
back after this without scanning.

A task could create a VMA, access is heavily in an
> init phase and then go completely idle waiting on other tasks to complete
> (e.g. parallelised load that uses a single thread to setup the data and then
> creates worker threads to process that data). As there is no guarantee other
> tasks would trap a fault, the access bits would never be set, the counter
> never increments and the VMA is ignored forever. It's late on a Friday so
> maybe I'm wrong but it looks like this could have weird corner cases.
> 
> The counter should at least increase if the VMA could have been scanned
> except the access bits were not set with the caveat that it may incur
> excessive cache misses.
> 

Explained above

Thanks a lot,
Raghu