Message-ID: <14e38e95-2bc6-4571-b502-4e3954b4bcc4@linux.dev>
Date: Mon, 22 Jan 2024 18:12:53 +0800
From: Gang Li
To: Muchun Song, David Hildenbrand, David Rientjes, Mike Kravetz, Andrew Morton, Tim Chen
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, ligang.bdlg@bytedance.com
Subject: Re: [PATCH v4 6/7] hugetlb: parallelize 2M hugetlb allocation and initialization
References: <20240118123911.88833-1-gang.li@linux.dev> <20240118123911.88833-7-gang.li@linux.dev>

On 2024/1/22 15:10, Muchun Song wrote:
> On 2024/1/18 20:39, Gang Li wrote:
>> +static void __init hugetlb_alloc_node(unsigned long start, unsigned long end, void *arg)
>>   {
>> -    unsigned long i;
>> +    struct hstate *h = (struct hstate *)arg;
>> +    int i, num = end - start;
>> +    nodemask_t node_alloc_noretry;
>> +    unsigned long flags;
>> +    int next_node = 0;
>
> This should be first_online_node, which may not be zero.
>

That's right. Thanks!

>> -    for (i = 0; i < h->max_huge_pages; ++i) {
>> -        if (!alloc_bootmem_huge_page(h, NUMA_NO_NODE))
>> +    /* Bit mask controlling how hard we retry per-node allocations. */
>> +    nodes_clear(node_alloc_noretry);
>> +
>> +    for (i = 0; i < num; ++i) {
>> +        struct folio *folio = alloc_pool_huge_folio(h, &node_states[N_MEMORY],
>> +                        &node_alloc_noretry, &next_node);
>> +        if (!folio)
>>               break;
>> +        spin_lock_irqsave(&hugetlb_lock, flags);
>
> I suspect there will be more contention on this lock when parallelizing.

In the worst case, only as many threads as there are NUMA nodes contend
for this lock. In my testing it does not degrade performance; it actually
improves performance because of the reduced granularity.

> I want to know why you chose to drop the prep_and_add_allocated_folios()
> call in the original hugetlb_pages_alloc_boot()?

It is split out so that hugetlb_vmemmap_optimize_folios() can be
parallelized as well.

>> +static unsigned long __init hugetlb_pages_alloc_boot(struct hstate *h)
>> +{
>> +    struct padata_mt_job job = {
>> +        .fn_arg        = h,
>> +        .align         = 1,
>> +        .numa_aware    = true
>> +    };
>> +
>> +    job.thread_fn   = hugetlb_alloc_node;
>> +    job.start       = 0;
>> +    job.size        = h->max_huge_pages;
>> +    job.min_chunk   = h->max_huge_pages / num_node_state(N_MEMORY) / 2;
>> +    job.max_threads = num_node_state(N_MEMORY) * 2;
>
> I am curious about the magic number 2 used in the assignments of
> ->min_chunk and ->max_threads; does it come from your experiments? I
> think it should be explained in a comment here.
>

This has been tested; I can run more detailed tests and provide the data.
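To make the effect of those two formulas easier to see, here is a small
userspace model (illustration only, not code from the patch; the page
and node counts below are made-up example inputs) that just evaluates
->min_chunk and ->max_threads:

#include <stdio.h>

/* Evaluate the ->min_chunk / ->max_threads formulas from
 * hugetlb_pages_alloc_boot() for example inputs. */
static void show(unsigned long max_huge_pages, unsigned long nr_nodes)
{
	unsigned long min_chunk = max_huge_pages / nr_nodes / 2;
	unsigned long max_threads = nr_nodes * 2;

	printf("pages=%lu nodes=%lu -> min_chunk=%lu max_threads=%lu\n",
	       max_huge_pages, nr_nodes, min_chunk, max_threads);
}

int main(void)
{
	show(262144, 4);	/* e.g. 512 GB of 2M pages on a 4-node machine */
	show(1024, 2);		/* a smaller pool on a 2-node machine */
	return 0;
}

In other words, the factor of 2 allows up to two workers per node, with
each chunk being at least half of a node's share of the pool.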
> And I am also sceptical about the optimization for a small number of
> hugepage allocations. Given 4 hugepages to be allocated on a UMA
> system, job.min_chunk will be 2 and job.max_threads will be 2. Then 2
> workers will be scheduled, but each worker will only allocate 2 pages;
> how much does the scheduling cost? What if the 4 pages were allocated
> in a single worker? Do you have any numbers on parallelism vs.
> non-parallelism in a small-allocation case? If we cannot gain in this
> case, I think we should assign a reasonable value to ->min_chunk based
> on experiments.
>
> Thanks.
>

That's a good suggestion. I'll run some tests and choose the best values.
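One direction I can try is clamping ->min_chunk so that tiny pools stay
in a single worker. The sketch below is only a userspace model of the
idea, not a patch: MIN_CHUNK_FLOOR is a made-up tuning value, and the
worker-count estimate is a simplification of padata's real heuristic.

#include <stdio.h>

#define MIN_CHUNK_FLOOR 16UL	/* hypothetical floor, to be chosen by testing */

static unsigned long max_ul(unsigned long a, unsigned long b)
{
	return a > b ? a : b;
}

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

static void show(unsigned long pages, unsigned long nodes)
{
	unsigned long min_chunk = max_ul(pages / nodes / 2, MIN_CHUNK_FLOOR);
	unsigned long max_threads = nodes * 2;
	/* Rough estimate: about size/min_chunk workers, capped at max_threads. */
	unsigned long workers = min_ul(max_threads, max_ul(pages / min_chunk, 1));

	printf("pages=%lu nodes=%lu -> min_chunk=%lu workers~=%lu\n",
	       pages, nodes, min_chunk, workers);
}

int main(void)
{
	show(4, 1);	/* the small UMA case above: a single worker */
	show(65536, 4);	/* a large pool still fans out across workers */
	return 0;
}

With a floor like that, the 4-page case runs in one worker while large
pools still spread across nodes; the actual floor value would come from
the tests I mentioned.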