2022-03-25 03:02:15

by Ray Fucillo

Subject: Re: scalability regressions related to hugetlb_fault() changes


> On Mar 24, 2022, at 6:41 PM, Mike Kravetz <[email protected]> wrote:
>
> I also seem to remember thinking about the possibility of
> avoiding the synchronization if pmd sharing was not possible. That may be
> a relatively easy way to speed things up. Not sure if pmd sharing comes
> into play in your customer environments, my guess would be yes (shared
> mappings ranges more than 1GB in size and aligned to 1GB).

Hi Mike,

This is one very large shared memory segment allocated at database startup. It's common for it to be hundreds of GB. We allocate it with shmget() passing SHM_HUGETLB (when huge pages have been reserved for us). Not sure if that answers...

> Also, do you have any specifics about the regressions your customers are
> seeing? Specifically what paths are holding i_mmap_rwsem in write mode
> for long periods of time. I would expect something related to unmap.
> Truncation can have long hold times especially if there are many shared
> mappings. Always worth checking specifics, but more likely this is a general
> issue.

We've seen the write lock originate from calling shmat(), shmdt() and process exit. We've also seen it from a fork() off of one of the processes that are attached to the shared memory segment. Some evidence suggests that fork is a more costly case. However, while there are some important places where we'd use fork(), it's more unusual because most process creation will vfork() and execv() a new database process (which then attaches with shmat()).


2022-03-25 17:55:01

by Mike Kravetz

Subject: Re: scalability regressions related to hugetlb_fault() changes

On 3/24/22 17:02, Ray Fucillo wrote:
>
>> On Mar 24, 2022, at 6:41 PM, Mike Kravetz <[email protected]> wrote:
>>
>> I also seem to remember thinking about the possibility of
>> avoiding the synchronization if pmd sharing was not possible. That may be
>> a relatively easy way to speed things up. Not sure if pmd sharing comes
>> into play in your customer environments, my guess would be yes (shared
>> mappings ranges more than 1GB in size and aligned to 1GB).
>
> Hi Mike,
>
> This is one very large shared memory segment allocated at database startup. It's common for it to be hundreds of GB. We allocate it with shmget() passing SHM_HUGETLB (when huge pages have been reserved for us). Not sure if that answers...

Yes, so there would be shared pmds for that large shared mapping. I assume
this is x86 or arm64, which are the only architectures that support shared
pmds.

So, the easy change of "don't take the semaphore if pmd sharing is not
possible" would not apply.

>> Also, do you have any specifics about the regressions your customers are
>> seeing? Specifically what paths are holding i_mmap_rwsem in write mode
>> for long periods of time. I would expect something related to unmap.
>> Truncation can have long hold times especially if there are many shared
>> mappings. Always worth checking specifics, but more likely this is a general
>> issue.
>
> We've seen the write lock originate from calling shmat(), shmdt() and process exit. We've also seen it from a fork() off of one of the processes that are attached to the shared memory segment. Some evidence suggests that fork is a more costly case. However, while there are some important places where we'd use fork(), it's more unusual because most process creation will vfork() and execv() a new database process (which then attaches with shmat()).

Thanks.

I will continue to look at this. A quick check of the fork code shows the
semaphore held in read mode for the duration of the page table copy.
--
Mike Kravetz

2022-03-25 19:13:16

by Ray Fucillo

Subject: Re: scalability regressions related to hugetlb_fault() changes


> On Mar 25, 2022, at 12:40 AM, Mike Kravetz <[email protected]> wrote:
>
> I will continue to look at this. A quick check of the fork code shows the
> semaphore held in read mode for the duration of the page table copy.

Thank you for looking into it.

As a side note about fork() for context, and not to distract from the
regression at hand... There's some history here where we ran into problems
circa 2005 where fork time went linear with the size of shared memory, and
that was resolved by letting the pages fault in the child. This was when
hugetlb was pretty new (and not used by us) and I see now that the fix
explicitly excluded hugetlb. Anyway, we now mostly use vfork(), only fork()
in some special cases, and improving just fork wouldn't fix the scalability
regression for us. But it does sound like fork() time might be getting
large again now that very large hugetlb shared segments are common, while
most deployments haven't switched to 1GB pages. That old thread is:
https://lkml.org/lkml/2005/8/24/190

2022-03-28 21:22:58

by Mike Kravetz

Subject: Re: scalability regressions related to hugetlb_fault() changes

On 3/25/22 06:33, Ray Fucillo wrote:
>
>> On Mar 25, 2022, at 12:40 AM, Mike Kravetz <[email protected]> wrote:
>>
>> I will continue to look at this. A quick check of the fork code shows the
>> semaphore held in read mode for the duration of the page table copy.
>
> Thank you for looking into it.
>

Adding some mm people on cc:
Just a quick update on some thoughts and possible approach.

Note that regressions were noted when code was originally added to take
i_mmap_rwsem at fault time. A limited way of addressing the issue was
proposed here:
https://lore.kernel.org/linux-mm/[email protected]/

I do not think such a change would help in this case, as the hugetlb pages
are used via a shared memory segment. Hence, sharing (and pmd sharing) is
happening.

After some thought, I believe the synchronization needed for pmd sharing
as outlined in commit c0d0381ade79 is limited to a single address
space/mm_struct. We only need to worry about one thread of a process
causing an unshare while another thread in the same process is faulting.
That is because the unshare only tears down the page tables in the calling
process. Also, the page table modifications associated with pmd sharing are
constrained by the virtual address range of a vma describing the sharable
area. Therefore, pmd sharing synchronization can be done at the vma level.

My 'plan' is to hang a rw_sema off the vm_private_data of hugetlb vmas that
can possibly have shared pmds. We will use this new semaphore instead of
i_mmap_rwsem at fault and pmd_unshare time. The only time we should see
contention on this semaphore is if one thread of a process is doing something
to cause unsharing for an address range while another thread is faulting in
the same range. This seems unlikely, and far less common than one
process unmapping pages while another process wants to fault them in on a
large shared area.

There will also be a little code shuffling as the fault code is also
synchronized with truncation and hole punch via i_mmap_rwsem. But, this is
much easier to address.

Comments or other suggestions welcome.
--
Mike Kravetz