Pavel Tatashin, Ying Huang, and I are excited to be organizing a performance and scalability microconference at this year's Plumbers[*], which is happening in Vancouver. The microconference is scheduled for the morning of the second day (Wed, Nov 14).
We have a preliminary agenda and a list of confirmed and interested attendees (cc'ed), and are seeking more of both!
Some of the items on the agenda as it stands now are:
- Promoting huge page usage: With memory sizes becoming ever larger, huge pages are becoming more and more important to reduce TLB misses and the overhead of memory management itself--that is, to make the system scalable with the memory size. But there are still some remaining gaps that prevent huge pages from being deployed in some situations, such as huge page allocation latency and memory fragmentation.
- Reducing the number of users of mmap_sem: This semaphore is frequently used throughout the kernel. In order to facilitate scaling this longstanding bottleneck, these uses should be documented and unnecessary users should be fixed. (A typical reader-side use is sketched just after this list.)
- Parallelizing cpu-intensive kernel work: Resolve problems of past approaches including extra threads interfering with other processes, playing well with power management, and proper cgroup accounting for the extra threads. Bonus topic: proper accounting of workqueue threads running on behalf of cgroups.
- Preserving userland during kexec with a hibernation-like mechanism.
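To make the mmap_sem item concrete, a typical reader-side user looks roughly like the sketch below. It is a generic illustration against a 2018-era kernel (rw_semaphore mmap_sem, linked VMA list), not any particular call site:

#include <linux/mm.h>
#include <linux/mm_types.h>
#include <linux/rwsem.h>

/* Hypothetical helper: walk a task's VMAs read-only. */
static void walk_vmas(struct mm_struct *mm)
{
	struct vm_area_struct *vma;

	down_read(&mm->mmap_sem);	/* serializes against mmap/munmap/mprotect */
	for (vma = mm->mmap; vma; vma = vma->vm_next) {
		/* read-only inspection of each VMA */
	}
	up_read(&mm->mmap_sem);
}

Every reader like this contends with writers in the mmap/munmap/brk paths, which is why documenting these users and removing the unnecessary ones matters for scalability.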
These center around our interests, but having lots of topics to choose from ensures we cover what's most important to the community, so we would like to hear about additional topics and extensions to those listed here. This includes, but is certainly not limited to, work in progress that would benefit from in-person discussion, real-world performance problems, and experimental and academic work.
If you haven't already done so, please let us know if you are interested in attending, or have suggestions for other attendees.
Thanks,
Daniel
[*] https://blog.linuxplumbersconf.org/2018/performance-mc/
On Tue, Sep 04, 2018 at 05:28:13PM -0400, Daniel Jordan wrote:
> Pavel Tatashin, Ying Huang, and I are excited to be organizing a performance and scalability microconference this year at Plumbers[*], which is happening in Vancouver this year. The microconference is scheduled for the morning of the second day (Wed, Nov 14).
>
> We have a preliminary agenda and a list of confirmed and interested attendees (cc'ed), and are seeking more of both!
>
> Some of the items on the agenda as it stands now are:
>
> - Promoting huge page usage: With memory sizes becoming ever larger, huge pages are becoming more and more important to reduce TLB misses and the overhead of memory management itself--that is, to make the system scalable with the memory size. But there are still some remaining gaps that prevent huge pages from being deployed in some situations, such as huge page allocation latency and memory fragmentation.
>
> - Reducing the number of users of mmap_sem: This semaphore is frequently used throughout the kernel. In order to facilitate scaling this longstanding bottleneck, these uses should be documented and unnecessary users should be fixed.
>
> - Parallelizing cpu-intensive kernel work: Resolve problems of past approaches including extra threads interfering with other processes, playing well with power management, and proper cgroup accounting for the extra threads. Bonus topic: proper accounting of workqueue threads running on behalf of cgroups.
>
> - Preserving userland during kexec with a hibernation-like mechanism.
Just some crazy idea: have you considered using checkpoint-restore as a
replacement or an addition to hibernation?
> These center around our interests, but having lots of topics to choose from ensures we cover what's most important to the community, so we would like to hear about additional topics and extensions to those listed here. This includes, but is certainly not limited to, work in progress that would benefit from in-person discussion, real-world performance problems, and experimental and academic work.
>
> If you haven't already done so, please let us know if you are interested in attending, or have suggestions for other attendees.
>
> Thanks,
> Daniel
>
> [*] https://blog.linuxplumbersconf.org/2018/performance-mc/
>
--
Sincerely yours,
Mike.
On Tue, 4 Sep 2018, Daniel Jordan wrote:
> - Promoting huge page usage: With memory sizes becoming ever larger, huge
> pages are becoming more and more important to reduce TLB misses and the
> overhead of memory management itself--that is, to make the system scalable
> with the memory size. But there are still some remaining gaps that prevent
> huge pages from being deployed in some situations, such as huge page
> allocation latency and memory fragmentation.
You forgot the major issue that huge pages in the page cache are not
supported, and thus we have performance problems with fast NVMe drives,
which can now do 3 GB/s, rates that are only reachable with direct I/O
and huge pages.
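The pattern being described, roughly: direct I/O (bypassing the 4k page
cache) into a huge-page-backed buffer. A minimal userspace sketch, with
an illustrative device path and buffer size:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define BUF_SZ	(64UL << 20)		/* 64 MB per read */

int main(void)
{
	/* Illustrative device path; O_DIRECT bypasses the 4k page cache. */
	int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* Huge-page-backed buffer: MAP_HUGETLB needs a reserved hugetlb
	 * pool; fall back to a THP hint if the pool is empty. */
	void *buf = mmap(NULL, BUF_SZ, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (buf == MAP_FAILED) {
		buf = mmap(NULL, BUF_SZ, PROT_READ | PROT_WRITE,
			   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (buf == MAP_FAILED) {
			perror("mmap");
			return 1;
		}
		madvise(buf, BUF_SZ, MADV_HUGEPAGE);
	}

	ssize_t n = pread(fd, buf, BUF_SZ, 0);	/* page-aligned, so fine for O_DIRECT */
	printf("read %zd bytes\n", n);

	munmap(buf, BUF_SZ);
	close(fd);
	return 0;
}

The argument above is that buffered 4k reads at these rates burn the CPU
on per-page cache management; huge page support in the page cache would
remove the need for the O_DIRECT detour.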
IMHO the huge page issue is just a reflection of a certain hardware
manufacturer inflicting pain on its poor users for over a decade by not
supporting base page sizes larger than 4k. No such workarounds are needed
on platforms that support larger sizes; things just zoom along without
the contortions necessary to deal with huge pages, etc.
Can we come up with a 2M base page VM or something? We have possible
memory sizes of a couple TB now. That should give us a million or so 2M
pages to work with.
> - Reducing the number of users of mmap_sem: This semaphore is frequently
> used throughout the kernel. In order to facilitate scaling this longstanding
> bottleneck, these uses should be documented and unnecessary users should be
> fixed.
Large page sizes also reduce contention there.
> If you haven't already done so, please let us know if you are interested in
> attending, or have suggestions for other attendees.
Certainly interested in attending, but this overlaps with Supercomputing
2018 in Dallas, Texas...
On 05/09/2018 17:10, Christopher Lameter wrote:
> On Tue, 4 Sep 2018, Daniel Jordan wrote:
>
>> - Promoting huge page usage: With memory sizes becoming ever larger, huge
>> pages are becoming more and more important to reduce TLB misses and the
>> overhead of memory management itself--that is, to make the system scalable
>> with the memory size. But there are still some remaining gaps that prevent
>> huge pages from being deployed in some situations, such as huge page
>> allocation latency and memory fragmentation.
>
> You forgot the major issue that huge pages in the page cache are not
> supported and thus we have performance issues with fast NVME drives that
> are now able to do 3Gbytes per sec that are only possible to reach with
> directio and huge pages.
>
> IMHO the huge page issue is just the reflection of a certain hardware
> manufacturer inflicting pain for over a decade on its poor users by not
> supporting larger base page sizes than 4k. No such workarounds needed on
> platforms that support large sizes. Things just zoom along without
> contortions necessary to deal with huge pages etc.
>
> Can we come up with a 2M base page VM or something? We have possible
> memory sizes of a couple TB now. That should give us a million or so 2M
> pages to work with.
>
>> - Reducing the number of users of mmap_sem: This semaphore is frequently
>> used throughout the kernel. In order to facilitate scaling this longstanding
>> bottleneck, these uses should be documented and unnecessary users should be
>> fixed.
>
>
> Large page sizes also reduce contention there.
That's true for the page fault path, but for a process's actions
manipulating its memory layout (mmap, munmap, madvise, mprotect), the
impact is minimal unless the code has to manipulate the page tables.
>> If you haven't already done so, please let us know if you are interested in
>> attending, or have suggestions for other attendees.
>
> Certainly interested in attending but this overlaps supercomputing 2018 in
> Dallas Texas...
>
On Wed, 5 Sep 2018, Laurent Dufour wrote:
> > Large page sizes also reduce contention there.
>
> That's true for the page fault path, but for process's actions manipulating the
> memory process's layout (mmap,munmap,madvise,mprotect) the impact is minimal
> unless the code has to manipulate the page tables.
Well, if you compare having to operate on 4k instead of 64k, the impact
is 16x for larger memory ranges. For smaller operations this may not be
that significant, but then I thought we were talking about large areas
of memory.
On 9/5/18 2:38 AM, Mike Rapoport wrote:
> On Tue, Sep 04, 2018 at 05:28:13PM -0400, Daniel Jordan wrote:
>> Pavel Tatashin, Ying Huang, and I are excited to be organizing a performance and scalability microconference this year at Plumbers[*], which is happening in Vancouver this year. The microconference is scheduled for the morning of the second day (Wed, Nov 14).
>>
>> We have a preliminary agenda and a list of confirmed and interested attendees (cc'ed), and are seeking more of both!
>>
>> Some of the items on the agenda as it stands now are:
>>
>> - Promoting huge page usage: With memory sizes becoming ever larger, huge pages are becoming more and more important to reduce TLB misses and the overhead of memory management itself--that is, to make the system scalable with the memory size. But there are still some remaining gaps that prevent huge pages from being deployed in some situations, such as huge page allocation latency and memory fragmentation.
>>
>> - Reducing the number of users of mmap_sem: This semaphore is frequently used throughout the kernel. In order to facilitate scaling this longstanding bottleneck, these uses should be documented and unnecessary users should be fixed.
>>
>> - Parallelizing cpu-intensive kernel work: Resolve problems of past approaches including extra threads interfering with other processes, playing well with power management, and proper cgroup accounting for the extra threads. Bonus topic: proper accounting of workqueue threads running on behalf of cgroups.
>>
>> - Preserving userland during kexec with a hibernation-like mechanism.
>
> Just some crazy idea: have you considered using checkpoint-restore as a
> replacement or an addition to hibernation?
Hi Mike,
Yes, that is one way I was thinking about: use the kernel to pass the
application's stored state to the new kernel in pmem. The only problem is
that we waste memory: when there is not enough system memory to copy and
pass the application state to the new kernel, this scheme won't work.
Think about a DB that occupies 80% of system memory that we want to
checkpoint/restore.
So, we need another way, where the preserved memory is the memory that
is actually used by the applications, not a copy. One easy way is to
give each application with a large state that is expensive to recreate a
persistent memory device, and let the application keep its state on that
device (say /dev/pmemN). The only problem is that memory on that device
must be accessible just as fast as regular memory, without any file
system overhead and hopefully without the need for DAX.
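A very rough illustration of that usage model (the device name, size, and
on-device layout here are hypothetical, and surviving the kexec is exactly
the open problem):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define STATE_SZ (1UL << 30)	/* 1 GB of preserved application state */

int main(void)
{
	int fd = open("/dev/pmem0", O_RDWR);	/* hypothetical per-app device */
	if (fd < 0) {
		perror("open /dev/pmem0");
		return 1;
	}

	/* The mapping itself is the state: nothing is copied at
	 * "checkpoint" time, and after kexec the new instance would
	 * simply remap the same device and continue. */
	char *state = mmap(NULL, STATE_SZ, PROT_READ | PROT_WRITE,
			   MAP_SHARED, fd, 0);
	if (state == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	if (memcmp(state, "STATEv1", 7) != 0)
		memcpy(state, "STATEv1", 7);	/* first run: initialize */

	/* ... operate directly on the mapped state ... */

	msync(state, STATE_SZ, MS_SYNC);	/* flush dirty pages to the device */
	munmap(state, STATE_SZ);
	close(fd);
	return 0;
}

The open question is how to make such a mapping as cheap as anonymous
memory (no filesystem overhead, ideally no DAX dependency) while still
having it survive the kexec.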
I just want to get some ideas of what people are thinking about this,
and what would be the best way to achieve it.
Pavel
>
>> These center around our interests, but having lots of topics to choose from ensures we cover what's most important to the community, so we would like to hear about additional topics and extensions to those listed here. This includes, but is certainly not limited to, work in progress that would benefit from in-person discussion, real-world performance problems, and experimental and academic work.
>>
>> If you haven't already done so, please let us know if you are interested in attending, or have suggestions for other attendees.
>>
>> Thanks,
>> Daniel
>>
>> [*] https://blog.linuxplumbersconf.org/2018/performance-mc/
>>
>
On Wed, 5 Sep 2018, Laurent Dufour wrote:
> On 05/09/2018 17:10, Christopher Lameter wrote:
> > Large page sizes also reduce contention there.
>
> That's true for the page fault path, but for process's actions manipulating the
> memory process's layout (mmap,munmap,madvise,mprotect) the impact is minimal
> unless the code has to manipulate the page tables.
And how exactly are you going to do any of those operations _without_
manipulating the page tables?
Thanks,
tglx
Hi, Christopher,
Christopher Lameter <[email protected]> writes:
> On Tue, 4 Sep 2018, Daniel Jordan wrote:
>
>> - Promoting huge page usage: With memory sizes becoming ever larger, huge
>> pages are becoming more and more important to reduce TLB misses and the
>> overhead of memory management itself--that is, to make the system scalable
>> with the memory size. But there are still some remaining gaps that prevent
>> huge pages from being deployed in some situations, such as huge page
>> allocation latency and memory fragmentation.
>
> You forgot the major issue that huge pages in the page cache are not
> supported and thus we have performance issues with fast NVME drives that
> are now able to do 3Gbytes per sec that are only possible to reach with
> directio and huge pages.
Yes. That is an important gap for huge pages. Although we have huge
page cache support for tmpfs, we lack it for normal file systems.
> IMHO the huge page issue is just the reflection of a certain hardware
> manufacturer inflicting pain for over a decade on its poor users by not
> supporting larger base page sizes than 4k. No such workarounds needed on
> platforms that support large sizes. Things just zoom along without
> contortions necessary to deal with huge pages etc.
>
> Can we come up with a 2M base page VM or something? We have possible
> memory sizes of a couple TB now. That should give us a million or so 2M
> pages to work with.
That sounds like a good idea. I don't know whether anyone has tried it.
>> - Reducing the number of users of mmap_sem: This semaphore is frequently
>> used throughout the kernel. In order to facilitate scaling this longstanding
>> bottleneck, these uses should be documented and unnecessary users should be
>> fixed.
>
>
> Large page sizes also reduce contention there.
Yes.
>> If you haven't already done so, please let us know if you are interested in
>> attending, or have suggestions for other attendees.
>
> Certainly interested in attending but this overlaps supercomputing 2018 in
> Dallas Texas...
Sorry to hear that. It appears that there are too many conferences in
November...
Best Regards,
Huang, Ying
Hi,
On Wed, Sep 05, 2018 at 07:51:34PM +0000, Pasha Tatashin wrote:
>
> On 9/5/18 2:38 AM, Mike Rapoport wrote:
> > On Tue, Sep 04, 2018 at 05:28:13PM -0400, Daniel Jordan wrote:
> >> Pavel Tatashin, Ying Huang, and I are excited to be organizing a performance and scalability microconference this year at Plumbers[*], which is happening in Vancouver this year. The microconference is scheduled for the morning of the second day (Wed, Nov 14).
> >>
> >> We have a preliminary agenda and a list of confirmed and interested attendees (cc'ed), and are seeking more of both!
> >>
> >> Some of the items on the agenda as it stands now are:
> >>
> >> - Promoting huge page usage: With memory sizes becoming ever larger, huge pages are becoming more and more important to reduce TLB misses and the overhead of memory management itself--that is, to make the system scalable with the memory size. But there are still some remaining gaps that prevent huge pages from being deployed in some situations, such as huge page allocation latency and memory fragmentation.
> >>
> >> - Reducing the number of users of mmap_sem: This semaphore is frequently used throughout the kernel. In order to facilitate scaling this longstanding bottleneck, these uses should be documented and unnecessary users should be fixed.
> >>
> >> - Parallelizing cpu-intensive kernel work: Resolve problems of past approaches including extra threads interfering with other processes, playing well with power management, and proper cgroup accounting for the extra threads. Bonus topic: proper accounting of workqueue threads running on behalf of cgroups.
> >>
> >> - Preserving userland during kexec with a hibernation-like mechanism.
> >
> > Just some crazy idea: have you considered using checkpoint-restore as a
> > replacement or an addition to hibernation?
>
> Hi Mike,
>
> Yes, this is one way I was thinking about, and use kernel to pass the
> application stored state to new kernel in pmem. The only problem is that
> we waste memory: when there is not enough system memory to copy and pass
> application state to new kernel this scheme won't work. Think about DB
> that occupies 80% of system memory and we want to checkpoint/restore it.
>
> So, we need to have another way, where the preserved memory is the
> memory that is actually used by the applications, not copied. One easy
> way is to give each application that has a large state that is expensive
> to recreate a persistent memory device and let applications to keep its
> state on that device (say /dev/pmemN). The only problem is that memory
> on that device must be accessible just as fast as regular memory without
> any file system overhead and hopefully without need for DAX.
Like hibernation, checkpoint persists the state, so it won't require
additional memory. At restore time, the memory state is recreated from
the persistent checkpoint; that is of course slower than regular memory
access, but it won't differ much from resuming from hibernation.
Maybe it would be possible to preserve application state if we extend
suspend-to-RAM -> resume with the ability to load a new kernel during
resume...
> I just want to get some ideas of what people are thinking about this,
> and what would be the best way to achieve it.
>
> Pavel
>
>
> >
> >> These center around our interests, but having lots of topics to choose from ensures we cover what's most important to the community, so we would like to hear about additional topics and extensions to those listed here. This includes, but is certainly not limited to, work in progress that would benefit from in-person discussion, real-world performance problems, and experimental and academic work.
> >>
> >> If you haven't already done so, please let us know if you are interested in attending, or have suggestions for other attendees.
> >>
> >> Thanks,
> >> Daniel
> >>
> >> [*] https://blog.linuxplumbersconf.org/2018/performance-mc/
> >>
> >
--
Sincerely yours,
Mike.
On 06/09/2018 01:01, Thomas Gleixner wrote:
> On Wed, 5 Sep 2018, Laurent Dufour wrote:
>> On 05/09/2018 17:10, Christopher Lameter wrote:
>>> Large page sizes also reduce contention there.
>>
>> That's true for the page fault path, but for process's actions manipulating the
>> memory process's layout (mmap,munmap,madvise,mprotect) the impact is minimal
>> unless the code has to manipulate the page tables.
>
> And how exactly are you going to do any of those operations _without_
> manipulating the page tables?
I agree that at some point the page tables have to be manipulated, and
this is mostly done under the protection of the page table locks; does
the mmap_sem still need to be held then?
I was thinking about all the processing done on the VMAs, accounting,
etc. That part, which usually doesn't manipulate the page tables, is
less dependent on the underlying page size.
But I agree that at some point in the processing the page tables are
manipulated, and dealing with larger pages is better there.
Thanks,
Laurent.
On Thu, 6 Sep 2018, Huang, Ying wrote:
> > Certainly interested in attending but this overlaps supercomputing 2018 in
> > Dallas Texas...
>
> Sorry to know this. It appears that there are too many conferences in
> November...
I will try to get to it in the middle of SC18, the RDMA track and the
RISC-V track... Sign me up.
On 09/05/2018 06:58 PM, Huang, Ying wrote:
> Hi, Christopher,
>
> Christopher Lameter <[email protected]> writes:
>
>> On Tue, 4 Sep 2018, Daniel Jordan wrote:
>>
>>> - Promoting huge page usage: With memory sizes becoming ever larger, huge
>>> pages are becoming more and more important to reduce TLB misses and the
>>> overhead of memory management itself--that is, to make the system scalable
>>> with the memory size. But there are still some remaining gaps that prevent
>>> huge pages from being deployed in some situations, such as huge page
>>> allocation latency and memory fragmentation.
>>
>> You forgot the major issue that huge pages in the page cache are not
>> supported and thus we have performance issues with fast NVME drives that
>> are now able to do 3Gbytes per sec that are only possible to reach with
>> directio and huge pages.
>
> Yes. That is an important gap for huge page. Although we have huge
> page cache support for tmpfs, we lacks that for normal file systems.
>
>> IMHO the huge page issue is just the reflection of a certain hardware
>> manufacturer inflicting pain for over a decade on its poor users by not
>> supporting larger base page sizes than 4k. No such workarounds needed on
>> platforms that support large sizes. Things just zoom along without
>> contortions necessary to deal with huge pages etc.
>>
>> Can we come up with a 2M base page VM or something? We have possible
>> memory sizes of a couple TB now. That should give us a million or so 2M
>> pages to work with.
>
> That sounds a good idea. Don't know whether someone has tried this.
IIRC, Hugh Dickins and some others at Google tried going down this path.
There was a brief discussion at LSF/MM. It is something I too would like
to explore in my spare time.
--
Mike Kravetz
On Thu, Sep 6, 2018 at 2:36 PM Mike Kravetz <[email protected]> wrote:
>
> On 09/05/2018 06:58 PM, Huang, Ying wrote:
> > Hi, Christopher,
> >
> > Christopher Lameter <[email protected]> writes:
> >
> >> On Tue, 4 Sep 2018, Daniel Jordan wrote:
> >>
> >>> - Promoting huge page usage: With memory sizes becoming ever larger, huge
> >>> pages are becoming more and more important to reduce TLB misses and the
> >>> overhead of memory management itself--that is, to make the system scalable
> >>> with the memory size. But there are still some remaining gaps that prevent
> >>> huge pages from being deployed in some situations, such as huge page
> >>> allocation latency and memory fragmentation.
> >>
> >> You forgot the major issue that huge pages in the page cache are not
> >> supported and thus we have performance issues with fast NVME drives that
> >> are now able to do 3Gbytes per sec that are only possible to reach with
> >> directio and huge pages.
> >
> > Yes. That is an important gap for huge page. Although we have huge
> > page cache support for tmpfs, we lacks that for normal file systems.
> >
> >> IMHO the huge page issue is just the reflection of a certain hardware
> >> manufacturer inflicting pain for over a decade on its poor users by not
> >> supporting larger base page sizes than 4k. No such workarounds needed on
> >> platforms that support large sizes. Things just zoom along without
> >> contortions necessary to deal with huge pages etc.
> >>
> >> Can we come up with a 2M base page VM or something? We have possible
> >> memory sizes of a couple TB now. That should give us a million or so 2M
> >> pages to work with.
> >
> > That sounds a good idea. Don't know whether someone has tried this.
>
> IIRC, Hugh Dickins and some others at Google tried going down this path.
> There was a brief discussion at LSF/MM. It is something I too would like
> to explore in my spare time.
Almost: I never tried that path myself, but mentioned that Greg Thelen had.
Hugh
Christopher Lameter <[email protected]> writes:
> On Thu, 6 Sep 2018, Huang, Ying wrote:
>
>> > Certainly interested in attending but this overlaps supercomputing 2018 in
>> > Dallas Texas...
>>
>> Sorry to know this. It appears that there are too many conferences in
>> November...
>
> I will try to get to it in the middle of SC18, RDMA track and RISC V
> track... Sign me up.
Great to know! Looking forward to meeting you at the microconference.
Best Regards,
Huang, Ying
On 9/4/18 2:28 PM, Daniel Jordan wrote:
> Pavel Tatashin, Ying Huang, and I are excited to be organizing a performance and scalability microconference this year at Plumbers[*], which is happening in Vancouver this year. The microconference is scheduled for the morning of the second day (Wed, Nov 14).
>
> We have a preliminary agenda and a list of confirmed and interested attendees (cc'ed), and are seeking more of both!
>
> Some of the items on the agenda as it stands now are:
>
> - Promoting huge page usage: With memory sizes becoming ever larger, huge pages are becoming more and more important to reduce TLB misses and the overhead of memory management itself--that is, to make the system scalable with the memory size. But there are still some remaining gaps that prevent huge pages from being deployed in some situations, such as huge page allocation latency and memory fragmentation.
>
> - Reducing the number of users of mmap_sem: This semaphore is frequently used throughout the kernel. In order to facilitate scaling this longstanding bottleneck, these uses should be documented and unnecessary users should be fixed.
>
> - Parallelizing cpu-intensive kernel work: Resolve problems of past approaches including extra threads interfering with other processes, playing well with power management, and proper cgroup accounting for the extra threads. Bonus topic: proper accounting of workqueue threads running on behalf of cgroups.
>
> - Preserving userland during kexec with a hibernation-like mechanism.
>
> These center around our interests, but having lots of topics to choose from ensures we cover what's most important to the community, so we would like to hear about additional topics and extensions to those listed here. This includes, but is certainly not limited to, work in progress that would benefit from in-person discussion, real-world performance problems, and experimental and academic work.
>
> If you haven't already done so, please let us know if you are interested in attending, or have suggestions for other attendees.
Hi Daniel and all,
I'm interested in the first 3 of those 4 topics, so if it doesn't conflict with HMM topics or
fix-gup-with-dma topics, I'd like to attend. GPUs generally need to access large chunks of
memory, and that includes migrating (dma-copying) pages around.
So for example a multi-threaded migration of huge pages between normal RAM and GPU memory is an
intriguing direction (and I realize that it's a well-known topic, already). Doing that properly
(how many threads to use?) seems like it requires scheduler interaction.
It's also interesting that there are two main huge page systems (THP and Hugetlbfs), and I sometimes
wonder the obvious thing to wonder: are these sufficiently different to warrant remaining separate,
long-term? Yes, I realize they're quite different in some ways, but still, one wonders. :)
thanks,
--
John Hubbard
NVIDIA
On 09/08/2018 12:13 AM, John Hubbard wrote:
>
> Hi Daniel and all,
>
> I'm interested in the first 3 of those 4 topics, so if it doesn't conflict with HMM topics or
> fix-gup-with-dma topics, I'd like to attend. GPUs generally need to access large chunks of
> memory, and that includes migrating (dma-copying) pages around.
>
> So for example a multi-threaded migration of huge pages between normal RAM and GPU memory is an
> intriguing direction (and I realize that it's a well-known topic, already). Doing that properly
> (how many threads to use?) seems like it requires scheduler interaction.
>
> It's also interesting that there are two main huge page systems (THP and Hugetlbfs), and I sometimes
> wonder the obvious thing to wonder: are these sufficiently different to warrant remaining separate,
> long-term? Yes, I realize they're quite different in some ways, but still, one wonders. :)
One major difference between hugetlbfs and THP is that the former has to
be explicitly managed by the applications that use it, whereas the latter
is done automatically without the applications being aware that THP is
being used at all. Performance-wise, THP may or may not increase
application performance depending on the exact memory access pattern,
though the chance is usually higher that an application will benefit
than suffer from it.
If an application knows what it is doing, using hugetlbfs can boost
performance more than can ever be achieved by THP. Many large enterprise
applications, like Oracle DB, use hugetlbfs and explicitly disable THP.
So unless THP can improve its performance to a level comparable to
hugetlbfs, I don't see the latter going away.
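For concreteness, the two usage models differ roughly as in this minimal
userspace sketch; the size is illustrative and the hugetlb mapping only
succeeds if the vm.nr_hugepages pool has been set up:

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

#define SZ (1UL << 30)	/* 1 GB */

int main(void)
{
	/* Explicit model: the application asks for hugetlb pages up front
	 * and fails immediately if the reserved pool cannot satisfy it. */
	void *a = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

	/* Transparent model: a normal mapping plus a hint; the fault path
	 * and khugepaged may or may not back it with huge pages later. */
	void *b = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (b != MAP_FAILED)
		madvise(b, SZ, MADV_HUGEPAGE);

	printf("hugetlb mapping: %s, THP-hinted mapping: %s\n",
	       a == MAP_FAILED ? "failed" : "ok",
	       b == MAP_FAILED ? "failed" : "ok");
	return 0;
}

The guaranteed-or-fail behavior of the first mapping is what
latency-sensitive applications rely on, and what THP does not promise
today.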
Cheers,
Longman
On Mon, 10 Sep 2018, Waiman Long wrote:
>On 09/08/2018 12:13 AM, John Hubbard wrote:
>>
>> Hi Daniel and all,
>>
>> I'm interested in the first 3 of those 4 topics, so if it doesn't conflict with HMM topics or
>> fix-gup-with-dma topics, I'd like to attend. GPUs generally need to access large chunks of
>> memory, and that includes migrating (dma-copying) pages around.
>>
>> So for example a multi-threaded migration of huge pages between normal RAM and GPU memory is an
>> intriguing direction (and I realize that it's a well-known topic, already). Doing that properly
>> (how many threads to use?) seems like it requires scheduler interaction.
>>
>> It's also interesting that there are two main huge page systems (THP and Hugetlbfs), and I sometimes
>> wonder the obvious thing to wonder: are these sufficiently different to warrant remaining separate,
>> long-term? Yes, I realize they're quite different in some ways, but still, one wonders. :)
>
>One major difference between hugetlbfs and THP is that the former has to
>be explicitly managed by the applications that use it whereas the latter
>is done automatically without the applications being aware that THP is
>being used at all. Performance wise, THP may or may not increase
>application performance depending on the exact memory access pattern,
>though the chance is usually higher that an application will benefit
>than suffer from it.
>
>If an application know what it is doing, using hughtblfs can boost
>performance more than it can ever achieved by THP. Many large enterprise
>applications, like Oracle DB, are using hugetlbfs and explicitly disable
>THP. So unless THP can improve its performance to a level that is
>comparable to hugetlbfs, I won't see the later going away.
Yep, there are a few non-trivial workloads out there that flat out
discourage THP, e.g. Redis, to avoid latency issues.
Thanks,
Davidlohr
On 9/10/18 10:20 AM, Davidlohr Bueso wrote:
> On Mon, 10 Sep 2018, Waiman Long wrote:
>> On 09/08/2018 12:13 AM, John Hubbard wrote:
[...]
>>> It's also interesting that there are two main huge page systems (THP and Hugetlbfs), and I sometimes
>>> wonder the obvious thing to wonder: are these sufficiently different to warrant remaining separate,
>>> long-term? Yes, I realize they're quite different in some ways, but still, one wonders. :)
>>
>> One major difference between hugetlbfs and THP is that the former has to
>> be explicitly managed by the applications that use it whereas the latter
>> is done automatically without the applications being aware that THP is
>> being used at all. Performance wise, THP may or may not increase
>> application performance depending on the exact memory access pattern,
>> though the chance is usually higher that an application will benefit
>> than suffer from it.
>>
>> If an application know what it is doing, using hughtblfs can boost
>> performance more than it can ever achieved by THP. Many large enterprise
>> applications, like Oracle DB, are using hugetlbfs and explicitly disable
>> THP. So unless THP can improve its performance to a level that is
>> comparable to hugetlbfs, I won't see the later going away.
>
> Yep, there are a few non-trivial workloads out there that flat out discourage
> thp, ie: redis to avoid latency issues.
>
Yes, the need for guaranteed, available-now huge pages in some cases is
understood. That's not quite the same as saying that there have to be two
different subsystems, though. Nor does it even necessarily imply that the
pool has to be reserved in exactly the same way hugetlbfs does it.
So I'm wondering if THP behavior can be made to mimic hugetlbfs closely
enough (perhaps with another option, in addition to "always", "never",
and "madvise") that we could just use THP in all cases. But the
"transparent" could become a sliding scale that goes all the way down to
"opaque" (hugetlbfs behavior).
thanks,
--
John Hubbard
NVIDIA
On 9/10/18 1:34 PM, John Hubbard wrote:
> On 9/10/18 10:20 AM, Davidlohr Bueso wrote:
>> On Mon, 10 Sep 2018, Waiman Long wrote:
>>> On 09/08/2018 12:13 AM, John Hubbard wrote:
> [...]
>>>> It's also interesting that there are two main huge page systems (THP and Hugetlbfs), and I sometimes
>>>> wonder the obvious thing to wonder: are these sufficiently different to warrant remaining separate,
>>>> long-term? Yes, I realize they're quite different in some ways, but still, one wonders. :)
>>>
>>> One major difference between hugetlbfs and THP is that the former has to
>>> be explicitly managed by the applications that use it whereas the latter
>>> is done automatically without the applications being aware that THP is
>>> being used at all. Performance wise, THP may or may not increase
>>> application performance depending on the exact memory access pattern,
>>> though the chance is usually higher that an application will benefit
>>> than suffer from it.
>>>
>>> If an application know what it is doing, using hughtblfs can boost
>>> performance more than it can ever achieved by THP. Many large enterprise
>>> applications, like Oracle DB, are using hugetlbfs and explicitly disable
>>> THP. So unless THP can improve its performance to a level that is
>>> comparable to hugetlbfs, I won't see the later going away.
>>
>> Yep, there are a few non-trivial workloads out there that flat out discourage
>> thp, ie: redis to avoid latency issues.
>>
>
> Yes, the need for guaranteed, available-now huge pages in some cases is
> understood. That's not the quite same as saying that there have to be two different
> subsystems, though. Nor does it even necessarily imply that the pool has to be
> reserved in the same way as hugetlbfs does it...exactly.
>
> So I'm wondering if THP behavior can be made to mimic hugetlbfs enough (perhaps
> another option, in addition to "always, never, madvise") that we could just use
> THP in all cases. But the "transparent" could become a sliding scale that could
> go all the way down to "opaque" (hugetlbfs behavior).
Leaving the interface aside, the idea that we could deduplicate redundant parts of the hugetlbfs and THP implementations, without user-visible change, seems promising.
On 9/8/18 12:13 AM, John Hubbard wrote:
> I'm interested in the first 3 of those 4 topics, so if it doesn't conflict with HMM topics or
> fix-gup-with-dma topics, I'd like to attend.
Great, we'll add your name to the list.
> GPUs generally need to access large chunks of
> memory, and that includes migrating (dma-copying) pages around.
>
> So for example a multi-threaded migration of huge pages between normal RAM and GPU memory is an
> intriguing direction (and I realize that it's a well-known topic, already). Doing that properly
> (how many threads to use?) seems like it requires scheduler interaction.
Yes, in past discussions of multithreading kernel work, there has been talk of a scheduler API that could answer "are there idle CPUs we could use to multithread?".
Instead of adding an interface, though, we could just let the scheduler do something it already knows how to do: prioritize.
Additional threads used to parallelize kernel work could run at the lowest priority (i.e. MAX_NICE). If the machine is heavily loaded, these extra threads simply won't run and other workloads on the system will be unaffected.
There's the issue of priority inversion if one or more of those extra threads get started and are then preempted by normal-priority tasks midway through. But the main thread doing the job can just will its priority to each worker in turn once it has finished its own part, so at most one thread will be active on a heavily loaded system, again leaving other workloads undisturbed.
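A minimal kernel-side sketch of the MAX_NICE idea (the helper function and
names are made up; this is not from any posted series):

#include <linux/err.h>
#include <linux/kthread.h>
#include <linux/sched.h>

static int job_helper_fn(void *data)
{
	/* ... process one chunk of the parallelized work ... */
	return 0;
}

/* Hypothetical: spawn nr extra helpers that only run on idle CPUs. */
static void spawn_helpers(void *job, int nr)
{
	int i;

	for (i = 0; i < nr; i++) {
		struct task_struct *t;

		t = kthread_create(job_helper_fn, job, "job_helper/%d", i);
		if (IS_ERR(t))
			break;
		/* Lowest priority: on a busy machine these never run, so
		 * other workloads are undisturbed. */
		set_user_nice(t, MAX_NICE);
		wake_up_process(t);
	}
}

The priority-inversion handling described above would sit on top of this:
the main thread, once done with its own chunk, bumps one stalled helper at
a time back up to its own priority.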
On 09/10/2018 08:29 PM, Daniel Jordan wrote:
> On 9/10/18 1:34 PM, John Hubbard wrote:
>> On 9/10/18 10:20 AM, Davidlohr Bueso wrote:
>>> On Mon, 10 Sep 2018, Waiman Long wrote:
>>>> On 09/08/2018 12:13 AM, John Hubbard wrote:
>> [...]
>>>>> It's also interesting that there are two main huge page systems
>>>>> (THP and Hugetlbfs), and I sometimes
>>>>> wonder the obvious thing to wonder: are these sufficiently
>>>>> different to warrant remaining separate,
>>>>> long-term? Yes, I realize they're quite different in some ways,
>>>>> but still, one wonders. :)
>>>>
>>>> One major difference between hugetlbfs and THP is that the former
>>>> has to
>>>> be explicitly managed by the applications that use it whereas the
>>>> latter
>>>> is done automatically without the applications being aware that THP is
>>>> being used at all. Performance wise, THP may or may not increase
>>>> application performance depending on the exact memory access pattern,
>>>> though the chance is usually higher that an application will benefit
>>>> than suffer from it.
>>>>
>>>> If an application know what it is doing, using hughtblfs can boost
>>>> performance more than it can ever achieved by THP. Many large
>>>> enterprise
>>>> applications, like Oracle DB, are using hugetlbfs and explicitly
>>>> disable
>>>> THP. So unless THP can improve its performance to a level that is
>>>> comparable to hugetlbfs, I won't see the later going away.
>>>
>>> Yep, there are a few non-trivial workloads out there that flat out
>>> discourage
>>> thp, ie: redis to avoid latency issues.
>>>
>>
>> Yes, the need for guaranteed, available-now huge pages in some cases is
>> understood. That's not the quite same as saying that there have to be
>> two different
>> subsystems, though. Nor does it even necessarily imply that the pool
>> has to be
>> reserved in the same way as hugetlbfs does it...exactly.
>>
>> So I'm wondering if THP behavior can be made to mimic hugetlbfs
>> enough (perhaps
>> another option, in addition to "always, never, madvise") that we
>> could just use
>> THP in all cases. But the "transparent" could become a sliding scale
>> that could
>> go all the way down to "opaque" (hugetlbfs behavior).
>
> Leaving the interface aside, the idea that we could deduplicate
> redundant parts of the hugetlbfs and THP implementations, without
> user-visible change, seems promising.
I think that is a good idea, if it can be done.
Thanks,
Longman