In some cases, the userfaultfd mechanism should just deliver a SIGBUS signal
to the faulting process instead of a page-fault event. Dealing with
page-fault events using a monitor thread can be an overhead in these
cases. For example, applications like a database could use the signaling
mechanism for robustness purposes.

A database uses hugetlbfs for performance reasons. Files on the hugetlbfs
filesystem are created and huge pages allocated using the fallocate() API.
Pages are deallocated/freed using fallocate() hole punching support.
These files are mmapped and accessed by many processes as shared memory.
The database keeps track of which offsets in the hugetlbfs file have
pages allocated.

Any access to a mapped address over holes in the file, which can occur due
to bugs in the application, is considered invalid, and the process is
expected to simply receive a SIGBUS. However, currently when a hole in the
file is accessed via the mapped address, kernel/mm attempts to
automatically allocate a page at page fault time, implicitly filling the
hole in the file. This may not be the desired behavior for applications
like the database that want to explicitly manage page allocations of
hugetlbfs files.

Using the userfaultfd mechanism with this support to get a signal, the
database application can prevent pages from being allocated implicitly
when processes access a mapped address over holes in the file.

This patch adds a feature to userfaultfd to request a SIGBUS signal.

See the following for the previous discussion about the database
requirement leading to this proposal, as suggested by Andrea.

http://www.spinics.net/lists/linux-mm/msg129224.html

Signed-off-by: Prakash <[email protected]>
---
fs/userfaultfd.c | 5 +++++
include/uapi/linux/userfaultfd.h | 10 +++++++++-
2 files changed, 14 insertions(+), 1 deletion(-)
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 1d622f2..5686d6d2 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -371,6 +371,11 @@ int handle_userfault(struct vm_fault *vmf, unsigned long reason)
 	VM_BUG_ON(reason & ~(VM_UFFD_MISSING|VM_UFFD_WP));
 	VM_BUG_ON(!(reason & VM_UFFD_MISSING) ^ !!(reason & VM_UFFD_WP));
 
+	if (ctx->features & UFFD_FEATURE_SIGBUS) {
+		goto out;
+	}
+
 	/*
 	 * If it's already released don't get it. This avoids to loop
 	 * in __get_user_pages if userfaultfd_release waits on the
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 3b05953..d39d5db 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -23,7 +23,8 @@
 			   UFFD_FEATURE_EVENT_REMOVE |	\
 			   UFFD_FEATURE_EVENT_UNMAP |		\
 			   UFFD_FEATURE_MISSING_HUGETLBFS |	\
-			   UFFD_FEATURE_MISSING_SHMEM)
+			   UFFD_FEATURE_MISSING_SHMEM |		\
+			   UFFD_FEATURE_SIGBUS)
 #define UFFD_API_IOCTLS				\
 	((__u64)1 << _UFFDIO_REGISTER |		\
 	 (__u64)1 << _UFFDIO_UNREGISTER |	\
@@ -153,6 +154,12 @@ struct uffdio_api {
 	 * UFFD_FEATURE_MISSING_SHMEM works the same as
 	 * UFFD_FEATURE_MISSING_HUGETLBFS, but it applies to shmem
 	 * (i.e. tmpfs and other shmem based APIs).
+	 *
+	 * UFFD_FEATURE_SIGBUS feature means no page-fault
+	 * (UFFD_EVENT_PAGEFAULT) event will be delivered, instead
+	 * a SIGBUS signal will be sent to the faulting process.
+	 * The application process can enable this behavior by adding
+	 * it to uffdio_api.features.
 	 */
 #define UFFD_FEATURE_PAGEFAULT_FLAG_WP		(1<<0)
 #define UFFD_FEATURE_EVENT_FORK			(1<<1)
@@ -161,6 +168,7 @@ struct uffdio_api {
 #define UFFD_FEATURE_MISSING_HUGETLBFS		(1<<4)
 #define UFFD_FEATURE_MISSING_SHMEM		(1<<5)
 #define UFFD_FEATURE_EVENT_UNMAP		(1<<6)
+#define UFFD_FEATURE_SIGBUS			(1<<7)
 	__u64 features;
 
 	__u64 ioctls;
--
2.7.4
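
For readers following the patch above, here is a minimal userspace sketch of
how an application might request the proposed feature through
uffdio_api.features. It is an illustration under assumptions, not part of
the patch: the helper names sigbus_api_request() and open_uffd_sigbus() are
made up, and the fallback value of UFFD_FEATURE_SIGBUS is simply copied from
the diff for headers that do not define it yet.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for syscall() */
#endif
#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

/* Assumption: bit value copied from the patch; older headers lack it. */
#ifndef UFFD_FEATURE_SIGBUS
#define UFFD_FEATURE_SIGBUS (1 << 7)
#endif

/* Hypothetical helper: build a UFFDIO_API request asking for SIGBUS. */
static struct uffdio_api sigbus_api_request(void)
{
	struct uffdio_api api;

	memset(&api, 0, sizeof(api));
	api.api = UFFD_API;
	api.features = UFFD_FEATURE_SIGBUS;
	return api;
}

/*
 * Hypothetical helper: open a userfaultfd and negotiate the feature.
 * Returns the descriptor, or -1 if userfaultfd or the feature is not
 * available (for example on a kernel without this patch, or when the
 * caller lacks permission).
 */
static int open_uffd_sigbus(void)
{
	struct uffdio_api api = sigbus_api_request();
	int fd = (int)syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);

	if (fd < 0)
		return -1;
	if (ioctl(fd, UFFDIO_API, &api) < 0 ||
	    !(api.features & UFFD_FEATURE_SIGBUS)) {
		close(fd);
		return -1;
	}
	return fd;
}
```

A caller would treat -1 as "fall back to the current behavior" rather than
as a fatal error, since the feature bit is only a proposal at this point in
the thread.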
This is a user-visible API, so let's CC the linux-api mailing list.
On Mon 26-06-17 12:46:13, Prakash Sangappa wrote:
> In some cases, userfaultfd mechanism should just deliver a SIGBUS signal
> to the faulting process, instead of the page-fault event. Dealing with
> page-fault event using a monitor thread can be an overhead in these
> cases. For example applications like the database could use the signaling
> mechanism for robustness purpose.
this is rather confusing. What is the reason that the monitor would be
slower than signal delivery and handling?
> Database uses hugetlbfs for performance reason. Files on hugetlbfs
> filesystem are created and huge pages allocated using fallocate() API.
> Pages are deallocated/freed using fallocate() hole punching support.
> These files are mmapped and accessed by many processes as shared memory.
> The database keeps track of which offsets in the hugetlbfs file have
> pages allocated.
>
> Any access to mapped address over holes in the file, which can occur due
> to bugs in the application, is considered invalid and expect the process
> to simply receive a SIGBUS. However, currently when a hole in the file is
> accessed via the mapped address, kernel/mm attempts to automatically
> allocate a page at page fault time, resulting in implicitly filling the
> hole in the file. This may not be the desired behavior for applications
> like the database that want to explicitly manage page allocations of
> hugetlbfs files.
So you register UFFD_FEATURE_SIGBUS on each region that you are unmapping
and then just let those offenders die?
> Using userfaultfd mechanism, with this support to get a signal, database
> application can prevent pages from being allocated implicitly when
> processes access mapped address over holes in the file.
>
> This patch adds the feature to request for a SIGBUS signal to userfaultfd
> mechanism.
>
> See following for previous discussion about the database requirement
> leading to this proposal as suggested by Andrea.
>
> http://www.spinics.net/lists/linux-mm/msg129224.html
Please make those requirements part of the changelog.
--
Michal Hocko
SUSE Labs
On Tue, Jun 27, 2017 at 09:06:43AM +0200, Michal Hocko wrote:
> This is a user-visible API, so let's CC the linux-api mailing list.
>
> On Mon 26-06-17 12:46:13, Prakash Sangappa wrote:
> > In some cases, userfaultfd mechanism should just deliver a SIGBUS signal
> > to the faulting process, instead of the page-fault event. Dealing with
> > page-fault event using a monitor thread can be an overhead in these
> > cases. For example applications like the database could use the signaling
> > mechanism for robustness purpose.
>
> this is rather confusing. What is the reason that the monitor would be
> slower than signal delivery and handling?
>
> > Database uses hugetlbfs for performance reason. Files on hugetlbfs
> > filesystem are created and huge pages allocated using fallocate() API.
> > Pages are deallocated/freed using fallocate() hole punching support.
> > These files are mmapped and accessed by many processes as shared memory.
> > The database keeps track of which offsets in the hugetlbfs file have
> > pages allocated.
> >
> > Any access to mapped address over holes in the file, which can occur due
> > to bugs in the application, is considered invalid and expect the process
> > to simply receive a SIGBUS. However, currently when a hole in the file is
> > accessed via the mapped address, kernel/mm attempts to automatically
> > allocate a page at page fault time, resulting in implicitly filling the
> > hole in the file. This may not be the desired behavior for applications
> > like the database that want to explicitly manage page allocations of
> > hugetlbfs files.
>
> So you register UFFD_FEATURE_SIGBUS on each region that you are unmapping
> and then just let those offenders die?
If I understand correctly, the database will create the mapping, then it'll
open userfaultfd and register those mappings with the userfault.
Afterwards, when the application accesses a hole userfault will cause
SIGBUS and the application will process it in whatever way it likes, e.g.
just die.
What I don't understand is why you won't use a userfault monitor process
that will take care of the page fault events.
It shouldn't be much overhead running it, and it can keep track of all the
userfault file descriptors for you, and it will allow more versatile error
handling than SIGBUS.
> > Using userfaultfd mechanism, with this support to get a signal, database
> > application can prevent pages from being allocated implicitly when
> > processes access mapped address over holes in the file.
> >
> > This patch adds the feature to request for a SIGBUS signal to userfaultfd
> > mechanism.
> >
> > See following for previous discussion about the database requirement
> > leading to this proposal as suggested by Andrea.
> >
> > http://www.spinics.net/lists/linux-mm/msg129224.html
>
> Please make those requirements part of the changelog.
--
Sincerely yours,
Mike.
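
To make the monitor alternative discussed above concrete, here is a rough
sketch of the per-event work a userfault monitor does: read one struct
uffd_msg from the userfaultfd descriptor and pick out the faulting address
of a page-fault event. The helper names fault_address() and
read_one_event() are hypothetical; what the monitor does with the address
afterwards (resolve the fault, log it, kill the offender) is left open here,
as it is in the thread.

```c
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

/*
 * Hypothetical helper: if msg is a page-fault event, store the faulting
 * address in *addr and return 1; otherwise return 0.
 */
static int fault_address(const struct uffd_msg *msg, uint64_t *addr)
{
	if (msg->event != UFFD_EVENT_PAGEFAULT)
		return 0;
	*addr = msg->arg.pagefault.address;
	return 1;
}

/*
 * Hypothetical helper: read exactly one event from the descriptor.
 * Returns 0 on success and -1 on error or short read. With O_NONBLOCK
 * set, a monitor would normally poll() the descriptor first.
 */
static int read_one_event(int uffd, struct uffd_msg *msg)
{
	ssize_t n = read(uffd, msg, sizeof(*msg));

	return n == (ssize_t)sizeof(*msg) ? 0 : -1;
}
```

The point of contention in the thread is not this code itself but who runs
it: every process needs either its own monitor thread or a way to hand its
descriptor to a shared monitor.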
On 6/27/17 12:06 AM, Michal Hocko wrote:
> This is a user-visible API, so let's CC the linux-api mailing list.
>
> On Mon 26-06-17 12:46:13, Prakash Sangappa wrote:
>> In some cases, userfaultfd mechanism should just deliver a SIGBUS signal
>> to the faulting process, instead of the page-fault event. Dealing with
>> page-fault event using a monitor thread can be an overhead in these
>> cases. For example applications like the database could use the signaling
>> mechanism for robustness purpose.
> this is rather confusing. What is the reason that the monitor would be
> slower than signal delivery and handling?
There are a large number of single-threaded database processes involved.
Each of these processes would require a monitor thread, which is considered
an overhead.
>
>> Database uses hugetlbfs for performance reason. Files on hugetlbfs
>> filesystem are created and huge pages allocated using fallocate() API.
>> Pages are deallocated/freed using fallocate() hole punching support.
>> These files are mmapped and accessed by many processes as shared memory.
>> The database keeps track of which offsets in the hugetlbfs file have
>> pages allocated.
>>
>> Any access to mapped address over holes in the file, which can occur due
>> to bugs in the application, is considered invalid and expect the process
>> to simply receive a SIGBUS. However, currently when a hole in the file is
>> accessed via the mapped address, kernel/mm attempts to automatically
>> allocate a page at page fault time, resulting in implicitly filling the
>> hole in the file. This may not be the desired behavior for applications
>> like the database that want to explicitly manage page allocations of
>> hugetlbfs files.
> So you register UFFD_FEATURE_SIGBUS on each region that you are unmapping
> and then just let those offenders die?
The database application will create the mapping and register it with
userfaultfd. Subsequently, any process that accesses the mapping over a
hole will receive SIGBUS and die.
>
>> Using userfaultfd mechanism, with this support to get a signal, database
>> application can prevent pages from being allocated implicitly when
>> processes access mapped address over holes in the file.
>>
>> This patch adds the feature to request for a SIGBUS signal to userfaultfd
>> mechanism.
>>
>> See following for previous discussion about the database requirement
>> leading to this proposal as suggested by Andrea.
>>
>> http://www.spinics.net/lists/linux-mm/msg129224.html
> Please make those requirements part of the changelog.
The requirement is described above: the database application needs
holes not to be filled implicitly. Sorry if this was not clear. I
will update the changelog and send a v2 patch.
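
As a side note on the "receive SIGBUS and die" behavior described in this
reply: a process that installs no handler is simply terminated by SIGBUS,
which is what the database wants. A process that prefers to catch the signal
cannot just return from the handler after a real fault, because the faulting
instruction would be retried and fault again; it has to exit or jump to a
recovery point. The sketch below shows one way to do that with
sigsetjmp()/siglongjmp(). All names are hypothetical, and fake_fault() only
raises the signal by hand as a stand-in for an access over a hole.

```c
#define _POSIX_C_SOURCE 200809L
#include <setjmp.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

static sigjmp_buf bus_jmp;
static volatile sig_atomic_t bus_armed;

/* SIGBUS handler: jump to the recovery point if one is armed, else die. */
static void on_sigbus(int sig, siginfo_t *info, void *ucontext)
{
	(void)sig;
	(void)info;	/* info->si_addr holds the address on a real fault */
	(void)ucontext;
	if (!bus_armed)
		_exit(1);
	bus_armed = 0;
	siglongjmp(bus_jmp, 1);
}

static int install_sigbus_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = on_sigbus;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	return sigaction(SIGBUS, &sa, NULL);
}

/* Run fn(arg); return 1 if it completed, 0 if SIGBUS cut it short. */
static int run_guarded(void (*fn)(void *), void *arg)
{
	if (sigsetjmp(bus_jmp, 1))
		return 0;
	bus_armed = 1;
	fn(arg);
	bus_armed = 0;
	return 1;
}

/* Reads one byte through a volatile pointer so the access is not elided. */
static void touch_byte(void *p)
{
	(void)*(volatile char *)p;
}

/* Stand-in for an invalid access: raises SIGBUS directly. */
static void fake_fault(void *unused)
{
	(void)unused;
	raise(SIGBUS);
}
```

With the handler installed, run_guarded(touch_byte, ptr) returns 0 instead
of killing the process when ptr lies over a hole and the kernel delivers
SIGBUS, which is the "process it in whatever way it likes" option mentioned
earlier in the thread.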
On 6/27/17 8:35 AM, Mike Rapoport wrote:
> On Tue, Jun 27, 2017 at 09:06:43AM +0200, Michal Hocko wrote:
>> This is a user-visible API, so let's CC the linux-api mailing list.
>>
>> On Mon 26-06-17 12:46:13, Prakash Sangappa wrote:
>>> In some cases, userfaultfd mechanism should just deliver a SIGBUS signal
>>> to the faulting process, instead of the page-fault event. Dealing with
>>> page-fault event using a monitor thread can be an overhead in these
>>> cases. For example applications like the database could use the signaling
>>> mechanism for robustness purpose.
>> this is rather confusing. What is the reason that the monitor would be
>> slower than signal delivery and handling?
>>
>>> Database uses hugetlbfs for performance reason. Files on hugetlbfs
>>> filesystem are created and huge pages allocated using fallocate() API.
>>> Pages are deallocated/freed using fallocate() hole punching support.
>>> These files are mmapped and accessed by many processes as shared memory.
>>> The database keeps track of which offsets in the hugetlbfs file have
>>> pages allocated.
>>>
>>> Any access to mapped address over holes in the file, which can occur due
>>> to bugs in the application, is considered invalid and expect the process
>>> to simply receive a SIGBUS. However, currently when a hole in the file is
>>> accessed via the mapped address, kernel/mm attempts to automatically
>>> allocate a page at page fault time, resulting in implicitly filling the
>>> hole in the file. This may not be the desired behavior for applications
>>> like the database that want to explicitly manage page allocations of
>>> hugetlbfs files.
>> So you register UFFD_FEATURE_SIGBUS on each region that you are unmapping
>> and then just let those offenders die?
>
> If I understand correctly, the database will create the mapping, then it'll
> open userfaultfd and register those mappings with the userfault.
> Afterwards, when the application accesses a hole userfault will cause
> SIGBUS and the application will process it in whatever way it likes, e.g.
> just die.
Yes.
> What I don't understand is why you won't use a userfault monitor process
> that will take care of the page fault events.
> It shouldn't be much overhead running it, and it can keep track of all the
> userfault file descriptors for you, and it will allow more versatile error
> handling than SIGBUS.
>
Co-ordination with the external monitor process by all the database
processes to send their userfaultfd is still an overhead.
>>> Using userfaultfd mechanism, with this support to get a signal, database
>>> application can prevent pages from being allocated implicitly when
>>> processes access mapped address over holes in the file.
>>>
>>> This patch adds the feature to request for a SIGBUS signal to userfaultfd
>>> mechanism.
>>>
>>> See following for previous discussion about the database requirement
>>> leading to this proposal as suggested by Andrea.
>>>
>>> http://www.spinics.net/lists/linux-mm/msg129224.html
>> Please make those requirements part of the changelog.
On Tue, Jun 27, 2017 at 09:01:20AM -0700, Prakash Sangappa wrote:
> On 6/27/17 8:35 AM, Mike Rapoport wrote:
>
> >On Tue, Jun 27, 2017 at 09:06:43AM +0200, Michal Hocko wrote:
> >>This is a user-visible API, so let's CC the linux-api mailing list.
> >>
> >>On Mon 26-06-17 12:46:13, Prakash Sangappa wrote:
> >>>In some cases, userfaultfd mechanism should just deliver a SIGBUS signal
> >>>to the faulting process, instead of the page-fault event. Dealing with
> >>>page-fault event using a monitor thread can be an overhead in these
> >>>cases. For example applications like the database could use the signaling
> >>>mechanism for robustness purpose.
> >>this is rather confusing. What is the reason that the monitor would be
> >>slower than signal delivery and handling?
> >>
> >>>Database uses hugetlbfs for performance reason. Files on hugetlbfs
> >>>filesystem are created and huge pages allocated using fallocate() API.
> >>>Pages are deallocated/freed using fallocate() hole punching support.
> >>>These files are mmapped and accessed by many processes as shared memory.
> >>>The database keeps track of which offsets in the hugetlbfs file have
> >>>pages allocated.
> >>>
> >>>Any access to mapped address over holes in the file, which can occur due
> >>>to bugs in the application, is considered invalid and expect the process
> >>>to simply receive a SIGBUS. However, currently when a hole in the file is
> >>>accessed via the mapped address, kernel/mm attempts to automatically
> >>>allocate a page at page fault time, resulting in implicitly filling the
> >>>hole in the file. This may not be the desired behavior for applications
> >>>like the database that want to explicitly manage page allocations of
> >>>hugetlbfs files.
> >>So you register UFFD_FEATURE_SIGBUS on each region that you are unmapping
> >>and then just let those offenders die?
> >If I understand correctly, the database will create the mapping, then it'll
> >open userfaultfd and register those mappings with the userfault.
> >Afterwards, when the application accesses a hole userfault will cause
> >SIGBUS and the application will process it in whatever way it likes, e.g.
> >just die.
>
> Yes.
>
> >What I don't understand is why you won't use a userfault monitor process
> >that will take care of the page fault events.
> >It shouldn't be much overhead running it, and it can keep track of all the
> >userfault file descriptors for you, and it will allow more versatile error
> >handling than SIGBUS.
> >
>
> Co-ordination with the external monitor process by all the database
> processes
> to send their userfaultfd is still an overhead.
You are planning to register in userfaultfd only the holes you punch to
deallocate pages, am I right?
And the co-ordination of the userfault file descriptor with the monitor
would have been added after calls to fallocate() and userfaultfd_register()?
I've just been thinking that maybe it would be possible to use
UFFD_EVENT_REMOVE for this case. We anyway need to implement the generation
of UFFD_EVENT_REMOVE for the case of hole punching in hugetlbfs for
non-cooperative userfaultfd. It could be that it will solve your issue as
well.
> >>>Using userfaultfd mechanism, with this support to get a signal, database
> >>>application can prevent pages from being allocated implicitly when
> >>>processes access mapped address over holes in the file.
> >>>
> >>>This patch adds the feature to request for a SIGBUS signal to userfaultfd
> >>>mechanism.
> >>>
> >>>See following for previous discussion about the database requirement
> >>>leading to this proposal as suggested by Andrea.
> >>>
> >>>http://www.spinics.net/lists/linux-mm/msg129224.html
> >>Please make those requirements part of the changelog.
On 6/28/17 6:18 AM, Mike Rapoport wrote:
> On Tue, Jun 27, 2017 at 09:01:20AM -0700, Prakash Sangappa wrote:
>> On 6/27/17 8:35 AM, Mike Rapoport wrote:
>>
>>> On Tue, Jun 27, 2017 at 09:06:43AM +0200, Michal Hocko wrote:
>>>> This is a user-visible API, so let's CC the linux-api mailing list.
>>>>
>>>> On Mon 26-06-17 12:46:13, Prakash Sangappa wrote:
>>>>
>>>>> Any access to mapped address over holes in the file, which can occur due
>>>>> to bugs in the application, is considered invalid and expect the process
>>>>> to simply receive a SIGBUS. However, currently when a hole in the file is
>>>>> accessed via the mapped address, kernel/mm attempts to automatically
>>>>> allocate a page at page fault time, resulting in implicitly filling the
>>>>> hole in the file. This may not be the desired behavior for applications
>>>>> like the database that want to explicitly manage page allocations of
>>>>> hugetlbfs files.
>>>> So you register UFFD_FEATURE_SIGBUS on each region that you are unmapping
>>>> and then just let those offenders die?
>>> If I understand correctly, the database will create the mapping, then it'll
>>> open userfaultfd and register those mappings with the userfault.
>>> Afterwards, when the application accesses a hole userfault will cause
>>> SIGBUS and the application will process it in whatever way it likes, e.g.
>>> just die.
>> Yes.
>>
>>> What I don't understand is why you won't use a userfault monitor process
>>> that will take care of the page fault events.
>>> It shouldn't be much overhead running it, and it can keep track of all the
>>> userfault file descriptors for you, and it will allow more versatile error
>>> handling than SIGBUS.
>>>
>> Co-ordination with the external monitor process by all the database
>> processes
>> to send their userfaultfd is still an overhead.
> You are planning to register in userfaultfd only the holes you punch to
> deallocate pages, am I right?
No, the entire mmap'ed region. The DB processes would mmap(MAP_NORESERVE)
hugetlbfs files and register this mapped address with userfaultfd once,
right after the mmap() call.
>
> And the co-ordination of the userfault file descriptor with the monitor
> would have been added after calls to fallocate() and userfaultfd_register()?
Well, the database application does not need to deal with a monitor.
>
> I've just been thinking that maybe it would be possible to use
> UFFD_EVENT_REMOVE for this case. We anyway need to implement the generation
> of UFFD_EVENT_REMOVE for the case of hole punching in hugetlbfs for
> non-cooperative userfaultfd. It could be that it will solve your issue as
> well.
>
Will this result in a signal delivery?

In the use case described, the database application does not need any event
for hole punching. Basically, it needs just a signal for any invalid access
to the mapped area over holes in the file.
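
The registration step described in this reply (register the entire mmap'ed
region once, right after mmap()) could look roughly like the sketch below.
The helper names are hypothetical. It uses UFFDIO_REGISTER with
UFFDIO_REGISTER_MODE_MISSING, which is the existing mode for faults on
missing pages; for hugetlbfs mappings the start and length are expected to
be aligned to the huge page size.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* Hypothetical helper: describe the whole mapping as one uffd range. */
static struct uffdio_register whole_mapping_request(void *addr, size_t len)
{
	struct uffdio_register reg;

	memset(&reg, 0, sizeof(reg));
	reg.range.start = (uint64_t)(uintptr_t)addr;
	reg.range.len = len;
	reg.mode = UFFDIO_REGISTER_MODE_MISSING;
	return reg;
}

/*
 * Hypothetical helper: register [addr, addr + len) with the userfaultfd
 * descriptor. Returns 0 on success and -1 on failure, like ioctl().
 */
static int register_whole_mapping(int uffd, void *addr, size_t len)
{
	struct uffdio_register reg = whole_mapping_request(addr, len);

	return ioctl(uffd, UFFDIO_REGISTER, &reg);
}
```

Because the whole mapping is registered once, no further userfaultfd calls
are needed when holes are punched later with fallocate(); any fault on a
missing page in the range takes the userfaultfd path.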
On Wed 28-06-17 11:23:32, Prakash Sangappa wrote:
>
>
> On 6/28/17 6:18 AM, Mike Rapoport wrote:
[...]
> >I've just been thinking that maybe it would be possible to use
> >UFFD_EVENT_REMOVE for this case. We anyway need to implement the generation
> >of UFFD_EVENT_REMOVE for the case of hole punching in hugetlbfs for
> >non-cooperative userfaultfd. It could be that it will solve your issue as
> >well.
> >
>
> Will this result in a signal delivery?
>
> In the use case described, the database application does not need any event
> for hole punching. Basically, just a signal for any invalid access to
> mapped area over holes in the file.
OK, but it would be better to think that through for other potential
use cases so that this doesn't end up as a hugetlb-only feature. E.g.
what should happen if regular anonymous memory gets swapped out?
Should we deliver a signal as well? How does userspace tell a missing
backing page from an unavailable backing page?
--
Michal Hocko
SUSE Labs
On Wed, Jun 28, 2017 at 11:23:32AM -0700, Prakash Sangappa wrote:
>
>
> On 6/28/17 6:18 AM, Mike Rapoport wrote:
> >On Tue, Jun 27, 2017 at 09:01:20AM -0700, Prakash Sangappa wrote:
> >>On 6/27/17 8:35 AM, Mike Rapoport wrote:
> >>
> >>>On Tue, Jun 27, 2017 at 09:06:43AM +0200, Michal Hocko wrote:
> >>>>This is a user-visible API, so let's CC the linux-api mailing list.
> >>>>
> >>>>On Mon 26-06-17 12:46:13, Prakash Sangappa wrote:
> >>>>
> >>>>>Any access to mapped address over holes in the file, which can occur due
> >>>>>to bugs in the application, is considered invalid and expect the process
> >>>>>to simply receive a SIGBUS. However, currently when a hole in the file is
> >>>>>accessed via the mapped address, kernel/mm attempts to automatically
> >>>>>allocate a page at page fault time, resulting in implicitly filling the
> >>>>>hole in the file. This may not be the desired behavior for applications
> >>>>>like the database that want to explicitly manage page allocations of
> >>>>>hugetlbfs files.
> >>>>So you register UFFD_FEATURE_SIGBUS on each region tha you are unmapping
> >>>>and than just let those offenders die?
> >>>If I understand correctly, the database will create the mapping, then it'll
> >>>open userfaultfd and register those mappings with the userfault.
> >>>Afterwards, when the application accesses a hole userfault will cause
> >>>SIGBUS and the application will process it in whatever way it likes, e.g.
> >>>just die.
> >>Yes.
> >>
> >>>What I don't understand is why won't you use userfault monitor process that
> >>>will take care of the page fault events?
> >>>It shouldn't be much overhead running it and it can keep track on all the
> >>>userfault file descriptors for you and it will allow more versatile error
> >>>handling that SIGBUS.
> >>>
> >>Co-ordination with the external monitor process by all the database
> >>processes to send their userfaultfd is still an overhead.
> >You are planning to register in userfaultfd only the holes you punch to
> >deallocate pages, am I right?
>
>
> No, the entire mmap'ed region. The DB processes would mmap(MAP_NORESERVE)
> hugetlbfs files, register this mapped address with userfaultfd once right
> after the mmap() call.
>
> >
> >And the co-ordination of the userfault file descriptor with the monitor
> >would have been added after calls to fallocate() and userfaultfd_register()?
>
> Well, the database application does not need to deal with a monitor.
>
> >
> >I've just been thinking that maybe it would be possible to use
> >UFFD_EVENT_REMOVE for this case. We anyway need to implement the generation
> >of UFFD_EVENT_REMOVE for the case of hole punching in hugetlbfs for
> >non-cooperative userfaultfd. It could be that it will solve your issue as
> >well.
> >
>
> Will this result in a signal delivery?
>
> In the use case described, the database application does not need any event
> for hole punching. Basically, just a signal for any invalid access to
> mapped area over holes in the file.
Well, what I had in mind was using a single-process uffd monitor that will
track all the userfault file descriptors. With UFFD_EVENT_REMOVE this
process will know what areas are invalid and it will be able to process the
invalid access in any way it likes, e.g. send SIGBUS to the database
application.
If you mmap() and userfaultfd_register() only at the initialization time,
it might be also possible to avoid sending userfault file descriptors to
the monitor process with UFFD_FEATURE_EVENT_FORK.
--
Sincerely yours,
Mike.
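
The monitor-side bookkeeping suggested above could look roughly like the
following. This is only a sketch under stated assumptions: the hole_set
structure, MAX_HOLES and the function names are hypothetical, the monitor
is assumed to read struct uffd_msg records from each userfault file
descriptor, and UFFD_EVENT_REMOVE is assumed to be generated for hugetlbfs
hole punching, which the thread notes still has to be implemented.

```c
#include <linux/userfaultfd.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_HOLES 64 /* arbitrary bound chosen for this sketch */

struct hole { uint64_t start, end; };  /* half-open range [start, end) */
struct hole_set { struct hole h[MAX_HOLES]; size_t n; };

/* Record a removed range; returns -1 if the set is full or the range is empty. */
static int hole_set_add(struct hole_set *s, uint64_t start, uint64_t end)
{
	if (s->n == MAX_HOLES || start >= end)
		return -1;
	s->h[s->n].start = start;
	s->h[s->n].end = end;
	s->n++;
	return 0;
}

/* Returns 1 if addr lies inside any recorded hole, 0 otherwise. */
static int hole_set_contains(const struct hole_set *s, uint64_t addr)
{
	for (size_t i = 0; i < s->n; i++)
		if (addr >= s->h[i].start && addr < s->h[i].end)
			return 1;
	return 0;
}

/*
 * Process one message read from a userfault fd.
 * Returns 1 when the message is a page fault inside a recorded hole,
 * i.e. when the monitor would treat the access as invalid; 0 otherwise.
 */
static int monitor_handle_msg(struct hole_set *s, const struct uffd_msg *msg)
{
	switch (msg->event) {
	case UFFD_EVENT_REMOVE:
		hole_set_add(s, msg->arg.remove.start, msg->arg.remove.end);
		return 0;
	case UFFD_EVENT_PAGEFAULT:
		return hole_set_contains(s, msg->arg.pagefault.address);
	default:
		return 0;
	}
}
```

When monitor_handle_msg() returns 1 the monitor could, for example,
kill() the faulting process with SIGBUS. The sketch never drops a hole
again once the range is reallocated with fallocate(); a real monitor
would need that, plus a way to learn which process faulted.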
On 06/29/2017 01:09 AM, Michal Hocko wrote:
> On Wed 28-06-17 11:23:32, Prakash Sangappa wrote:
>>
>> On 6/28/17 6:18 AM, Mike Rapoport wrote:
> [...]
>>> I've just been thinking that maybe it would be possible to use
>>> UFFD_EVENT_REMOVE for this case. We anyway need to implement the generation
>>> of UFFD_EVENT_REMOVE for the case of hole punching in hugetlbfs for
>>> non-cooperative userfaultfd. It could be that it will solve your issue as
>>> well.
>>>
>> Will this result in a signal delivery?
>>
>> In the use case described, the database application does not need any event
>> for hole punching. Basically, just a signal for any invalid access to
>> mapped area over holes in the file.
> OK, but it would be better to think that through for other potential
> usecases so that this doesn't end up as a single hugetlb feature. E.g.
> what should happen if a regular anonymous memory gets swapped out?
> Should we deliver signal as well? How does userspace tell whether this
> was a no backing page from unavailable backing page?
This may not be useful in all cases. Potentially, it could be used
together with mlock() on anonymous memory to ensure that any access
to memory that is not locked is caught, again for robustness.
On 06/29/2017 03:46 AM, Mike Rapoport wrote:
> On Wed, Jun 28, 2017 at 11:23:32AM -0700, Prakash Sangappa wrote:
[...]
>>
>> Will this result in a signal delivery?
>>
>> In the use case described, the database application does not need any event
>> for hole punching. Basically, just a signal for any invalid access to
>> mapped area over holes in the file.
>
> Well, what I had in mind was using a single-process uffd monitor that will
> track all the userfault file descriptors. With UFFD_EVENT_REMOVE this
> process will know what areas are invalid and it will be able to process the
> invalid access in any way it likes, e.g. send SIGBUS to the database
> application.
Use of a monitor process is also an overhead for the database.
>
> If you mmap() and userfaultfd_register() only at the initialization time,
> it might be also possible to avoid sending userfault file descriptors to
> the monitor process with UFFD_FEATURE_EVENT_FORK.
The new processes are always exec'd in the database case and these
processes could be mapping different files. So, not sure if
UFFD_FEATURE_EVENT_FORK will be useful. Also, it may not be one
process spawning the other new processes.
>
> --
> Sincerely yours,
> Mike.
>
[CC John, the thread started
http://lkml.kernel.org/r/[email protected]]
On Thu 29-06-17 14:41:22, prakash.sangappa wrote:
>
>
> On 06/29/2017 01:09 AM, Michal Hocko wrote:
> >On Wed 28-06-17 11:23:32, Prakash Sangappa wrote:
> >>
> >>On 6/28/17 6:18 AM, Mike Rapoport wrote:
> >[...]
> >>>I've just been thinking that maybe it would be possible to use
> >>>UFFD_EVENT_REMOVE for this case. We anyway need to implement the generation
> >>>of UFFD_EVENT_REMOVE for the case of hole punching in hugetlbfs for
> >>>non-cooperative userfaultfd. It could be that it will solve your issue as
> >>>well.
> >>>
> >>Will this result in a signal delivery?
> >>
> >>In the use case described, the database application does not need any event
> >>for hole punching. Basically, just a signal for any invalid access to
> >>mapped area over holes in the file.
> >OK, but it would be better to think that through for other potential
> >usecases so that this doesn't end up as a single hugetlb feature. E.g.
> >what should happen if a regular anonymous memory gets swapped out?
> >Should we deliver signal as well? How does userspace tell whether this
> >was a no backing page from unavailable backing page?
>
> This may not be useful in all cases. Potential, it could be used
> with use of mlock() on anonymous memory to ensure any access
> to memory that is not locked is caught, again for robustness
> purpose.
The thing I wanted to point out is that not only should this not be a
single-usecase thing (I believe others will pop up as well - see below),
but it should also be well defined, as this is a user-visible API. Please
try to write a patch to the userfaultfd man page to clarify the exact
semantics. This should help the further discussion.
As an aside, I remember that prior to MADV_FREE there was a long
discussion about lazy freeing of memory from userspace. Some users
wanted to be signalled when their memory was freed by the system so that
they could rebuild the original content (e.g. uncompressed images in
memory). It seems like MADV_FREE + this signalling could be used for
that usecase. John would surely know more about those usecases.
--
Michal Hocko
SUSE Labs
On Fri, Jun 30, 2017 at 11:47:35AM +0200, Michal Hocko wrote:
> [CC John, the thread started
> http://lkml.kernel.org/r/[email protected]]
>
> On Thu 29-06-17 14:41:22, prakash.sangappa wrote:
> >
> >
> > On 06/29/2017 01:09 AM, Michal Hocko wrote:
> > >On Wed 28-06-17 11:23:32, Prakash Sangappa wrote:
> > >>
> > >>On 6/28/17 6:18 AM, Mike Rapoport wrote:
> > >[...]
> > >>>I've just been thinking that maybe it would be possible to use
> > >>>UFFD_EVENT_REMOVE for this case. We anyway need to implement the generation
> > >>>of UFFD_EVENT_REMOVE for the case of hole punching in hugetlbfs for
> > >>>non-cooperative userfaultfd. It could be that it will solve your issue as
> > >>>well.
> > >>>
> > >>Will this result in a signal delivery?
> > >>
> > >>In the use case described, the database application does not need any event
> > >>for hole punching. Basically, just a signal for any invalid access to
> > >>mapped area over holes in the file.
> > >OK, but it would be better to think that through for other potential
> > >usecases so that this doesn't end up as a single hugetlb feature. E.g.
> > >what should happen if a regular anonymous memory gets swapped out?
> > >Should we deliver signal as well? How does userspace tell whether this
> > >was a no backing page from unavailable backing page?
> >
> > This may not be useful in all cases. Potential, it could be used
> > with use of mlock() on anonymous memory to ensure any access
> > to memory that is not locked is caught, again for robustness
> > purpose.
>
> The thing I wanted to point out is that not only this should be a single
> usecase thing (I believe others will pop out as well - see below) but it
> should also be well defined as this is a user visible API. Please try to
> write a patch to the userfaultfd man page to clarify the exact semantic.
> This should help the further discussion.
>
> As an aside, I rememeber that prior to MADV_FREE there was long
> discussion about lazy freeing of memory from userspace. Some users
> wanted to be signalled when their memory was freed by the system so that
> they could rebuild the original content (e.g. uncompressed images in
> memory). It seems like MADV_FREE + this signalling could be used for
> that usecase. John would surely know more about those usecases.
Agreed, that would provide an API equivalent to the one volatile pages
provided. So it would make it easier to adapt code (if any?) and to
drop the duplicate feature in the volatile pages code. However, it would
be faster if the userland code using the volatile pages lazy reclaim mode
was converted to poll the uffd, so the kernel talks directly to the
monitor without involving a SIGBUS signal handler, which causes
spurious enter/exit compared to the signal-less uffd API.
The main benefit in my view is not volatile pages but that
UFFD_FEATURE_SIGBUS would work equally well to enforce robustness on
all kinds of memory, not only hugetlbfs (so one could run the database
with robustness on THP over tmpfs), and the new cache can be injected
in the filesystem using UFFDIO_COPY, which is likely faster than
fallocate, as UFFDIO_COPY was already demonstrated to be faster even
than a regular page fault.
It's also simpler to handle backwards compatibility with the
UFFDIO_API call, which allows probing whether UFFD_FEATURE_SIGBUS is
supported by the running kernel regardless of kernel version (so it
can be backported and enabled by the database, without the database
noticing it's on an older kernel version).
So while this wasn't the intended way to use the userfault and I
already pointed out the possibility to use a single monitor to do all
this, I'm positive about UFFD_FEATURE_SIGBUS if the overhead of having
a monitor is so concerning.
Ultimately there are many pros and just a single con: the branch in
handle_userfault().
I wonder if it would be possible to use static_branch_enable() in
UFFDIO_API and static_branch_unlikely in handle_userfault() to
eliminate that branch but perhaps it's overkill and UFFDIO_API is
unprivileged and it would send an IPI to all CPUs. I don't think we
normally expose the static_branch_enable() to unprivileged userland
and making UFFD_FEATURE_SIGBUS a privileged op doesn't sound
attractive (although the alternative of altering a hugetlbfs mount
option would be a privileged op).
Thanks,
Andrea
On 6/30/2017 6:08 AM, Andrea Arcangeli wrote:
> On Fri, Jun 30, 2017 at 11:47:35AM +0200, Michal Hocko wrote:
[...]
>> As an aside, I rememeber that prior to MADV_FREE there was long
>> discussion about lazy freeing of memory from userspace. Some users
>> wanted to be signalled when their memory was freed by the system so that
>> they could rebuild the original content (e.g. uncompressed images in
>> memory). It seems like MADV_FREE + this signalling could be used for
>> that usecase. John would surely know more about those usecases.
> That would provide an equivalent API to the one volatile pages
> provided agreed. So it would allow to adapt code (if any?) more easily
> to drop the duplicate feature in volatile pages code (however it would
> be faster if the userland code using volatile pages lazy reclaim mode
> was converted to poll the uffd so the kernel talks directly to the
> monitor without involving a SIGBUS signal handler which will cause
> spurious enter/exit if compared to signal-less uffd API).
>
> The main benefit in my view is not volatile pages but that
> UFFD_FEATURE_SIGBUS would work equally well to enforce robustness on
> all kind of memory not only hugetlbfs (so one could run the database
> with robustness on THP over tmpfs) and the new cache can be injected
> in the filesystem using UFFDIO_COPY which is likely faster than
> fallocate as UFFDIO_COPY was already demonstrated to be faster even
> than a regular page fault.
Interesting that UFFDIO_COPY is faster than fallocate(). In the DB use case
the page does not need to be allocated at the time a process trips on
the hugetlbfs file hole and receives SIGBUS. fallocate() is called on the
hugetlbfs file by a separate process when more memory needs to be
allocated.
> It's also simpler to handle backwards compatibility with the
> UFFDIO_API call, that allows probing if UFFD_FEATURE_SIGBUS is
> supported by the running kernel regardless of kernel version (so it
> can be backported and enabled by the database, without the database
> noticing it's on a older kernel version).
Yes, this is useful as this change will need to be backported.
> So while this wasn't the intended way to use the userfault and I
> already pointed out the possibility to use a single monitor to do all
> this, I'm positive about UFFD_FEATURE_SIGBUS if the overhead of having
> a monitor is so concerning.
>
> Ultimately there are many pros and just a single cons: the branch in
> handle_userfault().
>
> I wonder if it would be possible to use static_branch_enable() in
> UFFDIO_API and static_branch_unlikely in handle_userfault() to
> eliminate that branch but perhaps it's overkill and UFFDIO_API is
> unprivileged and it would send an IPI to all CPUs. I don't think we
> normally expose the static_branch_enable() to unprivileged userland
> and making UFFD_FEATURE_SIGBUS a privileged op doesn't sound
> attractive (although the alternative of altering a hugetlbfs mount
> option would be a privileged op).
Regarding the hugetlbfs mount option, one consideration is to allow mounts
of hugetlbfs inside a user namespace's mount namespace. This would allow
non-privileged processes to mount hugetlbfs for use inside a user
namespace.
This may be needed even for the 'min_size' mount option, with which an
application could reserve huge pages and mount a filesystem for its use
without needing privileges, given that the system has enough hugepages
configured. It seems that if non-privileged processes are allowed to mount
a hugetlbfs filesystem, then min_size should be subject to some resource
limits.
Mounting inside a user namespace will be a different patch proposal later.
>
> Thanks,
> Andrea
On Fri, Jun 30, 2017 at 05:55:08PM -0700, prakash sangappa wrote:
> Interesting that UFFDIO_COPY is faster then fallocate(). In the DB use case
> the page does not need to be allocated at the time a process trips on
> the hugetlbfs
> file hole and receives SIGBUS. fallocate() is called on the hugetlbfs file,
> when more memory needs to be allocated by a separate process.
The major difference is that with UFFDIO_COPY the hugepage will be
immediately mapped into the virtual address without requiring any
further minor fault. So it's ideal if you could arrange to call
UFFDIO_COPY from the same process that is going to touch and use the
hugetlbfs data immediately after. You would eliminate a minor fault
that way.
UFFDIO_COPY at least for anon was measured to perform better than a
regular page fault too.
> Regarding hugetlbfs mount option, one consideration is to allow mounts of
> hugetlbfs inside user namespaces's mount namespace. Which would allow
> non privileged processes to mount hugetlbfs for use inside a user
> namespace.
> This may be needed even for the 'min_size' mount option using which an
> application could reserve huge pages and mount a filesystem for its use,
> with out the need to have privileges given the system has enough hugepages
> configured. It seems if non privileged processes are allowed to mount
> hugetlbfs
> filesystem, then min_size should be subject to some resource limits.
>
> Mounting inside user namespace will be a different patch proposal later.
There's no particular reason to make UFFD_FEATURE_SIGBUS a
privileged op unless we want to eliminate the branch with the static
key, so it's certainly simpler than dealing with hugetlbfs min_size
reserves.
I'm positive about the UFFD_FEATURE_SIGBUS tradeoffs, but others
feel free to comment.
If you could make a second patch extending the selftest to exercise and
validate UFFD_FEATURE_SIGBUS in anon/shmem/hugetlbfs it'd be great.
Thanks,
Andrea
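
The UFFDIO_COPY call discussed above might be wrapped as in the sketch
below. The wrapper name is hypothetical; it assumes uffd is a userfault fd
on which the destination range is already registered, and that dst and len
are aligned to the page size of the mapping (the huge page size for
hugetlbfs).

```c
#include <linux/userfaultfd.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>

/*
 * Atomically copy len bytes from src into the registered range at dst and
 * map the result into the caller's address space, so that the following
 * access does not take a further fault.
 * Returns the number of bytes copied, or -1 if the ioctl fails.
 */
static long long uffd_copy_range(int uffd, void *dst, const void *src,
				 size_t len)
{
	struct uffdio_copy copy = {
		.dst = (uintptr_t)dst,
		.src = (uintptr_t)src,
		.len = len,
		.mode = 0, /* default: wake any thread waiting on the range */
	};

	if (ioctl(uffd, UFFDIO_COPY, &copy) < 0)
		return -1;
	return (long long)copy.copy;
}
```

The sketch treats every ioctl failure the same way; a real caller would
look at errno and at the copy field to handle partial copies.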
On Fri, Jun 30, 2017 at 2:47 AM, Michal Hocko <[email protected]> wrote:
> [CC John, the thread started
> http://lkml.kernel.org/r/[email protected]]
>
> On Thu 29-06-17 14:41:22, prakash.sangappa wrote:
>>
>>
>> On 06/29/2017 01:09 AM, Michal Hocko wrote:
>> >On Wed 28-06-17 11:23:32, Prakash Sangappa wrote:
>> >>
>> >>On 6/28/17 6:18 AM, Mike Rapoport wrote:
>> >[...]
>> >>>I've just been thinking that maybe it would be possible to use
>> >>>UFFD_EVENT_REMOVE for this case. We anyway need to implement the generation
>> >>>of UFFD_EVENT_REMOVE for the case of hole punching in hugetlbfs for
>> >>>non-cooperative userfaultfd. It could be that it will solve your issue as
>> >>>well.
>> >>>
>> >>Will this result in a signal delivery?
>> >>
>> >>In the use case described, the database application does not need any event
>> >>for hole punching. Basically, just a signal for any invalid access to
>> >>mapped area over holes in the file.
>> >OK, but it would be better to think that through for other potential
>> >usecases so that this doesn't end up as a single hugetlb feature. E.g.
>> >what should happen if a regular anonymous memory gets swapped out?
>> >Should we deliver signal as well? How does userspace tell whether this
>> >was a no backing page from unavailable backing page?
>>
>> This may not be useful in all cases. Potential, it could be used
>> with use of mlock() on anonymous memory to ensure any access
>> to memory that is not locked is caught, again for robustness
>> purpose.
>
> The thing I wanted to point out is that not only this should be a single
> usecase thing (I believe others will pop out as well - see below) but it
> should also be well defined as this is a user visible API. Please try to
> write a patch to the userfaultfd man page to clarify the exact semantic.
> This should help the further discussion.
>
> As an aside, I rememeber that prior to MADV_FREE there was long
> discussion about lazy freeing of memory from userspace. Some users
> wanted to be signalled when their memory was freed by the system so that
> they could rebuild the original content (e.g. uncompressed images in
> memory). It seems like MADV_FREE + this signalling could be used for
> that usecase. John would surely know more about those usecases.
Sorry for being slow to reply here. The main usecase for Android is
explicit marking and unmarking of volatile pages, where userspace
is notified if any pages were purged when it sets a page range
non-volatile, and no accesses to volatile pages are made before they are
marked non-volatile.
As part of my generalization of the API, there were other users
interested in marking pages volatile and then optimistically
using the pages w/o marking them non-volatile. Only when the user
touched a purged volatile page would they get a signal, which they could
handle to mark the pages non-volatile and re-generate the data.
This second use case seems like it would be potentially doable with
the userfaultfd interface, but I'm not sure I see how we could fit the
first use case (which Android's ashmem provides) with it (at least in
an efficient way).
thanks
-john
On 07/04/2017 09:40 AM, Andrea Arcangeli wrote:
> On Fri, Jun 30, 2017 at 05:55:08PM -0700, prakash sangappa wrote:
>> Interesting that UFFDIO_COPY is faster then fallocate(). In the DB use case
>> the page does not need to be allocated at the time a process trips on
>> the hugetlbfs
>> file hole and receives SIGBUS. fallocate() is called on the hugetlbfs file,
>> when more memory needs to be allocated by a separate process.
> The major difference is that with UFFDIO_COPY the hugepage will be
> immediately mapped into the virtual address without requiring any
> further minor fault. So it's ideal if you could arrange to call
> UFFDIO_COPY from the same process that is going to touch and use the
> hugetlbfs data immediately after. You would eliminate a minor fault
> that way.
Ok, we will see how it could be used in the DB use case.
>
> UFFDIO_COPY at least for anon was measured to perform better than a
> regular page fault too.
>> Regarding hugetlbfs mount option, one consideration is to allow mounts of
>> hugetlbfs inside user namespaces's mount namespace. Which would allow
>> non privileged processes to mount hugetlbfs for use inside a user
>> namespace.
>> This may be needed even for the 'min_size' mount option using which an
>> application could reserve huge pages and mount a filesystem for its use,
>> with out the need to have privileges given the system has enough hugepages
>> configured. It seems if non privileged processes are allowed to mount
>> hugetlbfs
>> filesystem, then min_size should be subject to some resource limits.
>>
>> Mounting inside user namespace will be a different patch proposal later.
> There's no particular reason to make UFFDIO_FEATURE_SIGBUS a
> privileged op unless we want to eliminate the branch with the static
> key, so it's certainly simpler than dealing with hugetlbfs min_size
> reserves.
OK, so for now I will not make UFFD_FEATURE_SIGBUS a privileged op
and will not use the static key to eliminate the branch.
> I'm positive about the UFFDIO_FEATURE_SIGBUS tradeoffs, but others
> feel free to comment.
>
> If you could make second patch to extend the selftest to exercise and
> validates UFFDIO_FEATURE_SIGBUS in anon/shmem/hugetlbfs it'd be great.
Sure, I will update the tests and send a patch.
Thanks,
-Prakash.
>
> Thanks,
> Andrea