This adds 2 helpers to change the locked_vm counter:
- try_increment_locked_vm - may fail if the new locked_vm value would exceed
the RLIMIT_MEMLOCK limit;
- decrement_locked_vm.
These will be used by drivers that lock memory on userspace's behalf. For
example, VFIO can use them to check whether it can lock DMA memory, and
PPC-KVM can use them to check whether it can lock memory for TCE tables.
Signed-off-by: Alexey Kardashevskiy <[email protected]>
---
include/linux/mm.h | 3 +++
mm/mlock.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 52 insertions(+)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index e03dd29..1cb219d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2113,5 +2113,8 @@ void __init setup_nr_node_ids(void);
static inline void setup_nr_node_ids(void) {}
#endif
+extern long try_increment_locked_vm(long npages);
+extern void decrement_locked_vm(long npages);
+
#endif /* __KERNEL__ */
#endif /* _LINUX_MM_H */
diff --git a/mm/mlock.c b/mm/mlock.c
index b1eb536..39e4b55 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -864,3 +864,52 @@ void user_shm_unlock(size_t size, struct user_struct *user)
spin_unlock(&shmlock_user_lock);
free_uid(user);
}
+
+/**
+ * try_increment_locked_vm() - checks if new locked_vm value is going to
+ * be less than RLIMIT_MEMLOCK and increments it by npages if it is.
+ *
+ * @npages: the number of pages to add to locked_vm.
+ *
+ * Returns 0 on success or a negative value on failure.
+ */
+long try_increment_locked_vm(long npages)
+{
+ long ret = 0, locked, lock_limit;
+
+ if (!current || !current->mm)
+ return -ESRCH; /* process exited */
+
+ down_write(&current->mm->mmap_sem);
+ locked = current->mm->locked_vm + npages;
+ lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+ if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
+ pr_warn("RLIMIT_MEMLOCK (%ld) exceeded\n",
+ rlimit(RLIMIT_MEMLOCK));
+ ret = -ENOMEM;
+ } else {
+ current->mm->locked_vm += npages;
+ }
+ up_write(&current->mm->mmap_sem);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(try_increment_locked_vm);
+
+/**
+ * decrement_locked_vm() - decrements the current task's locked_vm counter.
+ *
+ * @npages: the number to decrement by.
+ */
+void decrement_locked_vm(long npages)
+{
+ if (!current || !current->mm)
+ return; /* process exited */
+
+ down_write(&current->mm->mmap_sem);
+ if (npages > current->mm->locked_vm)
+ npages = current->mm->locked_vm;
+ current->mm->locked_vm -= npages;
+ up_write(&current->mm->mmap_sem);
+}
+EXPORT_SYMBOL_GPL(decrement_locked_vm);
--
2.0.0
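The semantics the two helpers implement can be modelled in plain userspace C. This is only a sketch: the struct and function names below are invented for illustration, and the counter, limit, and capability bit are passed in explicitly instead of coming from current->mm, rlimit(RLIMIT_MEMLOCK) and capable(CAP_IPC_LOCK), with no mmap_sem involved.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model of the patch's accounting state. */
struct mm_model {
	long locked_vm;    /* pages currently accounted as locked */
	long lock_limit;   /* RLIMIT_MEMLOCK expressed in pages */
	bool cap_ipc_lock; /* CAP_IPC_LOCK bypasses the limit */
};

/* Mirrors try_increment_locked_vm(): refuse (the kernel returns -ENOMEM,
 * modelled here as -1) when the new value would exceed the limit and the
 * task lacks CAP_IPC_LOCK; otherwise account the pages. */
long model_try_increment(struct mm_model *mm, long npages)
{
	long locked = mm->locked_vm + npages;

	if (locked > mm->lock_limit && !mm->cap_ipc_lock)
		return -1;
	mm->locked_vm = locked;
	return 0;
}

/* Mirrors decrement_locked_vm(): the counter is clamped at zero rather
 * than allowed to underflow. */
void model_decrement(struct mm_model *mm, long npages)
{
	if (npages > mm->locked_vm)
		npages = mm->locked_vm;
	mm->locked_vm -= npages;
}
```

Note that, exactly as in the patch, nothing here touches actual pages; only the accounting counter moves, which is what the review below objects to.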
On Wed, Jul 30, 2014 at 07:28:13PM +1000, Alexey Kardashevskiy wrote:
> This adds 2 helpers to change the locked_vm counter:
> - try_increment_locked_vm - may fail if the new locked_vm value would exceed
> the RLIMIT_MEMLOCK limit;
> - decrement_locked_vm.
>
> These will be used by drivers that lock memory on userspace's behalf. For
> example, VFIO can use them to check whether it can lock DMA memory, and
> PPC-KVM can use them to check whether it can lock memory for TCE tables.
Urgh no.. have a look here:
lkml.kernel.org/r/[email protected]
Someone needs to go help with the IB stuff; I got properly lost in
there.
On Wed, 2014-07-30 at 19:28 +1000, Alexey Kardashevskiy wrote:
> This adds 2 helpers to change the locked_vm counter:
> - try_increment_locked_vm - may fail if the new locked_vm value would exceed
> the RLIMIT_MEMLOCK limit;
> - decrement_locked_vm.
>
> These will be used by drivers that lock memory on userspace's behalf. For
> example, VFIO can use them to check whether it can lock DMA memory, and
> PPC-KVM can use them to check whether it can lock memory for TCE tables.
>
> Signed-off-by: Alexey Kardashevskiy <[email protected]>
> ---
> include/linux/mm.h | 3 +++
> mm/mlock.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 52 insertions(+)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index e03dd29..1cb219d 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2113,5 +2113,8 @@ void __init setup_nr_node_ids(void);
> static inline void setup_nr_node_ids(void) {}
> #endif
>
> +extern long try_increment_locked_vm(long npages);
> +extern void decrement_locked_vm(long npages);
> +
> #endif /* __KERNEL__ */
> #endif /* _LINUX_MM_H */
> diff --git a/mm/mlock.c b/mm/mlock.c
> index b1eb536..39e4b55 100644
> --- a/mm/mlock.c
> +++ b/mm/mlock.c
> @@ -864,3 +864,52 @@ void user_shm_unlock(size_t size, struct user_struct *user)
> spin_unlock(&shmlock_user_lock);
> free_uid(user);
> }
> +
> +/**
> + * try_increment_locked_vm() - checks if new locked_vm value is going to
> + * be less than RLIMIT_MEMLOCK and increments it by npages if it is.
> + *
> + * @npages: the number of pages to add to locked_vm.
> + *
> + * Returns 0 on success or a negative value on failure.
> + */
> +long try_increment_locked_vm(long npages)
mlock calls work at an address granularity...
> +{
> + long ret = 0, locked, lock_limit;
> +
> + if (!current || !current->mm)
> + return -ESRCH; /* process exited */
It doesn't strike me that this is the place for this. It would seem that
it would be the caller's responsibility to make sure of this (and not
sure how !current can happen...).
> +
> + down_write(&current->mm->mmap_sem);
> + locked = current->mm->locked_vm + npages;
> + lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
nit: please set locked and lock_limit before taking the mmap_sem.
> + if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
> + pr_warn("RLIMIT_MEMLOCK (%ld) exceeded\n",
> + rlimit(RLIMIT_MEMLOCK));
> + ret = -ENOMEM;
> + } else {
It would be nicer to have it the other way around, leave the #else for
ENOMEM. It reads better, imho.
> + current->mm->locked_vm += npages;
More importantly just setting locked_vm is not enough. You'll need to
call do_mlock() here (again, addr granularity ;). This also applies to
your decrement_locked_vm().
Thanks,
Davidlohr
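Davidlohr's structural suggestions (compute the limit before taking the lock, keep the success path first so the else branch is the ENOMEM case) could look roughly like this. This is a userspace sketch under stated assumptions: a pthread mutex stands in for mmap_sem, the struct fields stand in for current->mm, rlimit() and capable(), and all names here are invented. Note only lock_limit is safely computed before the lock; `locked` still reads locked_vm, so it stays under the lock.

```c
#include <assert.h>
#include <pthread.h>

/* Invented stand-in for the task's mm state. */
struct task_model {
	pthread_mutex_t mmap_sem;  /* stand-in for mm->mmap_sem */
	long locked_vm;            /* pages accounted as locked */
	long memlock_limit_pages;  /* rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT */
	int cap_ipc_lock;          /* capable(CAP_IPC_LOCK) */
};

long sketch_try_increment(struct task_model *t, long npages)
{
	long ret = 0;
	/* The limit does not depend on mm state, so it can be read
	 * before taking the lock, per the review nit above. */
	long lock_limit = t->memlock_limit_pages;
	long locked;

	pthread_mutex_lock(&t->mmap_sem);
	locked = t->locked_vm + npages;
	/* Success path first; the else branch is the failure case. */
	if (locked <= lock_limit || t->cap_ipc_lock)
		t->locked_vm = locked;
	else
		ret = -1;	/* -ENOMEM in the kernel */
	pthread_mutex_unlock(&t->mmap_sem);

	return ret;
}
```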
On 07/30/2014 08:31 PM, Davidlohr Bueso wrote:
> On Wed, 2014-07-30 at 19:28 +1000, Alexey Kardashevskiy wrote:
>> This adds 2 helpers to change the locked_vm counter:
>> - try_increment_locked_vm - may fail if the new locked_vm value would exceed
>> the RLIMIT_MEMLOCK limit;
>> - decrement_locked_vm.
>>
>> These will be used by drivers that lock memory on userspace's behalf. For
>> example, VFIO can use them to check whether it can lock DMA memory, and
>> PPC-KVM can use them to check whether it can lock memory for TCE tables.
>>
>> Signed-off-by: Alexey Kardashevskiy <[email protected]>
>> ---
>> include/linux/mm.h | 3 +++
>> mm/mlock.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++++
>> 2 files changed, 52 insertions(+)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index e03dd29..1cb219d 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -2113,5 +2113,8 @@ void __init setup_nr_node_ids(void);
>> static inline void setup_nr_node_ids(void) {}
>> #endif
>>
>> +extern long try_increment_locked_vm(long npages);
>> +extern void decrement_locked_vm(long npages);
>> +
>> #endif /* __KERNEL__ */
>> #endif /* _LINUX_MM_H */
>> diff --git a/mm/mlock.c b/mm/mlock.c
>> index b1eb536..39e4b55 100644
>> --- a/mm/mlock.c
>> +++ b/mm/mlock.c
>> @@ -864,3 +864,52 @@ void user_shm_unlock(size_t size, struct user_struct *user)
>> spin_unlock(&shmlock_user_lock);
>> free_uid(user);
>> }
>> +
>> +/**
>> + * try_increment_locked_vm() - checks if new locked_vm value is going to
>> + * be less than RLIMIT_MEMLOCK and increments it by npages if it is.
>> + *
>> + * @npages: the number of pages to add to locked_vm.
>> + *
>> + * Returns 0 on success or a negative value on failure.
>> + */
>> +long try_increment_locked_vm(long npages)
>
> mlock calls work at an address granularity...
>
>> +{
>> + long ret = 0, locked, lock_limit;
>> +
>> + if (!current || !current->mm)
>> + return -ESRCH; /* process exited */
>
> It doesn't strike me that this is the place for this. It would seem that
> it would be the caller's responsibility to make sure of this (and not
> sure how !current can happen...).
>
>> +
>> + down_write(&current->mm->mmap_sem);
>> + locked = current->mm->locked_vm + npages;
>> + lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
>
> nit: please set locked and lock_limit before taking the mmap_sem.
>
>> + if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
>> + pr_warn("RLIMIT_MEMLOCK (%ld) exceeded\n",
>> + rlimit(RLIMIT_MEMLOCK));
>> + ret = -ENOMEM;
>> + } else {
>
> It would be nicer to have it the other way around, leave the #else for
> ENOMEM. It reads better, imho.
>
>> + current->mm->locked_vm += npages;
>
> More importantly just setting locked_vm is not enough. You'll need to
> call do_mlock() here (again, addr granularity ;). This also applies to
> your decrement_locked_vm().
Uff. Bad commit log :(
No, this is not my intention here. Here I only want to increment the counter.
The whole problem is like this: there is VFIO (PCI passthrough), and a guest
which gets a real PCI device wants to use some of its RAM for DMA, so we
need to pin that memory. The PPC64-pseries specifics are:
1. only part of guest RAM, the so-called "window", can be used for DMA, and
we do not know in advance which part of guest RAM will have to be pinned; the
window is never guaranteed to have a specific size such as "whole guest RAM",
and even if we wanted to pin the entire guest RAM, we could not, as we do not
know the guest's RAM size if it is not KVM;
2. we could do the counting and locking on the fly, but that is not
possible from real mode (MMU off) and would slow things down.
So the trick is that we do not let the guest (QEMU in full emulation or KVM,
it does not matter here) use VFIO at all if it cannot increment the locked_vm
counter in advance. No actual locking needs to be done at the moment the
guest starts.
--
Alexey
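The pre-accounting flow described above - charge the whole DMA window to locked_vm once at container setup, deny VFIO to the guest if that fails, and give the pages back at teardown - could be sketched like this. Again a userspace model, not the real VFIO code path; `container_setup` and the other names are invented for illustration, and the limit check is the same one try_increment_locked_vm() performs.

```c
#include <assert.h>

/* Invented model of the accounting state the container would charge. */
struct accounting_model {
	long locked_vm;  /* pages accounted as locked */
	long lock_limit; /* RLIMIT_MEMLOCK in pages */
};

/* Charge the entire window up front; failure here means the guest does
 * not get to use VFIO at all, so no accounting is ever needed later
 * from real mode. */
int container_setup(struct accounting_model *a, long window_pages)
{
	if (a->locked_vm + window_pages > a->lock_limit)
		return -1; /* -ENOMEM in the kernel */
	a->locked_vm += window_pages;
	return 0;
}

/* At teardown, return the whole window's worth of pages, clamping so
 * the counter cannot underflow. */
void container_teardown(struct accounting_model *a, long window_pages)
{
	if (window_pages > a->locked_vm)
		window_pages = a->locked_vm;
	a->locked_vm -= window_pages;
}
```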
On Wed, Jul 30, 2014 at 10:30:48PM +1000, Alexey Kardashevskiy wrote:
>
> No, this is not my intention here. Here I only want to increment the counter.
Full and hard nack on that. It should always be tied to actual pages; we
should not detach this and make it 'a number'.