2020-06-16 07:58:14

by Andrei Vagin

Subject: [PATCH 2/6] arm64/vdso: Zap vvar pages when switching to a time namespace

The VVAR page layout depends on whether a task belongs to the root or
non-root time namespace. Whenever a task changes its namespace, the VVAR
page tables are cleared and then they will be re-faulted with a
corresponding layout.

Reviewed-by: Vincenzo Frascino <[email protected]>
Reviewed-by: Dmitry Safonov <[email protected]>
Signed-off-by: Andrei Vagin <[email protected]>
---
arch/arm64/kernel/vdso.c | 32 ++++++++++++++++++++++++++++++++
1 file changed, 32 insertions(+)

diff --git a/arch/arm64/kernel/vdso.c b/arch/arm64/kernel/vdso.c
index b0aec4e8c9b4..df4bb736d28a 100644
--- a/arch/arm64/kernel/vdso.c
+++ b/arch/arm64/kernel/vdso.c
@@ -125,6 +125,38 @@ static int __vdso_init(enum vdso_abi abi)
return 0;
}

+#ifdef CONFIG_TIME_NS
+/*
+ * The vvar page layout depends on whether a task belongs to the root or
+ * non-root time namespace. Whenever a task changes its namespace, the VVAR
+ * page tables are cleared and then they will be re-faulted with a
+ * corresponding layout.
+ * See also the comment near timens_setup_vdso_data() for details.
+ */
+int vdso_join_timens(struct task_struct *task, struct time_namespace *ns)
+{
+	struct mm_struct *mm = task->mm;
+	struct vm_area_struct *vma;
+
+	if (mmap_write_lock_killable(mm))
+		return -EINTR;
+
+	for (vma = mm->mmap; vma; vma = vma->vm_next) {
+		unsigned long size = vma->vm_end - vma->vm_start;
+
+		if (vma_is_special_mapping(vma, vdso_info[VDSO_ABI_AA64].dm))
+			zap_page_range(vma, vma->vm_start, size);
+#ifdef CONFIG_COMPAT_VDSO
+		if (vma_is_special_mapping(vma, vdso_info[VDSO_ABI_AA32].dm))
+			zap_page_range(vma, vma->vm_start, size);
+#endif
+	}
+
+	mmap_write_unlock(mm);
+	return 0;
+}
+#endif
+
static vm_fault_t vvar_fault(const struct vm_special_mapping *sm,
struct vm_area_struct *vma, struct vm_fault *vmf)
{
--
2.24.1
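
The zap-then-refault behaviour described in the commit message can be pictured with a small userspace model. This is only a sketch under stated assumptions: every `toy_*` name below is hypothetical and none of them is a kernel interface. The model also folds the namespace switch and the zap into one helper, whereas in the patch vdso_join_timens() only does the zapping.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a time namespace: just a clock offset. */
struct toy_ns {
	long offset_sec;
};

/* Hypothetical stand-in for the vvar mapping of one address space. */
struct toy_vvar_mapping {
	const struct toy_ns *mapped;	/* NULL models "no page table entry" */
	unsigned long faults;		/* number of times the fault path ran */
};

struct toy_task {
	const struct toy_ns *ns;	/* namespace the task currently belongs to */
	struct toy_vvar_mapping vvar;
};

/* Models the fault handler: map in data for the task's current namespace. */
static void toy_vvar_fault(struct toy_task *t)
{
	t->vvar.mapped = t->ns;
	t->vvar.faults++;
}

/* Models a vDSO read: fault the data in on first access, then use it. */
static long toy_read_offset(struct toy_task *t)
{
	if (!t->vvar.mapped)
		toy_vvar_fault(t);
	return t->vvar.mapped->offset_sec;
}

/* Models joining a namespace: switch, then zap so the old data is dropped. */
static void toy_join_timens(struct toy_task *t, const struct toy_ns *ns)
{
	t->ns = ns;
	t->vvar.mapped = NULL;
}

/* Small self-check; call it from main() to exercise the model. */
static void toy_demo(void)
{
	struct toy_ns root = { 0 }, child = { 100 };
	struct toy_task t = { &root, { NULL, 0 } };

	assert(toy_read_offset(&t) == 0);	/* first read faults in root data */
	toy_join_timens(&t, &child);
	assert(toy_read_offset(&t) == 100);	/* refault picks up the new namespace */
	assert(t.vvar.faults == 2);
}
```

Without the zap in toy_join_timens(), the second read would keep returning the data that was mapped for the old namespace, which is the situation the patch is meant to avoid.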


2020-06-16 11:27:56

by Mark Rutland

Subject: Re: [PATCH 2/6] arm64/vdso: Zap vvar pages when switching to a time namespace

On Tue, Jun 16, 2020 at 12:55:41AM -0700, Andrei Vagin wrote:
> The VVAR page layout depends on whether a task belongs to the root or
> non-root time namespace.

Please be more explicit as to what you mean by `layout` here, as that seems to
be an overloaded term. For example, the comment above timens_setup_vdso_data()
can be read to directly contradict this:

| A time namespace VVAR page has the same layout as the VVAR page which
| contains the system wide VDSO data.

... I think what you're trying to say is that when we add time namespace support,
we'll have multiple VVAR pages, and their position in the address space depends
on whether the task is part of a time namespace.

> Whenever a task changes its namespace, the VVAR
> page tables are cleared and then they will be re-faulted with a
> corresponding layout.

How does this work for multi-threaded applications? Are there any
concerns w.r.t. atomicity of the change?

> Reviewed-by: Vincenzo Frascino <[email protected]>
> Reviewed-by: Dmitry Safonov <[email protected]>
> Signed-off-by: Andrei Vagin <[email protected]>
> ---
> arch/arm64/kernel/vdso.c | 32 ++++++++++++++++++++++++++++++++
> 1 file changed, 32 insertions(+)
>
> diff --git a/arch/arm64/kernel/vdso.c b/arch/arm64/kernel/vdso.c
> index b0aec4e8c9b4..df4bb736d28a 100644
> --- a/arch/arm64/kernel/vdso.c
> +++ b/arch/arm64/kernel/vdso.c
> @@ -125,6 +125,38 @@ static int __vdso_init(enum vdso_abi abi)
> return 0;
> }
>
> +#ifdef CONFIG_TIME_NS
> +/*
> + * The vvar page layout depends on whether a task belongs to the root or
> + * non-root time namespace. Whenever a task changes its namespace, the VVAR
> + * page tables are cleared and then they will be re-faulted with a
> + * corresponding layout.
> + * See also the comment near timens_setup_vdso_data() for details.
> + */

As with the commit message, this is not very clear and can be read to
contradict the comment it refers to, which is rather unhelpful.

How about:

/*
 * The vvar mapping contains data for a specific time namespace, so when
 * a task changes namespace we must unmap its vvar data for the old
 * namespace. Subsequent faults will map in data for the new namespace.
 *
 * For more details see timens_setup_vdso_data().
 */

Thanks,
Mark.

> +int vdso_join_timens(struct task_struct *task, struct time_namespace *ns)
> +{
> + struct mm_struct *mm = task->mm;
> + struct vm_area_struct *vma;
> +
> + if (mmap_write_lock_killable(mm))
> + return -EINTR;
> +
> + for (vma = mm->mmap; vma; vma = vma->vm_next) {
> + unsigned long size = vma->vm_end - vma->vm_start;
> +
> + if (vma_is_special_mapping(vma, vdso_info[VDSO_ABI_AA64].dm))
> + zap_page_range(vma, vma->vm_start, size);
> +#ifdef CONFIG_COMPAT_VDSO
> + if (vma_is_special_mapping(vma, vdso_info[VDSO_ABI_AA32].dm))
> + zap_page_range(vma, vma->vm_start, size);
> +#endif
> + }
> +
> + mmap_write_unlock(mm);
> + return 0;
> +}
> +#endif
> +
> static vm_fault_t vvar_fault(const struct vm_special_mapping *sm,
> struct vm_area_struct *vma, struct vm_fault *vmf)
> {
> --
> 2.24.1
>

2020-06-16 13:52:06

by Dmitry Safonov

Subject: Re: [PATCH 2/6] arm64/vdso: Zap vvar pages when switching to a time namespace

Hi Mark,

On 6/16/20 12:24 PM, Mark Rutland wrote:
> On Tue, Jun 16, 2020 at 12:55:41AM -0700, Andrei Vagin wrote:
[..]
>> Whenever a task changes its namespace, the VVAR
>> page tables are cleared and then they will be re-faulted with a
>> corresponding layout.
>
> How does this work for multi-threaded applications? Are there any
> concerns w.r.t. atomicity of the change?

Multi-threaded applications can't setns() for time namespace,
timens_install():

: if (!current_is_single_threaded())
: return -EUSERS;

Thanks,
Dmitry
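
The guard Dmitry quotes can be modelled in a few lines of userspace C. This is a hedged sketch, not the kernel code: `toy_mm`, `toy_current_is_single_threaded()` and `toy_timens_install()` are hypothetical names, and the `users` field is an assumed simplification of however the kernel tracks threads sharing an address space. It only illustrates the point made above: a caller whose address space is shared with other threads is refused before any vvar state is touched.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in: number of threads sharing one address space. */
struct toy_mm {
	int users;
};

/* Models the single-threaded check quoted from timens_install(). */
static int toy_current_is_single_threaded(const struct toy_mm *mm)
{
	return mm->users == 1;
}

/* Models the install step: refuse multi-threaded callers up front. */
static int toy_timens_install(const struct toy_mm *mm)
{
	if (!toy_current_is_single_threaded(mm))
		return -EUSERS;
	return 0;
}

/* Small self-check; call it from main() to exercise the model. */
static void toy_demo(void)
{
	struct toy_mm one = { 1 }, many = { 4 };

	assert(toy_timens_install(&one) == 0);
	assert(toy_timens_install(&many) == -EUSERS);
}
```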

2020-06-19 15:43:18

by Christian Brauner

Subject: Re: [PATCH 2/6] arm64/vdso: Zap vvar pages when switching to a time namespace

On Tue, Jun 16, 2020 at 12:55:41AM -0700, Andrei Vagin wrote:
> The VVAR page layout depends on whether a task belongs to the root or
> non-root time namespace. Whenever a task changes its namespace, the VVAR
> page tables are cleared and then they will be re-faulted with a
> corresponding layout.
>
> Reviewed-by: Vincenzo Frascino <[email protected]>
> Reviewed-by: Dmitry Safonov <[email protected]>
> Signed-off-by: Andrei Vagin <[email protected]>
> ---
> arch/arm64/kernel/vdso.c | 32 ++++++++++++++++++++++++++++++++
> 1 file changed, 32 insertions(+)
>
> diff --git a/arch/arm64/kernel/vdso.c b/arch/arm64/kernel/vdso.c
> index b0aec4e8c9b4..df4bb736d28a 100644
> --- a/arch/arm64/kernel/vdso.c
> +++ b/arch/arm64/kernel/vdso.c
> @@ -125,6 +125,38 @@ static int __vdso_init(enum vdso_abi abi)
> return 0;
> }
>
> +#ifdef CONFIG_TIME_NS
> +/*
> + * The vvar page layout depends on whether a task belongs to the root or
> + * non-root time namespace. Whenever a task changes its namespace, the VVAR
> + * page tables are cleared and then they will be re-faulted with a
> + * corresponding layout.
> + * See also the comment near timens_setup_vdso_data() for details.
> + */
> +int vdso_join_timens(struct task_struct *task, struct time_namespace *ns)
> +{
> + struct mm_struct *mm = task->mm;
> + struct vm_area_struct *vma;
> +
> + if (mmap_write_lock_killable(mm))
> + return -EINTR;

Hey,

Just a heads-up I'm about to plumb CLONE_NEWTIME support into setns()
which would mean that vdso_join_timens() would not be allowed to fail
anymore to make it easy to switch to multiple namespaces atomically. So
this would probably need to be changed to mmap_write_lock() which I've
already brought up upstream:
https://lore.kernel.org/lkml/20200611110221.pgd3r5qkjrjmfqa2@wittgenstein/
(Assuming that people agree. I just sent the series and most people here
are Cced.)

Thanks!
Christian

2020-06-23 07:37:10

by Andrei Vagin

Subject: Re: [PATCH 2/6] arm64/vdso: Zap vvar pages when switching to a time namespace

On Fri, Jun 19, 2020 at 05:38:12PM +0200, Christian Brauner wrote:
> On Tue, Jun 16, 2020 at 12:55:41AM -0700, Andrei Vagin wrote:
> > The VVAR page layout depends on whether a task belongs to the root or
> > non-root time namespace. Whenever a task changes its namespace, the VVAR
> > page tables are cleared and then they will be re-faulted with a
> > corresponding layout.
> >
> > Reviewed-by: Vincenzo Frascino <[email protected]>
> > Reviewed-by: Dmitry Safonov <[email protected]>
> > Signed-off-by: Andrei Vagin <[email protected]>
> > ---
> > arch/arm64/kernel/vdso.c | 32 ++++++++++++++++++++++++++++++++
> > 1 file changed, 32 insertions(+)
> >
> > diff --git a/arch/arm64/kernel/vdso.c b/arch/arm64/kernel/vdso.c
> > index b0aec4e8c9b4..df4bb736d28a 100644
> > --- a/arch/arm64/kernel/vdso.c
> > +++ b/arch/arm64/kernel/vdso.c
> > @@ -125,6 +125,38 @@ static int __vdso_init(enum vdso_abi abi)
> > return 0;
> > }
> >
> > +#ifdef CONFIG_TIME_NS
> > +/*
> > + * The vvar page layout depends on whether a task belongs to the root or
> > + * non-root time namespace. Whenever a task changes its namespace, the VVAR
> > + * page tables are cleared and then they will be re-faulted with a
> > + * corresponding layout.
> > + * See also the comment near timens_setup_vdso_data() for details.
> > + */
> > +int vdso_join_timens(struct task_struct *task, struct time_namespace *ns)
> > +{
> > + struct mm_struct *mm = task->mm;
> > + struct vm_area_struct *vma;
> > +
> > + if (mmap_write_lock_killable(mm))
> > + return -EINTR;
>
> Hey,
>
> Just a heads-up I'm about to plumb CLONE_NEWTIME support into setns()

Hmm. I am not sure that I understand what you mean. I think setns(nsfd,
CLONE_NEWTIME) works now. For example, we use it in
tools/testing/selftests/timens/timens.c. Do you mean setns(pidfd,
CLONE_NEWTIME | CLONE_something)?

> which would mean that vdso_join_timens() would not be allowed to fail
> anymore to make it easy to switch to multiple namespaces atomically. So
> this would probably need to be changed to mmap_write_lock() which I've
> already brought up upstream:
> https://lore.kernel.org/lkml/20200611110221.pgd3r5qkjrjmfqa2@wittgenstein/
> (Assuming that people agree. I just sent the series and most people here
> are Cced.)
>
> Thanks!
> Christian

2020-06-23 08:45:44

by Christian Brauner

Subject: Re: [PATCH 2/6] arm64/vdso: Zap vvar pages when switching to a time namespace

On Tue, Jun 23, 2020 at 12:33:05AM -0700, Andrei Vagin wrote:
> On Fri, Jun 19, 2020 at 05:38:12PM +0200, Christian Brauner wrote:
> > On Tue, Jun 16, 2020 at 12:55:41AM -0700, Andrei Vagin wrote:
> > > The VVAR page layout depends on whether a task belongs to the root or
> > > non-root time namespace. Whenever a task changes its namespace, the VVAR
> > > page tables are cleared and then they will be re-faulted with a
> > > corresponding layout.
> > >
> > > Reviewed-by: Vincenzo Frascino <[email protected]>
> > > Reviewed-by: Dmitry Safonov <[email protected]>
> > > Signed-off-by: Andrei Vagin <[email protected]>
> > > ---
> > > arch/arm64/kernel/vdso.c | 32 ++++++++++++++++++++++++++++++++
> > > 1 file changed, 32 insertions(+)
> > >
> > > diff --git a/arch/arm64/kernel/vdso.c b/arch/arm64/kernel/vdso.c
> > > index b0aec4e8c9b4..df4bb736d28a 100644
> > > --- a/arch/arm64/kernel/vdso.c
> > > +++ b/arch/arm64/kernel/vdso.c
> > > @@ -125,6 +125,38 @@ static int __vdso_init(enum vdso_abi abi)
> > > return 0;
> > > }
> > >
> > > +#ifdef CONFIG_TIME_NS
> > > +/*
> > > + * The vvar page layout depends on whether a task belongs to the root or
> > > + * non-root time namespace. Whenever a task changes its namespace, the VVAR
> > > + * page tables are cleared and then they will be re-faulted with a
> > > + * corresponding layout.
> > > + * See also the comment near timens_setup_vdso_data() for details.
> > > + */
> > > +int vdso_join_timens(struct task_struct *task, struct time_namespace *ns)
> > > +{
> > > + struct mm_struct *mm = task->mm;
> > > + struct vm_area_struct *vma;
> > > +
> > > + if (mmap_write_lock_killable(mm))
> > > + return -EINTR;
> >
> > Hey,
> >
> > Just a heads-up I'm about to plumb CLONE_NEWTIME support into setns()
>
> Hmm. I am not sure that I understand what you mean. I think setns(nsfd,
> CLONE_NEWTIME) works now. For example, we use it in
> tools/testing/selftests/timens/timens.c. Do you mean setns(pidfd,
> CLONE_NEWTIME | CLONE_something)?

Indeed, I'm talking about setns(pidfd, CLONE_NEWUSER | CLONE_NEWNS |
CLONE_NEWTIME). But more generally, the setns infrastructure has been
reworked. Ideally, each namespace's install handler (e.g.
timens_install()) only performs permission checks and installs the
namespace into the new struct nsset that is passed in (introduced this
cycle), without making any task-visible changes yet. The task-visible
part is then done by a separate routine that cannot fail, which is
called from
static void commit_nsset(struct nsset *nsset)
in kernel/nsproxy.c.

>
> > which would mean that vdso_join_timens() would not be allowed to fail
> > anymore to make it easy to switch to multiple namespaces atomically. So
> > this would probably need to be changed to mmap_write_lock() which I've
> > already brought up upstream:
> > https://lore.kernel.org/lkml/20200611110221.pgd3r5qkjrjmfqa2@wittgenstein/
> > (Assuming that people agree. I just sent the series and most people here
> > are Cced.)
> >
> > Thanks!
> > Christian
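
The split Christian describes, where install handlers may fail but only stage their result and a commit step that cannot fail makes the change visible, can be sketched as a userspace model. Everything here is an illustrative assumption: the `toy_*` names, the integer namespace ids and the `single_threaded` flag are hypothetical and do not mirror the real struct nsset or commit_nsset().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum { TOY_NS_USER, TOY_NS_MNT, TOY_NS_TIME, TOY_NS_COUNT };

/* Hypothetical task: one integer id per namespace kind. */
struct toy_task {
	int ns[TOY_NS_COUNT];
	int single_threaded;
};

/* Hypothetical staging area, playing the role of the nsset. */
struct toy_nsset {
	int ns[TOY_NS_COUNT];
	unsigned int mask;	/* which entries have been staged */
};

/* Phase 1: may fail, but only ever writes to the staging area. */
static int toy_install(struct toy_nsset *set, const struct toy_task *t,
		       int kind, int target)
{
	if (kind == TOY_NS_TIME && !t->single_threaded)
		return -EUSERS;
	set->ns[kind] = target;
	set->mask |= 1u << kind;
	return 0;
}

/* Phase 2: returns void because it has no failure path. */
static void toy_commit(struct toy_task *t, const struct toy_nsset *set)
{
	for (int k = 0; k < TOY_NS_COUNT; k++)
		if (set->mask & (1u << k))
			t->ns[k] = set->ns[k];
}

/* Switch several namespaces at once: either all of them or none. */
static int toy_setns_multi(struct toy_task *t, const int *kinds,
			   const int *targets, size_t n)
{
	struct toy_nsset set = { { 0 }, 0 };

	for (size_t i = 0; i < n; i++) {
		int err = toy_install(&set, t, kinds[i], targets[i]);

		if (err)
			return err;	/* nothing task-visible has changed yet */
	}
	toy_commit(t, &set);
	return 0;
}

/* Small self-check; call it from main() to exercise the model. */
static void toy_demo(void)
{
	struct toy_task t = { { 1, 1, 1 }, 0 };
	int kinds[] = { TOY_NS_USER, TOY_NS_MNT, TOY_NS_TIME };
	int targets[] = { 7, 8, 9 };

	/* The time namespace install fails, so the task is left untouched. */
	assert(toy_setns_multi(&t, kinds, targets, 3) == -EUSERS);
	assert(t.ns[TOY_NS_USER] == 1 && t.ns[TOY_NS_MNT] == 1);

	t.single_threaded = 1;
	assert(toy_setns_multi(&t, kinds, targets, 3) == 0);
	assert(t.ns[TOY_NS_TIME] == 9);
}
```

In this model a step that can fail inside toy_commit(), such as a killable lock that may return -EINTR, would break the all-or-nothing property, which is the motivation given above for switching to mmap_write_lock().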