From: Dmitry Safonov
To: dima@arista.com
Cc: 0x7f454c46@gmail.com, adrian@lisas.de, arnd@arndb.de, avagin@gmail.com,
    avagin@openvz.org, christian.brauner@ubuntu.com,
    containers@lists.linux-foundation.org, criu@openvz.org,
    ebiederm@xmission.com, gorcunov@openvz.org, hpa@zytor.com,
    jannh@google.com, jdike@addtoit.com, linux-api@vger.kernel.org,
    linux-kernel@vger.kernel.org, luto@kernel.org, mingo@redhat.com,
    oleg@redhat.com, shuah@kernel.org, tglx@linutronix.de,
    vincenzo.frascino@arm.com, x86@kernel.org, xemul@virtuozzo.com
Subject: [PATCHv6 25/37] x86/vdso: Switch image on setns()/clone()
Date: Wed, 7 Aug 2019 01:27:28 +0100
Message-Id: <20190807002728.19743-1-dima@arista.com>
X-Mailer: git-send-email 2.22.0
In-Reply-To: <20190729215758.28405-26-dima@arista.com>
References: <20190729215758.28405-26-dima@arista.com>

As discussed on the timens RFC, adding a new conditional branch
`if (inside_time_ns)` to the VDSO for all processes is undesirable.
It would add a penalty for everybody, as the branch predictor may
mispredict the jump. Instruction cache lines would also be wasted on
cmp/jmp. Those effects of introducing the time namespace are very much
unwanted, bearing in mind how much work has been spent on
micro-optimising the vdso code.

To address those problems, there are two versions of the VDSO's .so:
one for host tasks (without any penalty) and one for processes inside
a time namespace, with clk_to_ns() that subtracts offsets from the
host's time.

Whenever a user does setns() or unshare(CLONE_TIMENS) followed by
clone(), change the VDSO image in mm and zap the VVAR/VDSO page tables.
They will be re-faulted with the corresponding image and VVAR offsets.

Co-developed-by: Andrei Vagin
Signed-off-by: Andrei Vagin
Signed-off-by: Dmitry Safonov
---
v5..v6 Change: Rebased over current_is_single_threaded() change
in the first patch.
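Illustration only, not part of the patch: a userspace sketch of the
"two variants, chosen once" idea from the description above. A function
pointer stands in for the mapped image here, which is only an analogy;
the real VDSO swaps the mapped code and VVAR data instead. Every demo_*
name is hypothetical and does not exist in the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-namespace offsets kept in VVAR. */
struct demo_timens_offsets {
	int64_t monotonic_ns;
};

/* Hypothetical "image": one entry point plus the data it reads. */
struct demo_vdso_image {
	int64_t (*clock_ns)(const struct demo_vdso_image *img, int64_t host_ns);
	struct demo_timens_offsets offsets;
};

/* Host variant: no namespace check and no offset arithmetic. */
static int64_t demo_host_clock_ns(const struct demo_vdso_image *img,
				  int64_t host_ns)
{
	(void)img;
	return host_ns;
}

/* Namespace variant: subtracts the offset from the host's time. */
static int64_t demo_timens_clock_ns(const struct demo_vdso_image *img,
				    int64_t host_ns)
{
	return host_ns - img->offsets.monotonic_ns;
}

/* Every task starts with the host variant. */
static struct demo_vdso_image demo_host_image(void)
{
	struct demo_vdso_image img = { demo_host_clock_ns, { 0 } };

	return img;
}

/* The variant is switched once, when the task joins a namespace. */
static void demo_join_timens(struct demo_vdso_image *img, int64_t offset_ns)
{
	img->offsets.monotonic_ns = offset_ns;
	img->clock_ns = demo_timens_clock_ns;
}

/* The hot path never tests "am I inside a time namespace?". */
static int64_t demo_read_clock(const struct demo_vdso_image *img,
			       int64_t host_ns)
{
	assert(img != NULL && img->clock_ns != NULL);
	return img->clock_ns(img, host_ns);
}
```

The point of the sketch is only that the decision is made at join time,
so tasks that never enter a time namespace keep running the host variant
unchanged.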
 arch/x86/entry/vdso/vma.c   | 23 +++++++++++++++++++++++
 arch/x86/include/asm/vdso.h |  1 +
 kernel/time_namespace.c     | 11 +++++++++++
 3 files changed, 35 insertions(+)

diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
index 8a8211fd4cfc..91cf5a5c8c9e 100644
--- a/arch/x86/entry/vdso/vma.c
+++ b/arch/x86/entry/vdso/vma.c
@@ -25,6 +25,7 @@
 #include
 #include
 #include
+#include

 #if defined(CONFIG_X86_64)
 unsigned int __read_mostly vdso64_enabled = 1;
@@ -266,6 +267,28 @@ static const struct vm_special_mapping vvar_mapping = {
 	.mremap = vvar_mremap,
 };

+#ifdef CONFIG_TIME_NS
+int vdso_join_timens(struct task_struct *task)
+{
+	struct mm_struct *mm = task->mm;
+	struct vm_area_struct *vma;
+
+	if (down_write_killable(&mm->mmap_sem))
+		return -EINTR;
+
+	for (vma = mm->mmap; vma; vma = vma->vm_next) {
+		unsigned long size = vma->vm_end - vma->vm_start;
+
+		if (vma_is_special_mapping(vma, &vvar_mapping) ||
+		    vma_is_special_mapping(vma, &vdso_mapping))
+			zap_page_range(vma, vma->vm_start, size);
+	}
+
+	up_write(&mm->mmap_sem);
+	return 0;
+}
+#endif
+
 /*
  * Add vdso and vvar mappings to current process.
  * @image - blob to map
diff --git a/arch/x86/include/asm/vdso.h b/arch/x86/include/asm/vdso.h
index 03f468c63a24..ccf89dedd04f 100644
--- a/arch/x86/include/asm/vdso.h
+++ b/arch/x86/include/asm/vdso.h
@@ -45,6 +45,7 @@ extern struct vdso_image vdso_image_32;
 extern void __init init_vdso_image(struct vdso_image *image);

 extern int map_vdso_once(const struct vdso_image *image, unsigned long addr);
+extern int vdso_join_timens(struct task_struct *task);

 #endif /* __ASSEMBLER__ */

diff --git a/kernel/time_namespace.c b/kernel/time_namespace.c
index cdfa1b75bd0d..2e7e0af44f04 100644
--- a/kernel/time_namespace.c
+++ b/kernel/time_namespace.c
@@ -15,6 +15,7 @@
 #include
 #include
 #include
+#include

 ktime_t do_timens_ktime_to_host(clockid_t clockid, ktime_t tim,
 				struct timens_offsets *ns_offsets)
@@ -199,6 +200,7 @@ static void timens_put(struct ns_common *ns)
 static int timens_install(struct nsproxy *nsproxy, struct ns_common *new)
 {
 	struct time_namespace *ns = to_time_ns(new);
+	int ret;

 	if (!current_is_single_threaded())
 		return -EUSERS;
@@ -207,6 +209,10 @@ static int timens_install(struct nsproxy *nsproxy, struct ns_common *new)
 	    !ns_capable(current_user_ns(), CAP_SYS_ADMIN))
 		return -EPERM;

+	ret = vdso_join_timens(current);
+	if (ret)
+		return ret;
+
 	get_time_ns(ns);
 	get_time_ns(ns);
 	put_time_ns(nsproxy->time_ns);
@@ -221,10 +227,15 @@ int timens_on_fork(struct nsproxy *nsproxy, struct task_struct *tsk)
 {
 	struct ns_common *nsc = &nsproxy->time_ns_for_children->ns;
 	struct time_namespace *ns = to_time_ns(nsc);
+	int ret;

 	if (nsproxy->time_ns == nsproxy->time_ns_for_children)
 		return 0;

+	ret = vdso_join_timens(tsk);
+	if (ret)
+		return ret;
+
 	get_time_ns(ns);
 	put_time_ns(nsproxy->time_ns);
 	nsproxy->time_ns = ns;
-- 
2.22.0
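
Illustration only, not part of the patch: the "zap, then re-fault"
behaviour that vdso_join_timens() relies on has a rough userspace
analogue on Linux. madvise(MADV_DONTNEED) on a private anonymous
mapping drops the populated pages, and the next access faults in fresh
zero-filled ones. The sketch below assumes Linux semantics for
MADV_DONTNEED; demo_zap_and_refault() is a hypothetical helper. The
kernel path differs in that the re-fault is served by the special
mapping's fault handler rather than by zero-fill.

```c
#define _DEFAULT_SOURCE		/* for MAP_ANONYMOUS and madvise() */
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Returns 0 if the page was repopulated by a fresh fault after being
 * dropped, 1 if the old contents survived, and -1 on a setup error.
 */
static int demo_zap_and_refault(void)
{
	long page = sysconf(_SC_PAGESIZE);
	unsigned char *p;
	int stale;

	if (page <= 0)
		return -1;

	p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	/* Populate the page so there is something to drop. */
	memset(p, 0xAA, (size_t)page);

	/* Drop the page; the mapping itself stays in place. */
	if (madvise(p, (size_t)page, MADV_DONTNEED)) {
		munmap(p, (size_t)page);
		return -1;
	}

	/* This read faults the page back in from its backing (zero-fill). */
	stale = (p[0] == 0xAA);

	munmap(p, (size_t)page);
	return stale;
}
```

As in the patch, the VMA is left untouched and only the page-table
contents are discarded, so the next access decides what gets mapped.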