Received: by 2002:ac0:a582:0:0:0:0:0 with SMTP id m2-v6csp2726976imm; Thu, 18 Oct 2018 21:30:33 -0700 (PDT) X-Google-Smtp-Source: ACcGV638CCRnpf2AxpZgpQDRC1dwBBiXWg+faEKv5buzvlNxdXHxxrxh0ZTT2OTtpqVEbYSGtb8Z X-Received: by 2002:a63:f922:: with SMTP id h34-v6mr31080563pgi.154.1539923433795; Thu, 18 Oct 2018 21:30:33 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1539923433; cv=none; d=google.com; s=arc-20160816; b=tqsxJee3VFCDnMTRPwK5wyxfR0gihZWWQveoe05mh1mXanIpXedIfk3RswsBKRDD41 FeimWOMXEGCdJ94zVX4Cs/u2nDo3cwg2PF4JcSZOu6F2NkUfflvMn+BbD7/3LJW5Rv7K T1AoOwRfeOBoteoFutJ6qv8BrL7ePzChCpHZTSaq7jQx/6+Ebo56hWgQxsFrw0kqdE2S v8kBlJTHZSOh6JK4eHD41hy1FusY7qAQtWfI+4M5YPh0sJ+eqLO0Vt0XgHPqU6JvuK/R s5oxsC2pG/aEEyRAyyIHJtofAR8MTAAmzYGdvf5UKgiK7q4Q/bisO9nqBT1ZwBASpyAy b1Rw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:content-transfer-encoding:cc:to:subject :message-id:date:from:in-reply-to:references:mime-version :dkim-signature; bh=7wuEtBPRUCjbK0AzhMft1Uu3vwmUV01CiT6kvM9rDHI=; b=t23i0k4Vbyiqfhim39IJVpfuQvCSaum5wgUTGsCj/lBovmWPykKGwOZvxkEFZ8BmEn htUPKi3hJFosVC5NRuvt3Cq1yW2nsOSgbv26VXSH2vmQlZ0cAl92N+itTdCFQP8quqSi ogyj837ACBwvAfdgz0GgLs3IH535AI7LAEcCPwcmgoTC0ImSr17ihrVCA++S0UB7bgYu P6YwfcKORkuRM32UzARBGCFgfobW5dVwgR63DxUaC+rRQAPZQjU1zTSeWV/Tq3WHA/hK xtwVQOtEdGgJ6bFF728Tn6U3Gp6TWzSlSII1MmVMWi8Xlb5Dx9UStsyYO5r50o0X3mCo 9r6Q== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@kernel.org header.s=default header.b=sjPbEARM; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id s16-v6si23846821pfk.62.2018.10.18.21.30.15; Thu, 18 Oct 2018 21:30:33 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@kernel.org header.s=default header.b=sjPbEARM; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726837AbeJSMeL (ORCPT + 99 others); Fri, 19 Oct 2018 08:34:11 -0400 Received: from mail.kernel.org ([198.145.29.99]:57282 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726424AbeJSMeL (ORCPT ); Fri, 19 Oct 2018 08:34:11 -0400 Received: from mail-wm1-f50.google.com (mail-wm1-f50.google.com [209.85.128.50]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPSA id 9E73821470 for ; Fri, 19 Oct 2018 04:29:52 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1539923392; bh=TUZvn65JKwqb6+T3EiH6Xe3tGdkOvIYBv3I4QnDINBc=; h=References:In-Reply-To:From:Date:Subject:To:Cc:From; b=sjPbEARMIH022gSOFG5VOJl//YBan62B8fFFpOpAyT3RdcEaIh+Og2ne5yboNXaYg SIAPS/zBf7dwA8wY+Pvd7O3zz0eoHi11Vug0vIBggGtS5N35UNuwXzfsvxKFqsp0fy zSZQ5SbuiA6TEde8cKzhktK4Sg8GjTWkiPqD/yL8= Received: by mail-wm1-f50.google.com with SMTP id 206-v6so2253523wmb.5 for ; Thu, 18 Oct 2018 21:29:52 -0700 (PDT) X-Gm-Message-State: ABuFfohnIR5hm8wuNRBUJMn/rif0osck2ie57UyiQk9Xyj9S3uNHSq1O bIdgqxXwu/mILT+nzthSR1HUZblhAdsB2JVotgSYhA== X-Received: by 2002:a1c:1fcd:: with SMTP id f196-v6mr3025461wmf.19.1539923391063; Thu, 18 Oct 2018 21:29:51 -0700 (PDT) MIME-Version: 1.0 References: <20181018005420.82993-1-namit@vmware.com> <20181018005420.82993-2-namit@vmware.com> <07255D2B-0243-4254-B62A-37050C44207E@vmware.com> <925F22EA-F8CB-4194-B96B-378409ED7918@vmware.com> <2626124E-7344-42F3-AD07-0BB34D62A9EE@amacapital.net> <6F1FD9DA-5E86-42A2-8EAF-05F5D70FE2EF@vmware.com> In-Reply-To: <6F1FD9DA-5E86-42A2-8EAF-05F5D70FE2EF@vmware.com> From: Andy Lutomirski Date: Thu, 18 Oct 2018 21:29:39 -0700 X-Gmail-Original-Message-ID: Message-ID: Subject: Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix To: Nadav Amit , Alexei Starovoitov , Oleg Nesterov Cc: Ingo Molnar , Andrew Lutomirski , Peter Zijlstra , "H. Peter Anvin" , Thomas Gleixner , LKML , X86 ML , Borislav Petkov , "Woodhouse, David" Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org > On Oct 18, 2018, at 6:08 PM, Nadav Amit wrote: > > at 10:00 AM, Andy Lutomirski wrote: > >> >> >>> On Oct 18, 2018, at 9:47 AM, Nadav Amit wrote: >>> >>> at 8:51 PM, Andy Lutomirski wrote: >>> >>>>> On Wed, Oct 17, 2018 at 8:12 PM Nadav Amit wrote: >>>>> at 6:22 PM, Andy Lutomirski wrote: >>>>> >>>>>>> On Oct 17, 2018, at 5:54 PM, Nadav Amit wrote: >>>>>>> >>>>>>> It is sometimes beneficial to prevent preemption for very few >>>>>>> instructions, or prevent preemption for some instructions that prec= ede >>>>>>> a branch (this latter case will be introduced in the next patches). >>>>>>> >>>>>>> To provide such functionality on x86-64, we use an empty REX-prefix >>>>>>> (opcode 0x40) as an indication that preemption is disabled for the >>>>>>> following instruction. >>>>>> >>>>>> Nifty! >>>>>> >>>>>> That being said, I think you have a few bugs. First, you can=E2=80= =99t just ignore >>>>>> a rescheduling interrupt, as you introduce unbounded latency when th= is >>>>>> happens =E2=80=94 you=E2=80=99re effectively emulating preempt_enabl= e_no_resched(), which >>>>>> is not a drop-in replacement for preempt_enable(). To fix this, you = may >>>>>> need to jump to a slow-path trampoline that calls schedule() at the = end or >>>>>> consider rewinding one instruction instead. Or use TF, which is only= a >>>>>> little bit terrifying=E2=80=A6 >>>>> >>>>> Yes, I didn=E2=80=99t pay enough attention here. For my use-case, I t= hink that the >>>>> easiest solution would be to make synchronize_sched() ignore preempti= ons >>>>> that happen while the prefix is detected. It would slightly change th= e >>>>> meaning of the prefix. >>> >>> So thinking about it further, rewinding the instruction seems the easie= st >>> and most robust solution. I=E2=80=99ll do it. >>> >>>>>> You also aren=E2=80=99t accounting for the case where you get an exc= eption that >>>>>> is, in turn, preempted. >>>>> >>>>> Hmm.. Can you give me an example for such an exception in my use-case= ? I >>>>> cannot think of an exception that might be preempted (assuming #BP, #= MC >>>>> cannot be preempted). >>>> >>>> Look for cond_local_irq_enable(). >>> >>> I looked at it. Yet, I still don=E2=80=99t see how exceptions might hap= pen in my >>> use-case, but having said that - this can be fixed too. >> >> I=E2=80=99m not totally certain there=E2=80=99s a case that matters. Bu= t it=E2=80=99s worth checking > > I am still checking. But, I wanted to ask you whether the existing code i= s > correct, since it seems to me that others do the same mistake I did, unle= ss > I don=E2=80=99t understand the code. > > Consider for example do_int3(), and see my inlined comments: > > dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code) > { > ... > ist_enter(regs); // =3D> preempt_disable() > cond_local_irq_enable(regs); // =3D> assume it enables IRQs > > ... > // resched irq can be delivered here. It will not caused rescheduling > // since preemption is disabled > > cond_local_irq_disable(regs); // =3D> assume it disables IRQs > ist_exit(regs); // =3D> preempt_enable_no_resched() > } > > At this point resched will not happen for unbounded length of time (unles= s > there is another point when exiting the trap handler that checks if > preemption should take place). I think it's only a bug in the cases where someone uses extable to fix up an int3 (which would be nuts) or that we oops. But I should still fix it. In the normal case where int3 was in user code, we'll miss the reschedule in do_trap(), but we'll reschedule in prepare_exit_to_usermode() -> exit_to_usermode_loop(). > > Another example is __BPF_PROG_RUN_ARRAY(), which also uses > preempt_enable_no_resched(). Alexei, I think this code is just wrong. Do you know why it uses preempt_enable_no_resched()? Oleg, the code in kernel/signal.c: preempt_disable(); read_unlock(&tasklist_lock); preempt_enable_no_resched(); freezable_schedule(); looks bogus. I don't get what it's trying to achieve with preempt_disable(), and I also don't see why no_resched does anything. Sure, it prevents a reschedule triggered during read_unlock() from causing a reschedule, but it doesn't prevent an interrupt immediately after the preempt_enable_no_resched() call from scheduling. --Andy > > Am I missing something? > > Thanks, > Nadav