Date: Mon, 7 Mar 2022 16:37:51 -0800
From: "Luck, Tony"
To: Thomas Gleixner
Cc: Fenghua Yu, x86@kernel.org, linux-kernel@vger.kernel.org,
	patches@lists.linux.dev
Subject: Re: [PATCH] x86/split_lock: Make life miserable for split lockers
References: <20220217012721.9694-1-tony.luck@intel.com>
	<877d95l7jo.ffs@tglx>
In-Reply-To: <877d95l7jo.ffs@tglx>

On Mon, Mar 07, 2022 at 11:30:35PM +0100, Thomas Gleixner wrote:
> Tony,
>
> On Wed, Feb 16 2022 at 17:27, Tony Luck wrote:
> > Questions for this RFC:
> >
> > 1) Does this need to be a new option? Maybe just update the
> >    existing "warn" mode to add this level of extra pain.
>
> That's fine. Warn is the default today, right?

Yes. Warn is the current default. Does "That's fine" mean it is OK to
change the existing warn code to add this level of pain? Or OK to add
a new option?
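
If it does end up as a new option, the wiring should be small. A rough
sketch, assuming the existing sld_options[] table in intel.c plus the
sld_sequential state from this RFC (untested, names not final):

	static const struct {
		const char			*option;
		enum split_lock_detect_state	state;
	} sld_options[] __initconst = {
		{ "off",	sld_off        },
		{ "warn",	sld_warn       },
		{ "fatal",	sld_fatal      },
		{ "sequential",	sld_sequential },	/* new */
	};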

> > 2) Under what circumstances will a work function scheduled with
> >    schedule_delayed_work() run on a different CPU?
>
> Under many...
>
> > I've covered the obvious case of the CPU being taken offline before
> > the work is run. But are there other cases?
>
> schedule_delayed_work_on() is what you are looking for.

That sounds like the right choice ... I just didn't dig deep enough
into the options available.

> > 3) Should I add even more pain with an msleep() before even trying
> >    to get the semaphore?
>
> No objections from me.

Will do in next version.
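
Roughly like this at the top of the sld_sequential case (untested
sketch; the 10 msec value is just a placeholder, not a tuned number):

	case sld_sequential:
		/*
		 * Make the offending task pay up front, before it can
		 * even contend for the semaphore.
		 */
		msleep(10);
		if (down_interruptible(&buslock_sem) == -EINTR)
			return;
		...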

> > +static void __split_lock_reenable(struct work_struct *work)
> > +{
> > +	sld_update_msr(true);
> > +	up(&buslock_sem);
> > +}
> > +
> > +/*
> > + * If a CPU goes offline with pending delayed work to
> > + * re-enable split lock detection then the delayed work
> > + * will be executed on some other CPU. That handles releasing
> > + * the buslock_sem, but because it executes on a different
> > + * CPU probably won't re-enable split lock detection. This
> > + * is a problem on HT systems since the sibling CPU on the
> > + * same core may then be left running with split lock
> > + * detection disabled.
> > + *
> > + * Unconditionally re-enable detection here.
>
> Had to think twice whether this works under all circumstances. It
> actually works because of how CPU hotunplug works nowadays. It
> guarantees that after the initial CPU down state sched_cpu_deactivate()
> no task which is running or affine to the CPU can get back to user space
> on that CPU. That was not always the case, that's why I had to think
> twice :)
>
> But I'm not yet convinced that this is required at all.
>
> > + */
> > +static int splitlock_cpu_offline(unsigned int cpu)
> > +{
> > +	sld_update_msr(true);
> > +
> > +	return 0;
> > +}
> > +
> > +static DECLARE_DELAYED_WORK(split_lock_reenable, __split_lock_reenable);
> > +
> >  static void split_lock_warn(unsigned long ip)
> >  {
> >  	pr_warn_ratelimited("#AC: %s/%d took a split_lock trap at address: 0x%lx\n",
> >  			    current->comm, current->pid, ip);
> >
> > -	/*
> > -	 * Disable the split lock detection for this task so it can make
> > -	 * progress and set TIF_SLD so the detection is re-enabled via
> > -	 * switch_to_sld() when the task is scheduled out.
> > -	 */
> > +	switch (sld_state) {
> > +	case sld_warn:
> > +		/* This task will keep running with split lock disabled */
> > +		set_tsk_thread_flag(current, TIF_SLD);
> > +		break;
> > +	case sld_sequential:
> > +		/* Only allow one buslocked disabled core at a time */
> > +		if (down_interruptible(&buslock_sem) == -EINTR)
> > +			return;
> > +		schedule_delayed_work(&split_lock_reenable, 2);
>
> Hmm. This does not set TIF_SLD. So:
>
>     task hits splitlock
>       #AC
>         down(sema);
>         schedule_work();
>         disable_sld();
>
>     task is preempted or schedules out voluntarily
>
>     -> SLD stays disabled for the incoming task which is wrong and it
>        stays disabled up to the point where the timer fires or a task
>        switch with TIF_SLD mismatch happens.
>
> Not what we want, right?

It isn't ideal. We just gave a free pass to some other tasks to do
split locks for up to two jiffies. But they would have been given those
two jiffies later had they taken an #AC trap ... so they just bypassed
the queue to get what we would have given them later.

> So the right thing to do is to set TIF_SLD also for the sequential
> case. Now how to do that delayed split lock reenable for the task in
> question?
>
>     case sld_sequential:
>             if (down_interruptible(&buslock_sem) == -EINTR)
>                     return;
>             set_tsk_thread_flag(current, TIF_SLD);
>             buslock_sequential_task = current;
>             get_task_struct(current);
>             schedule_delayed_work_on(smp_processor_id(), &split_lock_reenable, 2);
>
> and then the work function does:
>
>     clear_tsk_thread_flag(buslock_sequential_task, TIF_SLD);
>     put_task_struct(buslock_sequential_task);
>     buslock_sequential_task = NULL;
>     up(&buslock_sem);
>
> With that you spare the cpu hotplug callback as well simply because it's
> guaranteed that the SLD state is handled correctly when the task in
> question schedules out. I.e. it does not matter at all on which CPU the
> timer goes off if the CPU on which it was armed is offlined before it
> fires.
>
> But that's nasty too because if the task schedules away from the CPU on
> which it hit the buslock in the first place and then stays on the other
> CPU in user space forever (think NOHZ_FULL) then it can buslock forever
> too.

Agreed. Trying to get this "perfect" has many ugly corner cases.

> The question is whether this is something to worry about. If so, then we
> need to go back to the drawing board.

I don't think it is worth worrying about. The case you describe is a
process that is about to be preempted when the #AC trap happens. In
that case this CPU (in fact both HT threads on this core) get two
jiffies of free split locks.

Cases from here:

1) The original process gets to run on either of these threads before
   the timeout. It gets to execute its split lock and carry on running.

2) The process is scheduled on a different core during the two-jiffy
   window. It takes an #AC trap and blocks on the semaphore until the
   original core releases it. Then it gets its chance to run on this
   new core.

3) The original process doesn't get rescheduled for two jiffies, then
   runs somewhere. The original core has released the semaphore and
   re-enabled split lock checking. So the process takes #AC, gets the
   semaphore, the kernel disables split lock checking ... and we try
   again.

Now it is possible that the process may repeatedly be preempted in
between getting the semaphore and actually getting all the way to user
space to split a lock ... but that can only happen if there are
multiple processes splitting locks. The goal of this patch is to be
mean to all of them. If we happen to be extra mean to some of them,
well, so be it.

-Tony
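
P.S. Pulling your TIF_SLD suggestion together with the msleep() idea,
the sequential path in the next version might look roughly like this.
Completely untested, and the lifetime/locking rules around
buslock_sequential_task (your name for it above) still need thought:

	static struct task_struct *buslock_sequential_task;

	static void __split_lock_reenable(struct work_struct *work)
	{
		/*
		 * Normally runs on the CPU that took the #AC (courtesy
		 * of schedule_delayed_work_on()), so this re-enables
		 * detection on the core that had it turned off.
		 */
		sld_update_msr(true);
		clear_tsk_thread_flag(buslock_sequential_task, TIF_SLD);
		put_task_struct(buslock_sequential_task);
		buslock_sequential_task = NULL;
		up(&buslock_sem);
	}

and in split_lock_warn():

	case sld_sequential:
		/* Extra pain: delay before even trying for the semaphore */
		msleep(10);
		if (down_interruptible(&buslock_sem) == -EINTR)
			return;
		/* Re-enable detection when this task schedules out */
		set_tsk_thread_flag(current, TIF_SLD);
		buslock_sequential_task = current;
		get_task_struct(current);
		/* Arm the re-enable timer on the CPU that hit the #AC */
		schedule_delayed_work_on(smp_processor_id(),
					 &split_lock_reenable, 2);
		break;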