Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20;
Date:   Tue, 28 Mar 2023 14:42:54 -0700
From:   Beau Belgrave <beaub@linux.microsoft.com>
To:     Steven Rostedt <rostedt@goodmis.org>
Cc:     mhiramat@kernel.org, mathieu.desnoyers@efficios.com,
        dcook@linux.microsoft.com, alanau@linux.microsoft.com,
        brauner@kernel.org, akpm@linux-foundation.org,
        ebiederm@xmission.com, keescook@chromium.org, tglx@linutronix.de,
        linux-trace-devel@vger.kernel.org, linux-kernel@vger.kernel.org,
        linux-mm@kvack.org
Subject: Re: [PATCH v8 04/11] tracing/user_events: Fixup enable faults asyncly
Message-ID: <20230328214254.GA85@W11-BEAU-MD.localdomain>
References: <20230221211143.574-1-beaub@linux.microsoft.com>
 <20230221211143.574-5-beaub@linux.microsoft.com>
 <20230328172049.10061257@gandalf.local.home>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20230328172049.10061257@gandalf.local.home>
Precedence: bulk

On Tue, Mar 28, 2023 at 05:20:49PM -0400, Steven Rostedt wrote:
> On Tue, 21 Feb 2023 13:11:36 -0800
> Beau Belgrave <beaub@linux.microsoft.com> wrote:
> 
> > @@ -263,7 +277,85 @@ static int user_event_mm_fault_in(struct user_event_mm *mm, unsigned long uaddr)
> >  }
> >  
> >  static int user_event_enabler_write(struct user_event_mm *mm,
> > -				    struct user_event_enabler *enabler)
> > +				    struct user_event_enabler *enabler,
> > +				    bool fixup_fault);
> > +
> > +static void user_event_enabler_fault_fixup(struct work_struct *work)
> > +{
> > +	struct user_event_enabler_fault *fault = container_of(
> > +		work, struct user_event_enabler_fault, work);
> > +	struct user_event_enabler *enabler = fault->enabler;
> > +	struct user_event_mm *mm = fault->mm;
> > +	unsigned long uaddr = enabler->addr;
> > +	int ret;
> > +
> > +	ret = user_event_mm_fault_in(mm, uaddr);
> > +
> > +	if (ret && ret != -ENOENT) {
> > +		struct user_event *user = enabler->event;
> > +
> > +		pr_warn("user_events: Fault for mm: 0x%pK @ 0x%llx event: %s\n",
> > +			mm->mm, (unsigned long long)uaddr, EVENT_NAME(user));
> > +	}
> > +
> > +	/* Prevent state changes from racing */
> > +	mutex_lock(&event_mutex);
> > +
> > +	/*
> > +	 * If we managed to get the page, re-issue the write. We do not
> > +	 * want to get into a possible infinite loop, which is why we only
> > +	 * attempt again directly if the page came in. If we couldn't get
> > +	 * the page here, then we will try again the next time the event is
> > +	 * enabled/disabled.
> > +	 */
> 
> What case would we not get the page? A bad page mapping? User space doing
> something silly?
> 

A user space program unmapping the page is the most common I can think
of. A silly action would be unmapping the page while forgetting to call
the unregister IOCTL. We would then possibly see this if the event was
enabled in perf/ftrace before the process exited (and the mm getting
cleaned up).

> Or something else, for which how can it go into an infinite loop? Can that
> only happen if userspace is doing something mischievous?
> 

I'm not sure if changing page permissions on the user side would prevent
write permitted mapping in the kernel, but I wanted to ensure if that
type of thing did occur, we wouldn't loop forever. The code lets the mm
decide if a page is ever coming in via fixup_user_fault() with 
FAULT_FLAG_WRITE | FAULT_FLAG_REMOTE set.

My understanding is that fixup_user_fault() will retry to get the page
up until it's decided it's either capable of coming in or not. It will
do this since we pass the unlocked bool as a parameter. I used
fixup_user_fault() since it was created for the futex code to handle
this scenario better.

From what I gather, the fault in should only fail for these reasons:
#define VM_FAULT_ERROR (VM_FAULT_OOM | VM_FAULT_SIGBUS |	\
			VM_FAULT_SIGSEGV | VM_FAULT_HWPOISON |	\
			VM_FAULT_HWPOISON_LARGE | VM_FAULT_FALLBACK)

If these are hit, I don't believe we want to retry as they aren't likely
to ever get corrected.

Thanks,
-Beau

> -- Steve
> 
> 
> > +	clear_bit(ENABLE_VAL_FAULTING_BIT, ENABLE_BITOPS(enabler));
> > +
> > +	if (!ret) {
> > +		mmap_read_lock(mm->mm);
> > +		user_event_enabler_write(mm, enabler, true);
> > +		mmap_read_unlock(mm->mm);
> > +	}
> > +
> > +	mutex_unlock(&event_mutex);
> > +
> > +	/* In all cases we no longer need the mm or fault */
> > +	user_event_mm_put(mm);
> > +	kmem_cache_free(fault_cache, fault);
> > +}
> > +
> > +static bool user_event_enabler_queue_fault(struct user_event_mm *mm,
> > +					   struct user_event_enabler *enabler)
> > +{
> > +	struct user_event_enabler_fault *fault;
> > +
> > +	fault = kmem_cache_zalloc(fault_cache, GFP_NOWAIT | __GFP_NOWARN);
> > +
> > +	if (!fault)
> > +		return false;
> > +
> > +	INIT_WORK(&fault->work, user_event_enabler_fault_fixup);
> > +	fault->mm = user_event_mm_get(mm);
> > +	fault->enabler = enabler;
> > +
> > +	/* Don't try to queue in again while we have a pending fault */
> > +	set_bit(ENABLE_VAL_FAULTING_BIT, ENABLE_BITOPS(enabler));
> > +
> > +	if (!schedule_work(&fault->work)) {
> > +		/* Allow another attempt later */
> > +		clear_bit(ENABLE_VAL_FAULTING_BIT, ENABLE_BITOPS(enabler));
> > +
> > +		user_event_mm_put(mm);
> > +		kmem_cache_free(fault_cache, fault);
> > +
> > +		return false;
> > +	}
> > +
> > +	return true;
> > +}
> > +