2017-11-03 17:08:01

by Chris Metcalf

Subject: [PATCH v16 00/13] support "task_isolation" mode

Here, finally, is a new spin of the task isolation work (v16), with
changes based on the issues that were raised at last year's Linux
Plumbers Conference and in the email discussion that followed.

This version of the patch series cleans up a number of areas that were
a little dodgy in the previous patch series.

- It no longer loops in the final code that prepares to return to
userspace; instead, it sets things up in the prctl() and then
validates when preparing to return to userspace, adjusting the
syscall return value to -EAGAIN at that point if something doesn't
line up quite correctly (see the userspace sketch after this list).

- We no longer support the NOSIG mode that let you freely call into
the kernel multiple times while in task isolation. This was always
a little odd, since you really should be in sufficient control of
task isolation code that you can explicitly stop isolation with a
"prctl(PR_TASK_ISOLATION, 0)" before using the kernel for anything
else. It also made the semantics of migrating the task confusing.
More importantly, removing that support means that the only path
that sets up task isolation is the return from prctl(), which allows
us to make the simplification above.

- We no longer try to signal the task isolation process from a remote
core when we detect that we are about to violate its isolation.
Instead, we just print a message there (and optionally dump stack),
and rely on the eventual interrupt on the core itself to trigger the
signal. We are always in a safe context to generate a signal when
we enter the kernel, unlike when we are deep in a call stack sending
an SMP IPI or whatever.

- We notice the case of an unstable scheduler clock and return
EINVAL rather than spinning forever with EAGAIN (suggestion from
Francis Giraldeau).

- The prctl() call requires zeros for arg3/4/5 (suggestion from
Eugene Syromiatnikov).
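
To make the expected usage concrete, here is a minimal userspace
sketch of the enable/retry pattern described above. Error handling
beyond the retry is elided, and the fallback defines assume the
values from the uapi <linux/prctl.h> change in this series:

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/prctl.h>

#ifndef PR_TASK_ISOLATION
#define PR_TASK_ISOLATION 48
#define PR_TASK_ISOLATION_ENABLE (1 << 0)
#endif

int main(void)
{
        /*
         * Caller must already be affinitized to a single core set up
         * with nohz_full and isolcpus; otherwise the prctl() fails
         * with EINVAL rather than EAGAIN.
         */
        while (prctl(PR_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE,
                     0, 0, 0) != 0) {
                if (errno != EAGAIN) {
                        perror("prctl(PR_TASK_ISOLATION)");
                        exit(1);
                }
                /* EAGAIN: kernel could not quiesce yet; just retry. */
        }

        /* ... isolated work here: no syscalls, no page faults ... */

        prctl(PR_TASK_ISOLATION, 0, 0, 0, 0);   /* stop isolation */
        return 0;
}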

The kernel-internal isolation API is also now cleaner, and I have
included kerneldoc for all the interfaces, so it should be easier
to port to additional architectures; in fact, looking at
include/linux/isolation.h is a good place to start understanding the
overall patch set.

I removed Catalin's Reviewed-by for arm64, and Christoph's Tested-by
for x86, since this version is sufficiently different to merit
re-review and re-testing.

Note that this is not rebased on top of Frederic's recent housekeeping
patch series, although it is largely orthogonal right now. After
Frederic's patch series lands, task isolation is enabled with
"isolcpus=nohz,domain,CPUS". We could add some shorthand for that
("isolcpus=full,CPUS"?) or just use it as-is.

The previous (v15) patch series is here:

https://lkml.kernel.org/r/[email protected]

This patch series is available at:

git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git dataplane

Several folks raised good points at the LPC discussion and in the
email discussions that followed. Rather than trying to respond to
everyone in a flurry of emails, I'll answer some questions here:


Why not just instrument user_exit() to raise the isolation-lost signal?

Andy pointed me in this direction. The advantage is that you catch
*everything*, by definition. There is a hook that can do this in the
current patch set, but you have to #define DEBUG_TASK_ISOLATION
manually to take advantage of it, because as written it has two issues:

1. You can't actually exit the kernel with prctl(PR_TASK_ISOLATION,0)
because the user_exit hook kills you first.
2. You lose the ability to get much better diagnostics by waiting
until you are further into kernel entry and know what you're doing.

You could work around #2 in several ways, but #1 is harder. I looked
at x86 for a while, and although you could imagine making it work, you really
want to generate a lost-isolation signal on any syscall that isn't
that exact prctl(), and it's awkward to try to do all of that checking
before user_exit(). Since in any case we do want to have the more
specific probes at the various kernel entry points where we generate
the diagnostics, I felt like it wasn't the right approach to enable
as a compilation-time default. I'm open to discussion on this though!


Can't we do all the exit-to-userspace work with irqs disabled?

In fact, it turns out that you can do lru_add_drain() with irqs
disabled, so that's what we're doing in the patch series now.

However, it doesn't seem possible to do the synchronous cancellation of
the vmstat deferred work with irqs disabled, though if there's a way,
it would be a little cleaner to do that; Christoph? We can certainly
update the statistics with interrupts disabled via
refresh_cpu_vm_stats(false), but that's not sufficient. For now, I
just issue the cancellation during the sys_prctl() call, and then if it
isn't synchronized by the time we are exiting to userspace, we just
jam in an EAGAIN and let userspace retry. In practice, this doesn't
seem to ever happen.


What about using a per-cpu flag to stop doing new deferred work?

Andy also suggested we could structure the code to have the prctl()
set a per-cpu flag to stop adding new future work (e.g. vmstat per-cpu
data, or lru page cache). Then, we could flush those structures right
from the sys_prctl() call, and when we were returning to user space,
we'd be confident that there wasn't going to be any new work added.

With the current set of things that we are disabling for task
isolation, though, it didn't seem necessary. Quiescing the vmstat
shepherd seems like it is generally pretty safe since we will likely
be able to sync up the per-cpu cache and kill the deferred work with
high probability, with no expectation that additional work will show
up. And since we can flush the LRU page cache with interrupts
disabled, that turns out not to be an issue either.

I could imagine that if we have to deal with some new kind of deferred
work, we might find the per-cpu flag becomes a good solution, but for
now we don't have a good use case for that approach.


How about stopping the dyn tick?

Right now we try to stop it on return to userspace, but if we can't,
we just return EAGAIN to userspace. In practice, what I see is that
usually the tick stops immediately, but occasionally it doesn't; in
this case I've always seen that nr_running is >1, presumably with some
temporary kernel worker threads, and the user code just needs to call
prctl() until those threads are done. We could structure things with
a completion that we wait for, which is set by the timer code when it
finally does stop the tick, but this may be overkill, particularly
since we'll only be running this prctl() loop from userspace on cores
where we have no other useful work that we're trying to run anyway.


What about TLB flushing?

We talked about this at Plumbers and some of the email discussion also
was about TLB flushing. I haven't tried to add it to this patch set,
because I really want to avoid scope creep; in any case, I think I
managed to convince Andy that he was going to work on it himself. :)
Paul McKenney already contributed some framework for such a patch, in
commit b8c17e6664c4 ("rcu: Maintain special bits at bottom of
->dynticks counter").


What about that d*mn 1 Hz clock?

It's still there, so this code still requires some further work before
it can actually get a process into long-term task isolation (without
the obvious one-line kernel hack). Frederic suggested a while ago
that forcing updates of the cpustats was the last remaining gating
factor; do we think that is still true? Christoph was working on
this at one point - any progress from your point of view?


Chris Metcalf (12):
vmstat: add quiet_vmstat_sync function
vmstat: add vmstat_idle function
Revert "sched/core: Drop the unused try_get_task_struct() helper
function"
Add try_get_task_struct_on_cpu() to scheduler for task isolation
Add try_stop_full_tick() API for NO_HZ_FULL
task_isolation: userspace hard isolation from kernel
Add task isolation hooks to arch-independent code
arch/x86: enable task isolation functionality
arch/arm64: enable task isolation functionality
arch/tile: enable task isolation functionality
arm, tile: turn off timer tick for oneshot_stopped state
task_isolation self test

Francis Giraldeau (1):
arch/arm: enable task isolation functionality

Documentation/admin-guide/kernel-parameters.txt | 6 +
arch/arm/Kconfig | 1 +
arch/arm/include/asm/thread_info.h | 10 +-
arch/arm/kernel/entry-common.S | 12 +-
arch/arm/kernel/ptrace.c | 10 +
arch/arm/kernel/signal.c | 10 +-
arch/arm/kernel/smp.c | 4 +
arch/arm/mm/fault.c | 8 +-
arch/arm64/Kconfig | 1 +
arch/arm64/include/asm/thread_info.h | 5 +-
arch/arm64/kernel/ptrace.c | 18 +-
arch/arm64/kernel/signal.c | 10 +-
arch/arm64/kernel/smp.c | 7 +
arch/arm64/mm/fault.c | 5 +
arch/tile/Kconfig | 1 +
arch/tile/include/asm/thread_info.h | 2 +
arch/tile/kernel/hardwall.c | 2 +
arch/tile/kernel/irq.c | 3 +
arch/tile/kernel/messaging.c | 4 +
arch/tile/kernel/process.c | 4 +
arch/tile/kernel/ptrace.c | 10 +
arch/tile/kernel/single_step.c | 7 +
arch/tile/kernel/smp.c | 21 +-
arch/tile/kernel/time.c | 3 +
arch/tile/kernel/unaligned.c | 4 +
arch/tile/mm/fault.c | 13 +-
arch/tile/mm/homecache.c | 11 +
arch/x86/Kconfig | 1 +
arch/x86/entry/common.c | 14 +
arch/x86/include/asm/apic.h | 3 +
arch/x86/include/asm/thread_info.h | 8 +-
arch/x86/kernel/smp.c | 2 +
arch/x86/kernel/traps.c | 3 +
arch/x86/mm/fault.c | 5 +
drivers/clocksource/arm_arch_timer.c | 2 +
include/linux/isolation.h | 175 ++++++
include/linux/sched.h | 4 +
include/linux/sched/task.h | 3 +
include/linux/tick.h | 1 +
include/linux/vmstat.h | 4 +
include/uapi/linux/prctl.h | 6 +
init/Kconfig | 28 +
kernel/Makefile | 1 +
kernel/context_tracking.c | 2 +
kernel/exit.c | 13 +
kernel/irq/irqdesc.c | 5 +
kernel/irq_work.c | 5 +-
kernel/isolation.c | 401 +++++++++++++
kernel/sched/core.c | 11 +
kernel/signal.c | 2 +
kernel/smp.c | 6 +-
kernel/sys.c | 6 +
kernel/time/tick-sched.c | 18 +
mm/vmstat.c | 19 +
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/task_isolation/Makefile | 6 +
tools/testing/selftests/task_isolation/config | 1 +
tools/testing/selftests/task_isolation/isolation.c | 643 +++++++++++++++++++++
58 files changed, 1561 insertions(+), 30 deletions(-)
create mode 100644 include/linux/isolation.h
create mode 100644 kernel/isolation.c
create mode 100644 tools/testing/selftests/task_isolation/Makefile
create mode 100644 tools/testing/selftests/task_isolation/config
create mode 100644 tools/testing/selftests/task_isolation/isolation.c

--
2.1.2




2017-11-03 17:10:26

by Chris Metcalf

Subject: [PATCH v16 10/13] arch/arm: enable task isolation functionality

From: Francis Giraldeau <[email protected]>

This patch ports the task isolation functionality to the 32-bit arm
architecture. Task isolation needs an additional thread flag, which
requires changing the entry assembly code to accept flag masks wider
than one byte; the constants _TIF_SYSCALL_WORK and _TIF_WORK_MASK are
therefore now loaded from the literal pool rather than used as
immediates. The rest of the patch is straightforward and reflects
what is done on other architectures.

To avoid problems with the tst instruction in the v7m build, we renumber
TIF_SECCOMP to bit 8 and let TIF_TASK_ISOLATION use bit 7.

Signed-off-by: Francis Giraldeau <[email protected]>
Signed-off-by: Chris Metcalf <[email protected]> [with modifications]
---
arch/arm/Kconfig | 1 +
arch/arm/include/asm/thread_info.h | 10 +++++++---
arch/arm/kernel/entry-common.S | 12 ++++++++----
arch/arm/kernel/ptrace.c | 10 ++++++++++
arch/arm/kernel/signal.c | 10 +++++++++-
arch/arm/kernel/smp.c | 4 ++++
arch/arm/mm/fault.c | 8 +++++++-
7 files changed, 46 insertions(+), 9 deletions(-)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 7888c9803eb0..3423c655a32b 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -48,6 +48,7 @@ config ARM
select HAVE_ARCH_KGDB if !CPU_ENDIAN_BE32 && MMU
select HAVE_ARCH_MMAP_RND_BITS if MMU
select HAVE_ARCH_SECCOMP_FILTER if (AEABI && !OABI_COMPAT)
+ select HAVE_ARCH_TASK_ISOLATION
select HAVE_ARCH_TRACEHOOK
select HAVE_ARM_SMCCC if CPU_V7
select HAVE_EBPF_JIT if !CPU_ENDIAN_BE32
diff --git a/arch/arm/include/asm/thread_info.h b/arch/arm/include/asm/thread_info.h
index 776757d1604a..a7b76ac9543d 100644
--- a/arch/arm/include/asm/thread_info.h
+++ b/arch/arm/include/asm/thread_info.h
@@ -142,7 +142,8 @@ extern int vfp_restore_user_hwstate(struct user_vfp __user *,
#define TIF_SYSCALL_TRACE 4 /* syscall trace active */
#define TIF_SYSCALL_AUDIT 5 /* syscall auditing active */
#define TIF_SYSCALL_TRACEPOINT 6 /* syscall tracepoint instrumentation */
-#define TIF_SECCOMP 7 /* seccomp syscall filtering active */
+#define TIF_TASK_ISOLATION 7 /* task isolation active */
+#define TIF_SECCOMP 8 /* seccomp syscall filtering active */

#define TIF_NOHZ 12 /* in adaptive nohz mode */
#define TIF_USING_IWMMXT 17
@@ -156,18 +157,21 @@ extern int vfp_restore_user_hwstate(struct user_vfp __user *,
#define _TIF_SYSCALL_TRACE (1 << TIF_SYSCALL_TRACE)
#define _TIF_SYSCALL_AUDIT (1 << TIF_SYSCALL_AUDIT)
#define _TIF_SYSCALL_TRACEPOINT (1 << TIF_SYSCALL_TRACEPOINT)
+#define _TIF_TASK_ISOLATION (1 << TIF_TASK_ISOLATION)
#define _TIF_SECCOMP (1 << TIF_SECCOMP)
#define _TIF_USING_IWMMXT (1 << TIF_USING_IWMMXT)

/* Checks for any syscall work in entry-common.S */
#define _TIF_SYSCALL_WORK (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | \
- _TIF_SYSCALL_TRACEPOINT | _TIF_SECCOMP)
+ _TIF_SYSCALL_TRACEPOINT | _TIF_SECCOMP | \
+ _TIF_TASK_ISOLATION)

/*
* Change these and you break ASM code in entry-common.S
*/
#define _TIF_WORK_MASK (_TIF_NEED_RESCHED | _TIF_SIGPENDING | \
- _TIF_NOTIFY_RESUME | _TIF_UPROBE)
+ _TIF_NOTIFY_RESUME | _TIF_UPROBE | \
+ _TIF_TASK_ISOLATION)

#endif /* __KERNEL__ */
#endif /* __ASM_ARM_THREAD_INFO_H */
diff --git a/arch/arm/kernel/entry-common.S b/arch/arm/kernel/entry-common.S
index 99c908226065..9ae3ef2dbc1e 100644
--- a/arch/arm/kernel/entry-common.S
+++ b/arch/arm/kernel/entry-common.S
@@ -53,7 +53,8 @@ ret_fast_syscall:
cmp r2, #TASK_SIZE
blne addr_limit_check_failed
ldr r1, [tsk, #TI_FLAGS] @ re-check for syscall tracing
- tst r1, #_TIF_SYSCALL_WORK | _TIF_WORK_MASK
+ ldr r2, =_TIF_SYSCALL_WORK | _TIF_WORK_MASK
+ tst r1, r2
bne fast_work_pending


@@ -83,7 +84,8 @@ ret_fast_syscall:
cmp r2, #TASK_SIZE
blne addr_limit_check_failed
ldr r1, [tsk, #TI_FLAGS] @ re-check for syscall tracing
- tst r1, #_TIF_SYSCALL_WORK | _TIF_WORK_MASK
+ ldr r2, =_TIF_SYSCALL_WORK | _TIF_WORK_MASK
+ tst r1, r2
beq no_work_pending
UNWIND(.fnend )
ENDPROC(ret_fast_syscall)
@@ -91,7 +93,8 @@ ENDPROC(ret_fast_syscall)
/* Slower path - fall through to work_pending */
#endif

- tst r1, #_TIF_SYSCALL_WORK
+ ldr r2, =_TIF_SYSCALL_WORK
+ tst r1, r2
bne __sys_trace_return_nosave
slow_work_pending:
mov r0, sp @ 'regs'
@@ -238,7 +241,8 @@ local_restart:
ldr r10, [tsk, #TI_FLAGS] @ check for syscall tracing
stmdb sp!, {r4, r5} @ push fifth and sixth args

- tst r10, #_TIF_SYSCALL_WORK @ are we tracing syscalls?
+ ldr r11, =_TIF_SYSCALL_WORK @ are we tracing syscalls?
+ tst r10, r11
bne __sys_trace

cmp scno, #NR_syscalls @ check upper syscall limit
diff --git a/arch/arm/kernel/ptrace.c b/arch/arm/kernel/ptrace.c
index 58e3771e4c5b..0cfcba5a93df 100644
--- a/arch/arm/kernel/ptrace.c
+++ b/arch/arm/kernel/ptrace.c
@@ -27,6 +27,7 @@
#include <linux/audit.h>
#include <linux/tracehook.h>
#include <linux/unistd.h>
+#include <linux/isolation.h>

#include <asm/pgtable.h>
#include <asm/traps.h>
@@ -936,6 +937,15 @@ asmlinkage int syscall_trace_enter(struct pt_regs *regs, int scno)
if (test_thread_flag(TIF_SYSCALL_TRACE))
tracehook_report_syscall(regs, PTRACE_SYSCALL_ENTER);

+ /*
+ * In task isolation mode, we may prevent the syscall from
+ * running, and if so we also deliver a signal to the process.
+ */
+ if (test_thread_flag(TIF_TASK_ISOLATION)) {
+ if (task_isolation_syscall(scno) == -1)
+ return -1;
+ }
+
/* Do seccomp after ptrace; syscall may have changed. */
#ifdef CONFIG_HAVE_ARCH_SECCOMP_FILTER
if (secure_computing(NULL) == -1)
diff --git a/arch/arm/kernel/signal.c b/arch/arm/kernel/signal.c
index b67ae12503f3..7c526efb301a 100644
--- a/arch/arm/kernel/signal.c
+++ b/arch/arm/kernel/signal.c
@@ -15,6 +15,7 @@
#include <linux/tracehook.h>
#include <linux/uprobes.h>
#include <linux/syscalls.h>
+#include <linux/isolation.h>

#include <asm/elf.h>
#include <asm/cacheflush.h>
@@ -605,6 +606,9 @@ static int do_signal(struct pt_regs *regs, int syscall)
return 0;
}

+#define WORK_PENDING_LOOP_FLAGS \
+ (_TIF_NEED_RESCHED | _TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE)
+
asmlinkage int
do_work_pending(struct pt_regs *regs, unsigned int thread_flags, int syscall)
{
@@ -641,7 +645,11 @@ do_work_pending(struct pt_regs *regs, unsigned int thread_flags, int syscall)
}
local_irq_disable();
thread_flags = current_thread_info()->flags;
- } while (thread_flags & _TIF_WORK_MASK);
+ } while (thread_flags & WORK_PENDING_LOOP_FLAGS);
+
+ if (thread_flags & _TIF_TASK_ISOLATION)
+ task_isolation_start();
+
return 0;
}

diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c
index c9a0a5299827..76f8b2010ddf 100644
--- a/arch/arm/kernel/smp.c
+++ b/arch/arm/kernel/smp.c
@@ -29,6 +29,7 @@
#include <linux/completion.h>
#include <linux/cpufreq.h>
#include <linux/irq_work.h>
+#include <linux/isolation.h>

#include <linux/atomic.h>
#include <asm/smp.h>
@@ -525,6 +526,7 @@ void arch_send_call_function_ipi_mask(const struct cpumask *mask)

void arch_send_wakeup_ipi_mask(const struct cpumask *mask)
{
+ task_isolation_remote_cpumask(mask, "wakeup IPI");
smp_cross_call(mask, IPI_WAKEUP);
}

@@ -544,6 +546,7 @@ void arch_irq_work_raise(void)
#ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST
void tick_broadcast(const struct cpumask *mask)
{
+ task_isolation_remote_cpumask(mask, "timer IPI");
smp_cross_call(mask, IPI_TIMER);
}
#endif
@@ -665,6 +668,7 @@ void handle_IPI(int ipinr, struct pt_regs *regs)

void smp_send_reschedule(int cpu)
{
+ task_isolation_remote(cpu, "reschedule IPI");
smp_cross_call(cpumask_of(cpu), IPI_RESCHEDULE);
}

diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
index 42f585379e19..052860948771 100644
--- a/arch/arm/mm/fault.c
+++ b/arch/arm/mm/fault.c
@@ -20,6 +20,7 @@
#include <linux/sched/debug.h>
#include <linux/highmem.h>
#include <linux/perf_event.h>
+#include <linux/isolation.h>

#include <asm/exception.h>
#include <asm/pgtable.h>
@@ -352,8 +353,13 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
/*
* Handle the "normal" case first - VM_FAULT_MAJOR
*/
- if (likely(!(fault & (VM_FAULT_ERROR | VM_FAULT_BADMAP | VM_FAULT_BADACCESS))))
+ if (likely(!(fault & (VM_FAULT_ERROR | VM_FAULT_BADMAP |
+ VM_FAULT_BADACCESS)))) {
+ /* No signal was generated, but notify task-isolation tasks. */
+ if (user_mode(regs))
+ task_isolation_interrupt("page fault at %#lx", addr);
return 0;
+ }

/*
* If we are in kernel mode at this point, we
--
2.1.2



2017-11-03 17:10:18

by Chris Metcalf

Subject: [PATCH v16 06/13] task_isolation: userspace hard isolation from kernel

The existing nohz_full mode is designed as a "soft" isolation mode
that makes tradeoffs to minimize userspace interruptions while
still attempting to avoid overhead in the kernel entry/exit path,
provide 100% kernel semantics, etc.

However, some applications require a "hard" commitment from the
kernel to avoid interruptions, in particular userspace device driver
style applications, such as high-speed networking code.

This change introduces a framework to allow applications
to elect to have the "hard" semantics as needed, specifying
prctl(PR_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE) to do so.

The kernel must be built with the new TASK_ISOLATION Kconfig flag
to enable this mode, and the kernel booted with an appropriate
"nohz_full=CPULIST isolcpus=CPULIST" boot argument to enable
nohz_full and isolcpus. The "task_isolation" state is then indicated
by setting a new task struct field, task_isolation_flags, to the
value passed by prctl(), and also setting a TIF_TASK_ISOLATION
bit in the thread_info flags. When the kernel is returning to
userspace from the prctl() call and sees TIF_TASK_ISOLATION set,
it calls the new task_isolation_start() routine to arrange for
the task to avoid being interrupted in the future.

With interrupts disabled, task_isolation_start() ensures that kernel
subsystems that might cause a future interrupt are quiesced. If it
doesn't succeed, it adjusts the syscall return value to indicate that
fact, and userspace can retry as desired. In addition to stopping
the scheduler tick, the code takes any actions needed to head off
future interrupts to the core: for example, quiescing a worker
thread that would otherwise be scheduled later (e.g. the vmstat
worker), or flushing per-cpu state now that would otherwise trigger
a future IPI (e.g. the mm lru per-cpu cache).

Once the task has returned to userspace after issuing the prctl(),
if it enters the kernel again via system call, page fault, or any
other exception or irq, the kernel will kill it with SIGKILL.
In addition to sending a signal, the code supports a kernel
command-line "task_isolation_debug" flag which causes a stack
backtrace to be generated whenever a task loses isolation.

To allow the state to be entered and exited, the syscall checking
test ignores the prctl(PR_TASK_ISOLATION) syscall so that we can
clear the bit again later, and ignores exit/exit_group to allow
exiting the task without a pointless signal being delivered.

The prctl() API allows for specifying a signal number to use instead
of the default SIGKILL, to allow for catching the notification
signal; for example, in a production environment, it might be
helpful to log information to the application logging mechanism
before exiting. Or, the signal handler might choose to reset the
program counter back to the code segment intended to be run isolated
via prctl() to continue execution.
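
To illustrate, here is a minimal sketch of the catchable-signal
usage; the choice of SIGUSR1 and the function name are illustrative
only, and the PR_TASK_ISOLATION definitions are assumed available
from the <linux/prctl.h> change below:

#include <signal.h>
#include <sys/prctl.h>
#include <unistd.h>

static void lost_isolation(int sig)
{
        static const char msg[] = "task isolation lost\n";

        /*
         * Only async-signal-safe work here; a real handler might log
         * through the application's own mechanism, or siglongjmp()
         * back to the code that re-enters isolation.
         */
        write(STDERR_FILENO, msg, sizeof(msg) - 1);
}

static void enable_isolation_catchable(void)
{
        struct sigaction sa = { .sa_handler = lost_isolation };

        sigaction(SIGUSR1, &sa, NULL);
        prctl(PR_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE |
              PR_TASK_ISOLATION_SET_SIG(SIGUSR1), 0, 0, 0);
}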

In a number of cases we can tell on one cpu that we are about to
interrupt another cpu, e.g. via an IPI or a TLB flush. In that
case we generate the diagnostic (and optional stack dump) on the
interrupting core, where we can deliver better diagnostics.
If the interrupt is not something caught by Linux (e.g. a
hypervisor interrupt) we can also request a reschedule IPI to
be sent to the remote core so it can be sure to generate a
signal to notify the process.

Separate patches that follow provide these changes for x86, tile,
arm, and arm64.

Signed-off-by: Chris Metcalf <[email protected]>
---
Documentation/admin-guide/kernel-parameters.txt | 6 +
include/linux/isolation.h | 175 +++++++++++
include/linux/sched.h | 4 +
include/uapi/linux/prctl.h | 6 +
init/Kconfig | 28 ++
kernel/Makefile | 1 +
kernel/context_tracking.c | 2 +
kernel/isolation.c | 402 ++++++++++++++++++++++++
kernel/signal.c | 2 +
kernel/sys.c | 6 +
10 files changed, 631 insertions(+)
create mode 100644 include/linux/isolation.h
create mode 100644 kernel/isolation.c

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 05496622b4ef..aaf278f2cfc3 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4025,6 +4025,12 @@
neutralize any effect of /proc/sys/kernel/sysrq.
Useful for debugging.

+ task_isolation_debug [KNL]
+ In kernels built with CONFIG_TASK_ISOLATION, this
+ setting will generate console backtraces to
+ accompany the diagnostics generated about
+ interrupting tasks running with task isolation.
+
tcpmhash_entries= [KNL,NET]
Set the number of tcp_metrics_hash slots.
Default value is 8192 or 16384 depending on total
diff --git a/include/linux/isolation.h b/include/linux/isolation.h
new file mode 100644
index 000000000000..8189a772affd
--- /dev/null
+++ b/include/linux/isolation.h
@@ -0,0 +1,175 @@
+/*
+ * Task isolation related global functions
+ */
+#ifndef _LINUX_ISOLATION_H
+#define _LINUX_ISOLATION_H
+
+#include <stdarg.h>
+#include <linux/errno.h>
+#include <linux/cpumask.h>
+#include <linux/prctl.h>
+#include <linux/types.h>
+
+struct task_struct;
+
+#ifdef CONFIG_TASK_ISOLATION
+
+/**
+ * task_isolation_request() - prctl hook to request task isolation
+ * @flags: Flags from <linux/prctl.h> PR_TASK_ISOLATION_xxx.
+ *
+ * This is called from the generic prctl() code for PR_TASK_ISOLATION.
+ *
+ * Return: 0 when task isolation is enabled, otherwise a negative
+ * errno.
+ */
+extern int task_isolation_request(unsigned int flags);
+
+/**
+ * task_isolation_start() - attempt to actually start task isolation
+ *
+ * This function should be invoked as the last thing prior to returning to
+ * user space if TIF_TASK_ISOLATION is set in the thread_info flags. It
+ * will attempt to quiesce the core and enter task-isolation mode. If it
+ * fails, it will reset the system call return value to an error code that
+ * indicates the failure mode.
+ */
+extern void task_isolation_start(void);
+
+/**
+ * task_isolation_syscall() - report a syscall from an isolated task
+ * @nr: The syscall number.
+ *
+ * This routine should be invoked at syscall entry if TIF_TASK_ISOLATION is
+ * set in the thread_info flags. It checks for valid syscalls,
+ * specifically prctl() with PR_TASK_ISOLATION, exit(), and exit_group().
+ * For any other syscall it will raise a signal and return failure.
+ *
+ * Return: 0 for acceptable syscalls, -1 for all others.
+ */
+extern int task_isolation_syscall(int nr);
+
+/**
+ * _task_isolation_interrupt() - report an interrupt of an isolated task
+ * @fmt: A format string describing the interrupt
+ * @...: Format arguments, if any.
+ *
+ * This routine should be invoked at any exception or IRQ if
+ * TIF_TASK_ISOLATION is set in the thread_info flags. It is not necessary
+ * to invoke it if the exception will generate a signal anyway (e.g. a bad
+ * page fault), and in that case it is preferable not to invoke it but just
+ * rely on the standard Linux signal. The macro task_isolation_interrupt()
+ * wraps the TIF_TASK_ISOLATION flag test to simplify the caller code.
+ */
+extern void _task_isolation_interrupt(const char *fmt, ...);
+#define task_isolation_interrupt(fmt, ...) \
+ do { \
+ if (current_thread_info()->flags & _TIF_TASK_ISOLATION) \
+ _task_isolation_interrupt(fmt, ## __VA_ARGS__); \
+ } while (0)
+
+/**
+ * _task_isolation_remote() - report a remote interrupt of an isolated task
+ * @cpu: The remote cpu that is about to be interrupted.
+ * @do_interrupt: Whether we should generate an extra interrupt.
+ * @fmt: A format string describing the interrupt
+ * @...: Format arguments, if any.
+ *
+ * This routine should be invoked any time a remote IPI or other type of
+ * interrupt is being delivered to another cpu. The function will check to
+ * see if the target core is running a task-isolation task, and generate a
+ * diagnostic on the console if so; in addition, we tag the task so it
+ * doesn't generate another diagnostic when the interrupt actually arrives.
+ * Generating a diagnostic remotely yields a clearer indication of what
+ * happened than just reporting only when the remote core is interrupted.
+ *
+ * The @do_interrupt flag, if true, causes the routine to not just print
+ * the diagnostic, but also to generate a reschedule interrupt to the
+ * remote core that is being interrupted. This is necessary if the remote
+ * interrupt being diagnosed will not otherwise be visible to the remote
+ * core (e.g. a hypervisor service is being invoked on the remote core).
+ * Sending a reschedule will force the core to trigger the isolation signal
+ * and exit isolation mode.
+ *
+ * The task_isolation_remote() macro passes @do_interrupt as false, and
+ * the task_isolation_remote_interrupt() macro passes it as true.
+ */
+extern void _task_isolation_remote(int cpu, bool do_interrupt,
+ const char *fmt, ...);
+#define task_isolation_remote(cpu, fmt, ...) \
+ _task_isolation_remote(cpu, false, fmt, ## __VA_ARGS__)
+#define task_isolation_remote_interrupt(cpu, fmt, ...) \
+ _task_isolation_remote(cpu, true, fmt, ## __VA_ARGS__)
+
+/**
+ * _task_isolation_remote_cpumask() - report interruption of multiple cpus
+ * @mask: The set of remote cpus that are about to be interrupted.
+ * @do_interrupt: Whether we should generate an extra interrupt.
+ * @fmt: A format string describing the interrupt
+ * @...: Format arguments, if any.
+ *
+ * This is the cpumask variant of _task_isolation_remote(). We
+ * generate a single-line diagnostic message even if multiple remote
+ * task-isolation cpus are being interrupted.
+ */
+extern void _task_isolation_remote_cpumask(const struct cpumask *mask,
+ bool do_interrupt,
+ const char *fmt, ...);
+#define task_isolation_remote_cpumask(cpumask, fmt, ...) \
+ _task_isolation_remote_cpumask(cpumask, false, fmt, ## __VA_ARGS__)
+#define task_isolation_remote_cpumask_interrupt(cpumask, fmt, ...) \
+ _task_isolation_remote_cpumask(cpumask, true, fmt, ## __VA_ARGS__)
+
+/**
+ * _task_isolation_signal() - disable task isolation when signal is pending
+ * @task: The task for which to disable isolation.
+ *
+ * This function generates a diagnostic and disables task isolation; it
+ * should be called if TIF_TASK_ISOLATION is set when notifying a task of a
+ * pending signal. The task_isolation_interrupt() function normally
+ * generates a diagnostic for events that just interrupt a task without
+ * generating a signal; here we need to hook the paths that correspond to
+ * interrupts that do generate a signal. The macro task_isolation_signal()
+ * wraps the TIF_TASK_ISOLATION flag test to simplify the caller code.
+ */
+extern void _task_isolation_signal(struct task_struct *task);
+#define task_isolation_signal(task) do { \
+ if (task_thread_info(task)->flags & _TIF_TASK_ISOLATION) \
+ _task_isolation_signal(task); \
+ } while (0)
+
+/**
+ * task_isolation_user_exit() - debug all user_exit calls
+ *
+ * By default, we don't generate an exception in the low-level user_exit()
+ * code, because programs lose the ability to disable task isolation: the
+ * user_exit() hook will cause a signal prior to task_isolation_syscall()
+ * disabling task isolation. In addition, it means that we lose all the
+ * diagnostic info otherwise available from task_isolation_interrupt() hooks
+ * later in the interrupt-handling process. But you may enable it here for
+ * a special kernel build if you are having undiagnosed userspace jitter.
+ */
+static inline void task_isolation_user_exit(void)
+{
+#ifdef DEBUG_TASK_ISOLATION
+ task_isolation_interrupt("user_exit");
+#endif
+}
+
+#else /* !CONFIG_TASK_ISOLATION */
+static inline int task_isolation_request(unsigned int flags) { return -EINVAL; }
+static inline void task_isolation_start(void) { }
+static inline int task_isolation_syscall(int nr) { return 0; }
+static inline void task_isolation_interrupt(const char *fmt, ...) { }
+static inline void task_isolation_remote(int cpu, const char *fmt, ...) { }
+static inline void task_isolation_remote_interrupt(int cpu,
+ const char *fmt, ...) { }
+static inline void task_isolation_remote_cpumask(const struct cpumask *mask,
+ const char *fmt, ...) { }
+static inline void task_isolation_remote_cpumask_interrupt(
+ const struct cpumask *mask, const char *fmt, ...) { }
+static inline void task_isolation_signal(struct task_struct *task) { }
+static inline void task_isolation_user_exit(void) { }
+#endif
+
+#endif /* _LINUX_ISOLATION_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 26a7df4e558c..739d81a44e13 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1096,6 +1096,10 @@ struct task_struct {
/* Used by LSM modules for access restriction: */
void *security;
#endif
+#ifdef CONFIG_TASK_ISOLATION
+ unsigned short task_isolation_flags; /* prctl */
+ unsigned short task_isolation_state;
+#endif

/*
* New fields for task_struct should be added above here, so that
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index a8d0759a9e40..5a954302fdea 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -197,4 +197,10 @@ struct prctl_mm_map {
# define PR_CAP_AMBIENT_LOWER 3
# define PR_CAP_AMBIENT_CLEAR_ALL 4

+/* Enable task_isolation mode for TASK_ISOLATION kernels. */
+#define PR_TASK_ISOLATION 48
+# define PR_TASK_ISOLATION_ENABLE (1 << 0)
+# define PR_TASK_ISOLATION_SET_SIG(sig) (((sig) & 0x7f) << 8)
+# define PR_TASK_ISOLATION_GET_SIG(bits) (((bits) >> 8) & 0x7f)
+
#endif /* _LINUX_PRCTL_H */
diff --git a/init/Kconfig b/init/Kconfig
index 78cb2461012e..500c2aeb49b7 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -474,6 +474,34 @@ endmenu # "CPU/Task time and stats accounting"

source "kernel/rcu/Kconfig"

+config HAVE_ARCH_TASK_ISOLATION
+ bool
+
+config TASK_ISOLATION
+ bool "Provide hard CPU isolation from the kernel on demand"
+ depends on NO_HZ_FULL && HAVE_ARCH_TASK_ISOLATION
+ help
+
+ Allow userspace processes that place themselves on cores with
+ nohz_full and isolcpus enabled, and run prctl(PR_TASK_ISOLATION),
+ to "isolate" themselves from the kernel. Prior to returning to
+ userspace, isolated tasks will arrange that no future kernel
+ activity will interrupt the task while the task is running in
+ userspace. Attempting to re-enter the kernel while in this mode
+ will cause the task to be terminated with a signal; you must
+ explicitly use prctl() to disable task isolation before resuming
+ normal use of the kernel.
+
+ This "hard" isolation from the kernel is required for userspace
+ tasks that are running hard real-time tasks in userspace, such as
+ a high-speed network driver in userspace. Without this option, but
+ with NO_HZ_FULL enabled, the kernel will make a best-faith, "soft"
+ effort to shield a single userspace process from interrupts, but
+ makes no guarantees.
+
+ You should say "N" unless you are intending to run a
+ high-performance userspace driver or similar task.
+
config BUILD_BIN2C
bool
default n
diff --git a/kernel/Makefile b/kernel/Makefile
index ed470aac53da..21c908c3083c 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -111,6 +111,7 @@ obj-$(CONFIG_CONTEXT_TRACKING) += context_tracking.o
obj-$(CONFIG_TORTURE_TEST) += torture.o

obj-$(CONFIG_HAS_IOMEM) += memremap.o
+obj-$(CONFIG_TASK_ISOLATION) += isolation.o

$(obj)/configs.o: $(obj)/config_data.h

diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 9ad37b9e44a7..df0c7a07c11f 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -20,6 +20,7 @@
#include <linux/hardirq.h>
#include <linux/export.h>
#include <linux/kprobes.h>
+#include <linux/isolation.h>

#define CREATE_TRACE_POINTS
#include <trace/events/context_tracking.h>
@@ -156,6 +157,7 @@ void __context_tracking_exit(enum ctx_state state)
if (state == CONTEXT_USER) {
vtime_user_exit(current);
trace_user_exit(0);
+ task_isolation_user_exit();
}
}
__this_cpu_write(context_tracking.state, CONTEXT_KERNEL);
diff --git a/kernel/isolation.c b/kernel/isolation.c
new file mode 100644
index 000000000000..e2baa21af03a
--- /dev/null
+++ b/kernel/isolation.c
@@ -0,0 +1,402 @@
+/*
+ * linux/kernel/isolation.c
+ *
+ * Implementation for task isolation.
+ *
+ * Distributed under GPLv2.
+ */
+
+#include <linux/mm.h>
+#include <linux/swap.h>
+#include <linux/vmstat.h>
+#include <linux/isolation.h>
+#include <linux/syscalls.h>
+#include <linux/smp.h>
+#include <linux/tick.h>
+#include <asm/unistd.h>
+#include <asm/syscall.h>
+#include "time/tick-sched.h"
+
+/*
+ * These values are stored in task_isolation_state.
+ * Note that STATE_NORMAL + TIF_TASK_ISOLATION means we are still
+ * returning from sys_prctl() to userspace.
+ */
+enum {
+ STATE_NORMAL = 0, /* Not isolated */
+ STATE_ISOLATED = 1, /* In userspace, isolated */
+ STATE_WARNED = 2 /* Like ISOLATED but console warning issued */
+};
+
+cpumask_var_t task_isolation_map;
+
+/* We can run on cpus that are isolated from the scheduler and are nohz_full. */
+static int __init task_isolation_init(void)
+{
+ if (alloc_cpumask_var(&task_isolation_map, GFP_KERNEL))
+ cpumask_and(task_isolation_map, cpu_isolated_map,
+ tick_nohz_full_mask);
+
+ return 0;
+}
+core_initcall(task_isolation_init);
+
+static inline bool is_isolation_cpu(int cpu)
+{
+ return task_isolation_map != NULL &&
+ cpumask_test_cpu(cpu, task_isolation_map);
+}
+
+/* Enable stack backtraces of any interrupts of task_isolation cores. */
+static bool task_isolation_debug;
+static int __init task_isolation_debug_func(char *str)
+{
+ task_isolation_debug = true;
+ return 1;
+}
+__setup("task_isolation_debug", task_isolation_debug_func);
+
+/*
+ * Dump stack if need be. This can be helpful even from the final exit
+ * to usermode code since stack traces sometimes carry information about
+ * what put you into the kernel, e.g. an interrupt number encoded in
+ * the initial entry stack frame that is still visible at exit time.
+ */
+static void debug_dump_stack(void)
+{
+ if (task_isolation_debug)
+ dump_stack();
+}
+
+/*
+ * Set the flags word but don't try to actually start task isolation yet.
+ * We will start it when entering user space in task_isolation_start().
+ */
+int task_isolation_request(unsigned int flags)
+{
+ struct task_struct *task = current;
+
+ /*
+ * The task isolation flags should always be cleared just by
+ * virtue of having entered the kernel.
+ */
+ WARN_ON_ONCE(test_tsk_thread_flag(task, TIF_TASK_ISOLATION));
+ WARN_ON_ONCE(task->task_isolation_flags != 0);
+ WARN_ON_ONCE(task->task_isolation_state != STATE_NORMAL);
+
+ task->task_isolation_flags = flags;
+ if (!(task->task_isolation_flags & PR_TASK_ISOLATION_ENABLE))
+ return 0;
+
+ /* We are trying to enable task isolation. */
+ set_tsk_thread_flag(task, TIF_TASK_ISOLATION);
+
+ /*
+ * Shut down the vmstat worker so we're not interrupted later.
+ * We have to try to do this here (with interrupts enabled) since
+ * we are canceling delayed work and will call flush_work()
+ * (which enables interrupts) and possibly schedule().
+ */
+ quiet_vmstat_sync();
+
+ /* We return 0 here but we may change that in task_isolation_start(). */
+ return 0;
+}
+
+/* Disable task isolation in the specified task. */
+static void stop_isolation(struct task_struct *p)
+{
+ p->task_isolation_flags = 0;
+ p->task_isolation_state = STATE_NORMAL;
+ clear_tsk_thread_flag(p, TIF_TASK_ISOLATION);
+}
+
+/*
+ * This code runs with interrupts disabled just before the return to
+ * userspace, after a prctl() has requested enabling task isolation.
+ * We take whatever steps are needed to avoid being interrupted later:
+ * drain the lru pages, stop the scheduler tick, etc. More
+ * functionality may be added here later to avoid other types of
+ * interrupts from other kernel subsystems.
+ *
+ * If we can't enable task isolation, we update the syscall return
+ * value with an appropriate error.
+ */
+void task_isolation_start(void)
+{
+ int error;
+
+ /*
+ * We should only be called in STATE_NORMAL (isolation disabled),
+ * on our way out of the kernel from the prctl() that turned it on.
+ * If we are exiting from the kernel in another state, it means we
+ * made it back into the kernel without disabling task isolation,
+ * and we should investigate how (and in any case disable task
+ * isolation at this point). We are clearly not on the path back
+ * from the prctl() so we don't touch the syscall return value.
+ */
+ if (WARN_ON_ONCE(current->task_isolation_state != STATE_NORMAL)) {
+ stop_isolation(current);
+ return;
+ }
+
+ /*
+ * Must be affinitized to a single core with task isolation possible.
+ * In principle this could be remotely modified between the prctl()
+ * and the return to userspace, so we have to check it here.
+ */
+ if (cpumask_weight(&current->cpus_allowed) != 1 ||
+ !is_isolation_cpu(smp_processor_id())) {
+ error = -EINVAL;
+ goto error;
+ }
+
+ /* If the vmstat delayed work is not canceled, we have to try again. */
+ if (!vmstat_idle()) {
+ error = -EAGAIN;
+ goto error;
+ }
+
+ /* Try to stop the dynamic tick. */
+ error = try_stop_full_tick();
+ if (error)
+ goto error;
+
+ /* Drain the pagevecs to avoid unnecessary IPI flushes later. */
+ lru_add_drain();
+
+ current->task_isolation_state = STATE_ISOLATED;
+ return;
+
+error:
+ stop_isolation(current);
+ syscall_set_return_value(current, current_pt_regs(), error, 0);
+}
+
+/* Stop task isolation on the remote task and send it a signal. */
+static void send_isolation_signal(struct task_struct *task)
+{
+ int flags = task->task_isolation_flags;
+ siginfo_t info = {
+ .si_signo = PR_TASK_ISOLATION_GET_SIG(flags) ?: SIGKILL,
+ };
+
+ stop_isolation(task);
+ send_sig_info(info.si_signo, &info, task);
+}
+
+/* Only a few syscalls are valid once we are in task isolation mode. */
+static bool is_acceptable_syscall(int syscall)
+{
+ /* No need to incur an isolation signal if we are just exiting. */
+ if (syscall == __NR_exit || syscall == __NR_exit_group)
+ return true;
+
+ /* Check to see if it's the prctl for isolation. */
+ if (syscall == __NR_prctl) {
+ unsigned long arg;
+
+ syscall_get_arguments(current, current_pt_regs(), 0, 1, &arg);
+ if (arg == PR_TASK_ISOLATION)
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * This routine is called from syscall entry, prevents most syscalls
+ * from executing, and if needed raises a signal to notify the process.
+ *
+ * Note that we have to stop isolation before we even print a message
+ * here, since otherwise we might end up reporting an interrupt due to
+ * kicking the printk handling code, rather than reporting the true
+ * cause of interrupt here.
+ */
+int task_isolation_syscall(int syscall)
+{
+ struct task_struct *task = current;
+
+ if (is_acceptable_syscall(syscall)) {
+ stop_isolation(task);
+ return 0;
+ }
+
+ send_isolation_signal(task);
+
+ pr_warn("%s/%d (cpu %d): task_isolation lost due to syscall %d\n",
+ task->comm, task->pid, smp_processor_id(), syscall);
+ debug_dump_stack();
+
+ syscall_set_return_value(task, current_pt_regs(), -ERESTARTNOINTR, -1);
+ return -1;
+}
+
+/*
+ * This routine is called from any exception or irq that doesn't
+ * otherwise trigger a signal to the user process (e.g. page fault).
+ * We don't warn if we are in STATE_WARNED in case a remote cpu already
+ * reported that it was going to interrupt us, so we don't generate
+ * a lot of confusingly similar messages about the same event.
+ */
+void _task_isolation_interrupt(const char *fmt, ...)
+{
+ struct task_struct *task = current;
+ va_list args;
+ char buf[100];
+ bool do_warn;
+
+ /* RCU should have been enabled prior to this point. */
+ RCU_LOCKDEP_WARN(!rcu_is_watching(), "kernel entry without RCU");
+
+ /*
+ * Avoid reporting interrupts that happen after we have prctl'ed
+ * to enable isolation, but before we have returned to userspace.
+ */
+ if (task->task_isolation_state == STATE_NORMAL)
+ return;
+
+ do_warn = (task->task_isolation_state == STATE_ISOLATED);
+
+ va_start(args, fmt);
+ vsnprintf(buf, sizeof(buf), fmt, args);
+ va_end(args);
+
+ /* Handle NMIs minimally, since we can't send a signal. */
+ if (in_nmi()) {
+ pr_err("%s/%d (cpu %d): in NMI; not delivering signal\n",
+ task->comm, task->pid, smp_processor_id());
+ } else {
+ send_isolation_signal(task);
+ }
+
+ if (do_warn) {
+ pr_warn("%s/%d (cpu %d): task_isolation lost due to %s\n",
+ task->comm, task->pid, smp_processor_id(), buf);
+ debug_dump_stack();
+ }
+}
+
+/*
+ * Called before we wake up a task that has a signal to process.
+ * Needs to be done to handle interrupts that trigger signals, which
+ * we don't catch with task_isolation_interrupt() hooks.
+ */
+void _task_isolation_signal(struct task_struct *task)
+{
+ bool do_warn = (task->task_isolation_state == STATE_ISOLATED);
+
+ stop_isolation(task);
+ if (do_warn) {
+ pr_warn("%s/%d (cpu %d): task_isolation lost due to signal\n",
+ task->comm, task->pid, task_cpu(task));
+ debug_dump_stack();
+ }
+}
+
+/*
+ * Return a task_struct pointer (with ref count bumped up) for the
+ * specified cpu if the task running on that cpu at this moment is in
+ * isolation mode and hasn't yet been warned, otherwise NULL.
+ * In addition, toggle the task state to WARNED in anticipation of
+ * doing a printk, and send a reschedule IPI if needed.
+ */
+static struct task_struct *isolation_task(int cpu, int do_interrupt)
+{
+ struct task_struct *p = try_get_task_struct_on_cpu(cpu);
+
+ if (p == NULL)
+ return NULL;
+
+ if (p->task_isolation_state != STATE_ISOLATED)
+ goto bad_task;
+
+ /*
+ * If we are claiming to be delivering a remote interrupt to our
+ * own task, this has to be a bug, since here we are already in the
+ * kernel, and somehow we didn't reset to STATE_NORMAL.
+ */
+ if (WARN_ON_ONCE(p == current)) {
+ stop_isolation(p);
+ goto bad_task;
+ }
+
+ p->task_isolation_state = STATE_WARNED;
+ if (do_interrupt)
+ smp_send_reschedule(cpu);
+
+ return p;
+
+bad_task:
+ put_task_struct(p);
+ return NULL;
+}
+
+/*
+ * Generate a stack backtrace if we are going to interrupt another task
+ * isolation process.
+ */
+void _task_isolation_remote(int cpu, bool do_interrupt, const char *fmt, ...)
+{
+ struct task_struct *p;
+ va_list args;
+ char buf[200];
+
+ if (!is_isolation_cpu(cpu))
+ return;
+
+ p = isolation_task(cpu, do_interrupt);
+ if (p == NULL)
+ return;
+
+ va_start(args, fmt);
+ vsnprintf(buf, sizeof(buf), fmt, args);
+ va_end(args);
+ pr_warn("%s/%d (cpu %d): task_isolation lost due to %s by %s/%d on cpu %d\n",
+ p->comm, p->pid, cpu, buf,
+ current->comm, current->pid, smp_processor_id());
+ put_task_struct(p);
+ debug_dump_stack();
+}
+
+/*
+ * Generate a stack backtrace if any of the cpus in "mask" are running
+ * task isolation processes.
+ */
+void _task_isolation_remote_cpumask(const struct cpumask *mask,
+ bool do_interrupt, const char *fmt, ...)
+{
+ struct task_struct *p = NULL;
+ cpumask_var_t warn_mask;
+ va_list args;
+ char buf[200];
+ int cpu;
+
+ if (task_isolation_map == NULL ||
+ !zalloc_cpumask_var(&warn_mask, GFP_KERNEL))
+ return;
+
+ for_each_cpu_and(cpu, mask, task_isolation_map) {
+ if (p)
+ put_task_struct(p);
+ p = isolation_task(cpu, do_interrupt);
+ if (p)
+ cpumask_set_cpu(cpu, warn_mask);
+ }
+ if (p == NULL)
+ goto done;
+
+ va_start(args, fmt);
+ vsnprintf(buf, sizeof(buf), fmt, args);
+ va_end(args);
+ pr_warn("%s/%d %s %*pbl): task_isolation lost due to %s by %s/%d on cpu %d\n",
+ p->comm, p->pid,
+ cpumask_weight(warn_mask) == 1 ? "(cpu" : "etc (cpus",
+ cpumask_pr_args(warn_mask), buf,
+ current->comm, current->pid, smp_processor_id());
+ put_task_struct(p);
+ debug_dump_stack();
+
+done:
+ free_cpumask_var(warn_mask);
+}
diff --git a/kernel/signal.c b/kernel/signal.c
index 800a18f77732..fa4786050431 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -40,6 +40,7 @@
#include <linux/cn_proc.h>
#include <linux/compiler.h>
#include <linux/posix-timers.h>
+#include <linux/isolation.h>

#define CREATE_TRACE_POINTS
#include <trace/events/signal.h>
@@ -658,6 +659,7 @@ int dequeue_signal(struct task_struct *tsk, sigset_t *mask, siginfo_t *info)
*/
void signal_wake_up_state(struct task_struct *t, unsigned int state)
{
+ task_isolation_signal(t);
set_tsk_thread_flag(t, TIF_SIGPENDING);
/*
* TASK_WAKEKILL also means wake it up in the stopped/traced/killable
diff --git a/kernel/sys.c b/kernel/sys.c
index 9aebc2935013..91ed79605768 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -41,6 +41,7 @@
#include <linux/syscore_ops.h>
#include <linux/version.h>
#include <linux/ctype.h>
+#include <linux/isolation.h>

#include <linux/compat.h>
#include <linux/syscalls.h>
@@ -2385,6 +2386,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
case PR_GET_FP_MODE:
error = GET_FP_MODE(me);
break;
+ case PR_TASK_ISOLATION:
+ if (arg3 || arg4 || arg5)
+ return -EINVAL;
+ error = task_isolation_request(arg2);
+ break;
default:
error = -EINVAL;
break;
--
2.1.2



2017-11-03 17:08:32

by Chris Metcalf

Subject: [PATCH v16 04/13] Add try_get_task_struct_on_cpu() to scheduler for task isolation

Task isolation wants to be able to verify that a remote core is
running an isolated task to determine if it should generate a
diagnostic, and also possibly interrupt it.

This API returns a pointer to the task_struct of the task that
was running on the specified core at the moment of the request;
it uses try_get_task_struct() to increment the ref count on the
returned task_struct so that the caller can examine it even if
the actual remote task has already exited by that point.

Signed-off-by: Chris Metcalf <[email protected]>
---
include/linux/sched/task.h | 1 +
kernel/sched/core.c | 11 +++++++++++
2 files changed, 12 insertions(+)

diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 270ff76d43d9..6785db926857 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -97,6 +97,7 @@ static inline void put_task_struct(struct task_struct *t)

struct task_struct *task_rcu_dereference(struct task_struct **ptask);
struct task_struct *try_get_task_struct(struct task_struct **ptask);
+struct task_struct *try_get_task_struct_on_cpu(int cpu);


#ifdef CONFIG_ARCH_WANTS_DYNAMIC_TASK_STRUCT
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d17c5da523a0..2728154057ae 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -670,6 +670,17 @@ bool sched_can_stop_tick(struct rq *rq)
}
#endif /* CONFIG_NO_HZ_FULL */

+/*
+ * Return a pointer to the task_struct for the task that is running on
+ * the specified cpu at the time of the call (note that the task may have
+ * exited by the time the caller inspects the resulting task_struct).
+ * Caller must put_task_struct() with the pointer when finished with it.
+ */
+struct task_struct *try_get_task_struct_on_cpu(int cpu)
+{
+ return try_get_task_struct(&cpu_rq(cpu)->curr);
+}
+
void sched_avg_update(struct rq *rq)
{
s64 period = sched_avg_period();
--
2.1.2



2017-11-03 17:09:22

by Chris Metcalf

Subject: [PATCH v16 08/13] arch/x86: enable task isolation functionality

In prepare_exit_to_usermode(), call task_isolation_start() for
TIF_TASK_ISOLATION tasks.

In syscall_trace_enter(), add the necessary support for
reporting syscalls for task-isolation processes.

Add task_isolation_interrupt() calls for the kernel exception types
that do not result in signals, namely non-signalling page faults
and non-signalling MPX fixups.

Signed-off-by: Chris Metcalf <[email protected]>
---
arch/x86/Kconfig | 1 +
arch/x86/entry/common.c | 14 ++++++++++++++
arch/x86/include/asm/apic.h | 3 +++
arch/x86/include/asm/thread_info.h | 8 +++++---
arch/x86/kernel/smp.c | 2 ++
arch/x86/kernel/traps.c | 3 +++
arch/x86/mm/fault.c | 5 +++++
7 files changed, 33 insertions(+), 3 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 971feac13506..45967840b81a 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -114,6 +114,7 @@ config X86
select HAVE_ARCH_MMAP_RND_COMPAT_BITS if MMU && COMPAT
select HAVE_ARCH_COMPAT_MMAP_BASES if MMU && COMPAT
select HAVE_ARCH_SECCOMP_FILTER
+ select HAVE_ARCH_TASK_ISOLATION
select HAVE_ARCH_TRACEHOOK
select HAVE_ARCH_TRANSPARENT_HUGEPAGE
select HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD if X86_64
diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 03505ffbe1b6..2c70b915d1f2 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -24,6 +24,7 @@
#include <linux/uprobes.h>
#include <linux/livepatch.h>
#include <linux/syscalls.h>
+#include <linux/isolation.h>

#include <asm/desc.h>
#include <asm/traps.h>
@@ -87,6 +88,16 @@ static long syscall_trace_enter(struct pt_regs *regs)
if (emulated)
return -1L;

+ /*
+ * In task isolation mode, we may prevent the syscall from
+ * running, and if so we also deliver a signal to the process.
+ */
+ if (work & _TIF_TASK_ISOLATION) {
+ if (task_isolation_syscall(regs->orig_ax) == -1)
+ return -1L;
+ work &= ~_TIF_TASK_ISOLATION;
+ }
+
#ifdef CONFIG_SECCOMP
/*
* Do seccomp after ptrace, to catch any tracer changes.
@@ -196,6 +207,9 @@ __visible inline void prepare_exit_to_usermode(struct pt_regs *regs)
if (unlikely(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS))
exit_to_usermode_loop(regs, cached_flags);

+ if (cached_flags & _TIF_TASK_ISOLATION)
+ task_isolation_start();
+
#ifdef CONFIG_COMPAT
/*
* Compat syscalls set TS_COMPAT. Make sure we clear it before
diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
index 5f01671c68f2..c70cb9cacfc0 100644
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -2,6 +2,7 @@
#define _ASM_X86_APIC_H

#include <linux/cpumask.h>
+#include <linux/isolation.h>

#include <asm/alternative.h>
#include <asm/cpufeature.h>
@@ -618,6 +619,7 @@ extern void irq_exit(void);

static inline void entering_irq(void)
{
+ task_isolation_interrupt("irq");
irq_enter();
}

@@ -629,6 +631,7 @@ static inline void entering_ack_irq(void)

static inline void ipi_entering_ack_irq(void)
{
+ task_isolation_interrupt("ack irq");
irq_enter();
ack_APIC_irq();
}
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 89e7eeb5cec1..aa9d9d817f8b 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -85,6 +85,7 @@ struct thread_info {
#define TIF_USER_RETURN_NOTIFY 11 /* notify kernel of userspace return */
#define TIF_UPROBE 12 /* breakpointed or singlestepping */
#define TIF_PATCH_PENDING 13 /* pending live patching update */
+#define TIF_TASK_ISOLATION 14 /* task isolation enabled for task */
#define TIF_NOCPUID 15 /* CPUID is not accessible in userland */
#define TIF_NOTSC 16 /* TSC is not accessible in userland */
#define TIF_IA32 17 /* IA32 compatibility process */
@@ -111,6 +112,7 @@ struct thread_info {
#define _TIF_USER_RETURN_NOTIFY (1 << TIF_USER_RETURN_NOTIFY)
#define _TIF_UPROBE (1 << TIF_UPROBE)
#define _TIF_PATCH_PENDING (1 << TIF_PATCH_PENDING)
+#define _TIF_TASK_ISOLATION (1 << TIF_TASK_ISOLATION)
#define _TIF_NOCPUID (1 << TIF_NOCPUID)
#define _TIF_NOTSC (1 << TIF_NOTSC)
#define _TIF_IA32 (1 << TIF_IA32)
@@ -132,15 +134,15 @@ struct thread_info {
#define _TIF_WORK_SYSCALL_ENTRY \
(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_EMU | _TIF_SYSCALL_AUDIT | \
_TIF_SECCOMP | _TIF_SYSCALL_TRACEPOINT | \
- _TIF_NOHZ)
+ _TIF_NOHZ | _TIF_TASK_ISOLATION)

/* work to do on any return to user space */
#define _TIF_ALLWORK_MASK \
(_TIF_SYSCALL_TRACE | _TIF_NOTIFY_RESUME | _TIF_SIGPENDING | \
_TIF_NEED_RESCHED | _TIF_SINGLESTEP | _TIF_SYSCALL_EMU | \
_TIF_SYSCALL_AUDIT | _TIF_USER_RETURN_NOTIFY | _TIF_UPROBE | \
- _TIF_PATCH_PENDING | _TIF_NOHZ | _TIF_SYSCALL_TRACEPOINT | \
- _TIF_FSCHECK)
+ _TIF_PATCH_PENDING | _TIF_TASK_ISOLATION | _TIF_NOHZ | \
+ _TIF_SYSCALL_TRACEPOINT | _TIF_FSCHECK)

/* flags to check in __switch_to() */
#define _TIF_WORK_CTXSW \
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index 5c574dff4c1a..73bc58029292 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -23,6 +23,7 @@
#include <linux/interrupt.h>
#include <linux/cpu.h>
#include <linux/gfp.h>
+#include <linux/isolation.h>

#include <asm/mtrr.h>
#include <asm/tlbflush.h>
@@ -128,6 +129,7 @@ static void native_smp_send_reschedule(int cpu)
WARN(1, "sched: Unexpected reschedule of offline CPU#%d!\n", cpu);
return;
}
+ task_isolation_remote(cpu, "reschedule IPI");
apic->send_IPI(cpu, RESCHEDULE_VECTOR);
}

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 67db4f43309e..a48ebfc7b98c 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -37,6 +37,7 @@
#include <linux/mm.h>
#include <linux/smp.h>
#include <linux/io.h>
+#include <linux/isolation.h>

#if defined(CONFIG_EDAC)
#include <linux/edac.h>
@@ -467,6 +468,8 @@ dotraplinkage void do_bounds(struct pt_regs *regs, long error_code)
case 2: /* Bound directory has invalid entry. */
if (mpx_handle_bd_fault())
goto exit_trap;
+ /* No signal was generated, but notify task-isolation tasks. */
+ task_isolation_interrupt("bounds check");
break; /* Success, it was handled */
case 1: /* Bound violation. */
info = mpx_generate_siginfo(regs);
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index e2baeaa053a5..bd479142e3ca 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -15,6 +15,7 @@
#include <linux/prefetch.h> /* prefetchw */
#include <linux/context_tracking.h> /* exception_enter(), ... */
#include <linux/uaccess.h> /* faulthandler_disabled() */
+#include <linux/isolation.h> /* task_isolation_interrupt */

#include <asm/cpufeature.h> /* boot_cpu_has, ... */
#include <asm/traps.h> /* dotraplinkage, ... */
@@ -1486,6 +1487,10 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address);
}

+ /* No signal was generated, but notify task-isolation tasks. */
+ if (flags & PF_USER)
+ task_isolation_interrupt("page fault at %#lx", address);
+
check_v8086_mode(regs, address, tsk);
}
NOKPROBE_SYMBOL(__do_page_fault);
--
2.1.2


2017-11-03 17:08:33

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH v16 03/13] Revert "sched/core: Drop the unused try_get_task_struct() helper function"

This reverts commit f11cc0760b8397e0d230122606421b6a96e9f869.
We need this function again for the try_get_task_struct_on_cpu()
helper introduced later in this series.
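
For context, here is a minimal sketch of how try_get_task_struct_on_cpu()
can sit on top of the restored helper (an illustration only, not the
actual code added later in the series; it assumes a home in
kernel/sched/core.c where cpu_rq() is visible):

struct task_struct *try_get_task_struct_on_cpu(int cpu)
{
	/* Pin whatever task is currently running on that cpu, or
	 * return NULL if we race with it exiting. */
	return try_get_task_struct(&cpu_rq(cpu)->curr);
}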

Signed-off-by: Chris Metcalf <[email protected]>
---
include/linux/sched/task.h | 2 ++
kernel/exit.c | 13 +++++++++++++
2 files changed, 15 insertions(+)

diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 79a2a744648d..270ff76d43d9 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -96,6 +96,8 @@ static inline void put_task_struct(struct task_struct *t)
}

struct task_struct *task_rcu_dereference(struct task_struct **ptask);
+struct task_struct *try_get_task_struct(struct task_struct **ptask);
+

#ifdef CONFIG_ARCH_WANTS_DYNAMIC_TASK_STRUCT
extern int arch_task_struct_size __read_mostly;
diff --git a/kernel/exit.c b/kernel/exit.c
index f2cd53e92147..e2a3e7458d0f 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -318,6 +318,19 @@ void rcuwait_wake_up(struct rcuwait *w)
rcu_read_unlock();
}

+struct task_struct *try_get_task_struct(struct task_struct **ptask)
+{
+ struct task_struct *task;
+
+ rcu_read_lock();
+ task = task_rcu_dereference(ptask);
+ if (task)
+ get_task_struct(task);
+ rcu_read_unlock();
+
+ return task;
+}
+
/*
* Determine if a process group is "orphaned", according to the POSIX
* definition in 2.2.2.52. Orphaned process groups are not to be affected
--
2.1.2


2017-11-03 17:09:11

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH v16 11/13] arch/tile: enable task isolation functionality

We add the necessary call to task_isolation_start() in the
prepare_exit_to_usermode() routine. We already unconditionally
call into this routine if TIF_NOHZ is set, since that's where
we do the user_enter() call.

We add calls to task_isolation_interrupt() in places where exceptions
may not generate signals to the application.

Signed-off-by: Chris Metcalf <[email protected]>
---
arch/tile/Kconfig | 1 +
arch/tile/include/asm/thread_info.h | 2 ++
arch/tile/kernel/hardwall.c | 2 ++
arch/tile/kernel/irq.c | 3 +++
arch/tile/kernel/messaging.c | 4 ++++
arch/tile/kernel/process.c | 4 ++++
arch/tile/kernel/ptrace.c | 10 ++++++++++
arch/tile/kernel/single_step.c | 7 +++++++
arch/tile/kernel/smp.c | 21 +++++++++++----------
arch/tile/kernel/time.c | 2 ++
arch/tile/kernel/unaligned.c | 4 ++++
arch/tile/mm/fault.c | 13 ++++++++++++-
arch/tile/mm/homecache.c | 11 +++++++++++
13 files changed, 73 insertions(+), 11 deletions(-)

diff --git a/arch/tile/Kconfig b/arch/tile/Kconfig
index 4583c0320059..2d644138f2eb 100644
--- a/arch/tile/Kconfig
+++ b/arch/tile/Kconfig
@@ -16,6 +16,7 @@ config TILE
select GENERIC_STRNCPY_FROM_USER
select GENERIC_STRNLEN_USER
select HAVE_ARCH_SECCOMP_FILTER
+ select HAVE_ARCH_TASK_ISOLATION
select HAVE_ARCH_TRACEHOOK
select HAVE_CONTEXT_TRACKING
select HAVE_DEBUG_BUGVERBOSE
diff --git a/arch/tile/include/asm/thread_info.h b/arch/tile/include/asm/thread_info.h
index b7659b8f1117..3e298bd43d11 100644
--- a/arch/tile/include/asm/thread_info.h
+++ b/arch/tile/include/asm/thread_info.h
@@ -126,6 +126,7 @@ extern void _cpu_idle(void);
#define TIF_SYSCALL_TRACEPOINT 9 /* syscall tracepoint instrumentation */
#define TIF_POLLING_NRFLAG 10 /* idle is polling for TIF_NEED_RESCHED */
#define TIF_NOHZ 11 /* in adaptive nohz mode */
+#define TIF_TASK_ISOLATION 12 /* in task isolation mode */

#define _TIF_SIGPENDING (1<<TIF_SIGPENDING)
#define _TIF_NEED_RESCHED (1<<TIF_NEED_RESCHED)
@@ -139,6 +140,7 @@ extern void _cpu_idle(void);
#define _TIF_SYSCALL_TRACEPOINT (1<<TIF_SYSCALL_TRACEPOINT)
#define _TIF_POLLING_NRFLAG (1<<TIF_POLLING_NRFLAG)
#define _TIF_NOHZ (1<<TIF_NOHZ)
+#define _TIF_TASK_ISOLATION (1<<TIF_TASK_ISOLATION)

/* Work to do as we loop to exit to user space. */
#define _TIF_WORK_MASK \
diff --git a/arch/tile/kernel/hardwall.c b/arch/tile/kernel/hardwall.c
index 2fd1694ac1d0..9559f04d1c2a 100644
--- a/arch/tile/kernel/hardwall.c
+++ b/arch/tile/kernel/hardwall.c
@@ -23,6 +23,7 @@
#include <linux/smp.h>
#include <linux/cdev.h>
#include <linux/compat.h>
+#include <linux/isolation.h>
#include <asm/hardwall.h>
#include <asm/traps.h>
#include <asm/siginfo.h>
@@ -328,6 +329,7 @@ void __kprobes do_hardwall_trap(struct pt_regs* regs, int fault_num)
int found_processes;
struct pt_regs *old_regs = set_irq_regs(regs);

+ task_isolation_interrupt("hardwall trap");
irq_enter();

/* Figure out which network trapped. */
diff --git a/arch/tile/kernel/irq.c b/arch/tile/kernel/irq.c
index 22044fc691ef..0b1b24b9c496 100644
--- a/arch/tile/kernel/irq.c
+++ b/arch/tile/kernel/irq.c
@@ -18,6 +18,7 @@
#include <linux/irq.h>
#include <linux/kernel_stat.h>
#include <linux/uaccess.h>
+#include <linux/isolation.h>
#include <hv/drv_pcie_rc_intf.h>
#include <arch/spr_def.h>
#include <asm/traps.h>
@@ -100,6 +101,8 @@ void tile_dev_intr(struct pt_regs *regs, int intnum)

/* Track time spent here in an interrupt context. */
old_regs = set_irq_regs(regs);
+
+ task_isolation_interrupt("IPI: IRQ mask %#lx", remaining_irqs);
irq_enter();

#ifdef CONFIG_DEBUG_STACKOVERFLOW
diff --git a/arch/tile/kernel/messaging.c b/arch/tile/kernel/messaging.c
index 7475af3aacec..1cf1630215f0 100644
--- a/arch/tile/kernel/messaging.c
+++ b/arch/tile/kernel/messaging.c
@@ -16,6 +16,7 @@
#include <linux/smp.h>
#include <linux/hardirq.h>
#include <linux/ptrace.h>
+#include <linux/isolation.h>
#include <asm/hv_driver.h>
#include <asm/irq_regs.h>
#include <asm/traps.h>
@@ -86,6 +87,7 @@ void hv_message_intr(struct pt_regs *regs, int intnum)

tag = message[0];
#ifdef CONFIG_SMP
+ task_isolation_interrupt("SMP message %d", tag);
evaluate_message(message[0]);
#else
panic("Received IPI message %d in UP mode", tag);
@@ -94,6 +96,8 @@ void hv_message_intr(struct pt_regs *regs, int intnum)
HV_IntrMsg *him = (HV_IntrMsg *)message;
struct hv_driver_cb *cb =
(struct hv_driver_cb *)him->intarg;
+ task_isolation_interrupt("interrupt message %#lx(%#lx)",
+ him->intarg, him->intdata);
cb->callback(cb, him->intdata);
__this_cpu_inc(irq_stat.irq_hv_msg_count);
}
diff --git a/arch/tile/kernel/process.c b/arch/tile/kernel/process.c
index f0a0e18e4dfb..ac22e971dc1d 100644
--- a/arch/tile/kernel/process.c
+++ b/arch/tile/kernel/process.c
@@ -32,6 +32,7 @@
#include <linux/signal.h>
#include <linux/delay.h>
#include <linux/context_tracking.h>
+#include <linux/isolation.h>
#include <asm/stack.h>
#include <asm/switch_to.h>
#include <asm/homecache.h>
@@ -516,6 +517,9 @@ void prepare_exit_to_usermode(struct pt_regs *regs, u32 thread_info_flags)
#endif
}

+ if (thread_info_flags & _TIF_TASK_ISOLATION)
+ task_isolation_start();
+
user_enter();
}

diff --git a/arch/tile/kernel/ptrace.c b/arch/tile/kernel/ptrace.c
index e1a078e6828e..908d57d3d2cf 100644
--- a/arch/tile/kernel/ptrace.c
+++ b/arch/tile/kernel/ptrace.c
@@ -24,6 +24,7 @@
#include <linux/tracehook.h>
#include <linux/context_tracking.h>
#include <linux/sched/task_stack.h>
+#include <linux/isolation.h>

#include <asm/traps.h>
#include <arch/chip.h>
@@ -263,6 +264,15 @@ int do_syscall_trace_enter(struct pt_regs *regs)
return -1;
}

+ /*
+ * In task isolation mode, we may prevent the syscall from
+ * running, and if so we also deliver a signal to the process.
+ */
+ if (work & _TIF_TASK_ISOLATION) {
+ if (task_isolation_syscall(regs->regs[TREG_SYSCALL_NR]) == -1)
+ return -1;
+ }
+
if (secure_computing(NULL) == -1)
return -1;

diff --git a/arch/tile/kernel/single_step.c b/arch/tile/kernel/single_step.c
index de3eae813e52..4d5d4897e70c 100644
--- a/arch/tile/kernel/single_step.c
+++ b/arch/tile/kernel/single_step.c
@@ -23,6 +23,7 @@
#include <linux/types.h>
#include <linux/err.h>
#include <linux/prctl.h>
+#include <linux/isolation.h>
#include <asm/cacheflush.h>
#include <asm/traps.h>
#include <linux/uaccess.h>
@@ -320,6 +321,9 @@ void single_step_once(struct pt_regs *regs)
int size = 0, sign_ext = 0; /* happy compiler */
int align_ctl;

+ /* No signal was generated, but notify task-isolation tasks. */
+ task_isolation_interrupt("single step at %#lx", regs->pc);
+
align_ctl = unaligned_fixup;
switch (task_thread_info(current)->align_ctl) {
case PR_UNALIGN_NOPRINT:
@@ -767,6 +771,9 @@ void single_step_once(struct pt_regs *regs)
unsigned long *ss_pc = this_cpu_ptr(&ss_saved_pc);
unsigned long control = __insn_mfspr(SPR_SINGLE_STEP_CONTROL_K);

+ /* No signal was generated, but notify task-isolation tasks. */
+ task_isolation_interrupt("single step at %#lx", regs->pc);
+
*ss_pc = regs->pc;
control |= SPR_SINGLE_STEP_CONTROL_1__CANCELED_MASK;
control |= SPR_SINGLE_STEP_CONTROL_1__INHIBIT_MASK;
diff --git a/arch/tile/kernel/smp.c b/arch/tile/kernel/smp.c
index 94a62e1197ce..6b100faa33ef 100644
--- a/arch/tile/kernel/smp.c
+++ b/arch/tile/kernel/smp.c
@@ -20,6 +20,7 @@
#include <linux/irq.h>
#include <linux/irq_work.h>
#include <linux/module.h>
+#include <linux/isolation.h>
#include <asm/cacheflush.h>
#include <asm/homecache.h>

@@ -258,10 +259,8 @@ void __init ipi_init(void)

#if CHIP_HAS_IPI()

-void smp_send_reschedule(int cpu)
+static void __smp_send_reschedule(int cpu)
{
- WARN_ON(cpu_is_offline(cpu));
-
/*
* We just want to do an MMIO store. The traditional writeq()
* functions aren't really correct here, since they're always
@@ -273,15 +272,17 @@ void smp_send_reschedule(int cpu)

#else

-void smp_send_reschedule(int cpu)
+static void __smp_send_reschedule(int cpu)
{
- HV_Coord coord;
-
- WARN_ON(cpu_is_offline(cpu));
-
- coord.y = cpu_y(cpu);
- coord.x = cpu_x(cpu);
+ HV_Coord coord = { .y = cpu_y(cpu), .x = cpu_x(cpu) };
hv_trigger_ipi(coord, IRQ_RESCHEDULE);
}

#endif /* CHIP_HAS_IPI() */
+
+void smp_send_reschedule(int cpu)
+{
+ WARN_ON(cpu_is_offline(cpu));
+ task_isolation_remote(cpu, "reschedule IPI");
+ __smp_send_reschedule(cpu);
+}
diff --git a/arch/tile/kernel/time.c b/arch/tile/kernel/time.c
index 6643ffbc0615..f74f10d827fa 100644
--- a/arch/tile/kernel/time.c
+++ b/arch/tile/kernel/time.c
@@ -25,6 +25,7 @@
#include <linux/delay.h>
#include <linux/module.h>
#include <linux/timekeeper_internal.h>
+#include <linux/isolation.h>
#include <asm/irq_regs.h>
#include <asm/traps.h>
#include <asm/vdso.h>
@@ -196,6 +197,7 @@ void do_timer_interrupt(struct pt_regs *regs, int fault_num)
arch_local_irq_mask(INT_TILE_TIMER);

/* Track time spent here in an interrupt context */
+ task_isolation_interrupt("timer");
irq_enter();

/* Track interrupt count. */
diff --git a/arch/tile/kernel/unaligned.c b/arch/tile/kernel/unaligned.c
index 8149c38f67b6..ee14a61d0eee 100644
--- a/arch/tile/kernel/unaligned.c
+++ b/arch/tile/kernel/unaligned.c
@@ -27,6 +27,7 @@
#include <linux/extable.h>
#include <linux/compat.h>
#include <linux/prctl.h>
+#include <linux/isolation.h>
#include <asm/cacheflush.h>
#include <asm/traps.h>
#include <linux/uaccess.h>
@@ -1547,6 +1548,9 @@ void do_unaligned(struct pt_regs *regs, int vecnum)
return;
}

+ /* No signal was generated, but notify task-isolation tasks. */
+ task_isolation_interrupt("unaligned access at %#lx", regs->pc);
+
if (!info->unalign_jit_base) {
void __user *user_page;

diff --git a/arch/tile/mm/fault.c b/arch/tile/mm/fault.c
index f58fa06a2214..0a7a731d7413 100644
--- a/arch/tile/mm/fault.c
+++ b/arch/tile/mm/fault.c
@@ -38,6 +38,7 @@
#include <linux/syscalls.h>
#include <linux/uaccess.h>
#include <linux/kdebug.h>
+#include <linux/isolation.h>

#include <asm/pgalloc.h>
#include <asm/sections.h>
@@ -311,8 +312,13 @@ static int handle_page_fault(struct pt_regs *regs,
*/
pgd = get_current_pgd();
if (handle_migrating_pte(pgd, fault_num, address, regs->pc,
- is_kernel_mode, write))
+ is_kernel_mode, write)) {
+ /* No signal was generated, but notify task-isolation tasks. */
+ if (!is_kernel_mode)
+ task_isolation_interrupt("migration fault at %#lx",
+ address);
return 1;
+ }

si_code = SEGV_MAPERR;

@@ -482,6 +488,11 @@ static int handle_page_fault(struct pt_regs *regs,
#endif

up_read(&mm->mmap_sem);
+
+ /* No signal was generated, but notify task-isolation tasks. */
+ if (flags & FAULT_FLAG_USER)
+ task_isolation_interrupt("page fault at %#lx", address);
+
return 1;

/*
diff --git a/arch/tile/mm/homecache.c b/arch/tile/mm/homecache.c
index b51cc28acd0a..82bcffb26ec4 100644
--- a/arch/tile/mm/homecache.c
+++ b/arch/tile/mm/homecache.c
@@ -31,6 +31,7 @@
#include <linux/smp.h>
#include <linux/module.h>
#include <linux/hugetlb.h>
+#include <linux/isolation.h>

#include <asm/page.h>
#include <asm/sections.h>
@@ -85,6 +86,16 @@ static void hv_flush_update(const struct cpumask *cache_cpumask,
*/
for_each_cpu(cpu, &mask)
++per_cpu(irq_stat, cpu).irq_hv_flush_count;
+
+ /*
+ * Send an extra Linux-level interrupt here to task isolation tasks
+ * so that they can generate the appropriate signal to kill
+ * themselves. If we didn't send a Linux interrupt, we would
+ * interrupt the task into the hypervisor but it wouldn't get a
+ * signal to advise it of the loss of the cpu.
+ */
+ task_isolation_remote_cpumask_interrupt(&mask,
+ "remote cache/TLB flush");
}

/*
--
2.1.2


2017-11-03 17:09:25

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH v16 13/13] task_isolation self test

This commit adds a selftest that exercises the task isolation prctl()
interface, validates the signal behavior for isolation violations, and
measures jitter on an isolated core.
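
In sketch form, the userspace interface the tests exercise looks like
this (illustrative only; the PR_TASK_ISOLATION constants are defined in
the test source below when system headers do not yet provide them, and
the remaining prctl() arguments are passed as zero, matching the test's
prctl_isolation() wrapper):

#include <sys/prctl.h>
#include <signal.h>

static int enable_isolation(void)
{
	/* Ask for SIGUSR1 rather than the default SIGKILL when
	 * isolation is violated. On failure, errno is EINVAL for a
	 * bad cpu/affinity setup, or EAGAIN if the core could not
	 * be quiesced yet. */
	return prctl(PR_TASK_ISOLATION,
		     PR_TASK_ISOLATION_ENABLE |
		     PR_TASK_ISOLATION_SET_SIG(SIGUSR1),
		     0, 0, 0);
}

Isolation is turned off again with prctl(PR_TASK_ISOLATION, 0, 0, 0, 0),
as the tests below do via prctl_isolation(0).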

Signed-off-by: Chris Metcalf <[email protected]>
---
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/task_isolation/Makefile | 6 +
tools/testing/selftests/task_isolation/config | 1 +
tools/testing/selftests/task_isolation/isolation.c | 643 +++++++++++++++++++++
4 files changed, 651 insertions(+)
create mode 100644 tools/testing/selftests/task_isolation/Makefile
create mode 100644 tools/testing/selftests/task_isolation/config
create mode 100644 tools/testing/selftests/task_isolation/isolation.c

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index ff805643b5f7..ab781b99d3c9 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -30,6 +30,7 @@ TARGETS += splice
TARGETS += static_keys
TARGETS += sync
TARGETS += sysctl
+TARGETS += task_isolation
ifneq (1, $(quicktest))
TARGETS += timers
endif
diff --git a/tools/testing/selftests/task_isolation/Makefile b/tools/testing/selftests/task_isolation/Makefile
new file mode 100644
index 000000000000..74d060b493f9
--- /dev/null
+++ b/tools/testing/selftests/task_isolation/Makefile
@@ -0,0 +1,6 @@
+CFLAGS += -O2 -g -W -Wall
+LDFLAGS += -pthread
+
+TEST_GEN_PROGS := isolation
+
+include ../lib.mk
diff --git a/tools/testing/selftests/task_isolation/config b/tools/testing/selftests/task_isolation/config
new file mode 100644
index 000000000000..34edfbca0423
--- /dev/null
+++ b/tools/testing/selftests/task_isolation/config
@@ -0,0 +1 @@
+CONFIG_TASK_ISOLATION=y
diff --git a/tools/testing/selftests/task_isolation/isolation.c b/tools/testing/selftests/task_isolation/isolation.c
new file mode 100644
index 000000000000..9c0b49619b40
--- /dev/null
+++ b/tools/testing/selftests/task_isolation/isolation.c
@@ -0,0 +1,643 @@
+/*
+ * This test program tests the features of task isolation.
+ *
+ * - Makes sure enabling task isolation fails if you are unaffinitized
+ * or on a non-task-isolation cpu.
+ *
+ * - Validates that various exceptions and interrupts are fatal in isolation
+ * mode:
+ *
+ * * Page fault
+ * * System call
+ * * TLB invalidation from another thread [1]
+ * * Unaligned access [2]
+ *
+ * - Tests that taking a user-defined signal for the above faults works.
+ *
+ * - Tests that you can prctl(PR_TASK_ISOLATION, 0) to turn isolation off.
+ *
+ * - Tests that receiving a signal turns isolation off.
+ *
+ * - Tests that having another process schedule into the core where the
+ * isolation process is running correctly kills the isolation process.
+ *
+ * [1] TLB invalidations do not cause IPIs on some platforms, e.g. arm64
+ * [2] Unaligned access only causes exceptions on some platforms, e.g. tile
+ *
+ *
+ * You must be running under a kernel configured with TASK_ISOLATION.
+ *
+ * You must have booted with e.g. "nohz_full=1-15 isolcpus=1-15" to
+ * enable some task-isolation cores. If you get interrupt reports, you
+ * can also add the boot argument "task_isolation_debug" to learn more.
+ * If you get jitter but no reports, define DEBUG_TASK_ISOLATION to add
+ * isolation checks in every user_exit() call.
+ *
+ * NOTE: you must disable the code in tick_nohz_stop_sched_tick()
+ * that limits the tick delta to the maximum scheduler deferment
+ * by making it conditional not just on "!ts->inidle" but also
+ * on !current->task_isolation_flags. This is around line 756
+ * in kernel/time/tick-sched.c (as of kernel 4.14).
+ *
+ *
+ * To compile the test program, run "make".
+ *
+ * Run the program as "./isolation" and if you want to run the
+ * jitter-detection loop for longer than 10 giga-cycles, specify the
+ * number of giga-cycles to run it for as a command-line argument.
+ */
+
+#define _GNU_SOURCE
+#include <stdio.h>
+#include <stdbool.h>
+#include <unistd.h>
+#include <stdlib.h>
+#include <fcntl.h>
+#include <assert.h>
+#include <string.h>
+#include <errno.h>
+#include <sched.h>
+#include <pthread.h>
+#include <sys/wait.h>
+#include <sys/mman.h>
+#include <sys/time.h>
+#include <sys/prctl.h>
+#include <sys/types.h>
+#include "../kselftest.h"
+
+#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))
+#define READ_ONCE(x) (*(volatile typeof(x) *)&(x))
+#define WRITE_ONCE(x, val) (*(volatile typeof(x) *)&(x) = (val))
+
+#ifndef PR_TASK_ISOLATION /* Not in system headers yet? */
+# define PR_TASK_ISOLATION 48
+# define PR_TASK_ISOLATION_ENABLE (1 << 0)
+# define PR_TASK_ISOLATION_SET_SIG(sig) (((sig) & 0x7f) << 8)
+# define PR_TASK_ISOLATION_GET_SIG(bits) (((bits) >> 8) & 0x7f)
+#endif
+
+/* The cpu we are using for isolation tests. */
+static int task_isolation_cpu;
+
+/* Overall status, maintained as tests run. */
+static int exit_status = KSFT_PASS;
+
+/* Data shared between parent and children. */
+static struct {
+ /* Set to true when the parent's isolation prctl is successful. */
+ bool parent_isolated;
+} *shared;
+
+/* Set affinity to a single cpu or die if trying to do so fails. */
+static void set_my_cpu(int cpu)
+{
+ cpu_set_t set;
+ int rc;
+
+ CPU_ZERO(&set);
+ CPU_SET(cpu, &set);
+ rc = sched_setaffinity(0, sizeof(cpu_set_t), &set);
+ assert(rc == 0);
+}
+
+#define timeout_init(tv) gettimeofday((tv), NULL)
+
+static int timeout(struct timeval *start, double secs)
+{
+ struct timeval tv;
+ double time;
+
+ gettimeofday(&tv, NULL);
+ time = (tv.tv_sec - start->tv_sec) +
+ (tv.tv_usec - start->tv_usec) / 1000000.0;
+ return time >= secs;
+}
+
+static inline int prctl_isolation(int flags)
+{
+ return prctl(PR_TASK_ISOLATION, flags, 0, 0, 0);
+}
+
+static void set_task_isolation(int flags)
+{
+ int rc;
+ struct timeval start;
+
+ /* Wait for up to a second for the kernel to isolate this core. */
+ timeout_init(&start);
+ do {
+ rc = prctl_isolation(flags);
+ if (rc == 0 || errno != EAGAIN)
+ break;
+ } while (!timeout(&start, 1));
+ if (rc != 0) {
+ prctl_isolation(0);
+ printf("couldn't enable isolation (%d): FAIL\n", errno);
+ ksft_exit_fail();
+ }
+}
+
+/*
+ * Run a child process in task isolation mode and report its status.
+ * The child does mlockall() and moves itself to the task isolation cpu.
+ * It then runs SETUP_FUNC (if specified), calls prctl(PR_TASK_ISOLATION)
+ * with FLAGS (if non-zero), and then invokes TEST_FUNC and exits
+ * with its status.
+ */
+static int run_test(void (*setup_func)(), int (*test_func)(), int flags)
+{
+ int pid, rc, status;
+
+ fflush(stdout);
+ pid = fork();
+ assert(pid >= 0);
+ if (pid != 0) {
+ /* In parent; wait for child and return its status. */
+ waitpid(pid, &status, 0);
+ return status;
+ }
+
+ /* In child. */
+ rc = mlockall(MCL_CURRENT);
+ assert(rc == 0);
+ set_my_cpu(task_isolation_cpu);
+ if (setup_func)
+ setup_func();
+ if (flags)
+ set_task_isolation(flags);
+ rc = test_func();
+ exit(rc);
+}
+
+/* Run a test and make sure it exits with success. */
+static void test_ok(const char *testname, void (*setup_func)(),
+ int (*test_func)())
+{
+ int status;
+
+ status = run_test(setup_func, test_func, PR_TASK_ISOLATION_ENABLE);
+ if (status == KSFT_PASS) {
+ printf("%s: OK\n", testname);
+ } else {
+ printf("%s: FAIL (%#x)\n", testname, status);
+ exit_status = KSFT_FAIL;
+ }
+}
+
+/*
+ * Run a test and ensure it is killed with SIGKILL by default,
+ * for whatever misdemeanor is committed in TEST_FUNC.
+ * Also test it with SIGUSR1 to make sure that works.
+ */
+static void test_killed(const char *testname, void (*setup_func)(),
+ int (*test_func)())
+{
+ int status;
+
+ status = run_test(setup_func, test_func, PR_TASK_ISOLATION_ENABLE);
+ if (WIFSIGNALED(status) && WTERMSIG(status) == SIGKILL) {
+ printf("%s: OK\n", testname);
+ } else {
+ printf("%s: FAIL (%#x)\n", testname, status);
+ exit_status = KSFT_FAIL;
+ }
+
+ status = run_test(setup_func, test_func,
+ PR_TASK_ISOLATION_ENABLE |
+ PR_TASK_ISOLATION_SET_SIG(SIGUSR1));
+ if (WIFSIGNALED(status) && WTERMSIG(status) == SIGUSR1) {
+ printf("%s (SIGUSR1): OK\n", testname);
+ } else {
+ printf("%s (SIGUSR1): FAIL (%#x)\n", testname, status);
+ exit_status = KSFT_FAIL;
+ }
+}
+
+/* Mapping address passed from setup function to test function. */
+static char *fault_file_mapping;
+
+/* mmap() a file in so we can test touching an unmapped page. */
+static void setup_fault(void)
+{
+ char fault_file[] = "/tmp/isolation_XXXXXX";
+ int fd, rc;
+
+ fd = mkstemp(fault_file);
+ assert(fd >= 0);
+ rc = ftruncate(fd, getpagesize());
+ assert(rc == 0);
+ fault_file_mapping = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
+ MAP_SHARED, fd, 0);
+ assert(fault_file_mapping != MAP_FAILED);
+ close(fd);
+ unlink(fault_file);
+}
+
+/* Now touch the unmapped page (and be killed). */
+static int do_fault(void)
+{
+ *fault_file_mapping = 1;
+ return KSFT_FAIL;
+}
+
+/* Make a syscall (and be killed). */
+static int do_syscall(void)
+{
+ static const char *msg = "goodbye, world\n";
+ int rc;
+
+ rc = write(STDOUT_FILENO, msg, strlen(msg));
+ assert(rc == (int)strlen(msg));
+ return KSFT_FAIL;
+}
+
+/* Turn isolation back off and don't be killed. */
+static int do_syscall_off(void)
+{
+ const char *msg = "==> hello, world\n";
+ int rc;
+
+ prctl_isolation(0);
+ rc = write(STDOUT_FILENO, msg, strlen(msg));
+ assert(rc == (int)strlen(msg));
+ return KSFT_PASS;
+}
+
+/* Test that isolation is off in signal handlers. */
+static void segv_handler(int sig)
+{
+ printf("Received signal %d successfully\n", sig);
+ exit(0);
+}
+
+static void setup_segv(void)
+{
+ signal(SIGSEGV, segv_handler);
+}
+
+static int do_segv(void)
+{
+ *(volatile int *)0 = 0;
+ return KSFT_FAIL; /* should not get here */
+}
+
+/* ARM64 uses tlbi instructions so doesn't need to interrupt the remote core. */
+#ifndef __aarch64__
+#define TEST_MUNMAP
+
+/*
+ * Fork a thread that will munmap() after a short while.
+ * It will deliver a TLB flush to the task isolation core.
+ */
+
+static void *start_munmap(void *p)
+{
+ usleep(500000); /* 0.5s */
+ munmap(p, getpagesize());
+ return 0;
+}
+
+static void setup_munmap(void)
+{
+ pthread_t thr;
+ void *p;
+ int rc;
+
+ /* First, go back to cpu 0 and allocate some memory. */
+ set_my_cpu(0);
+ p = mmap(0, getpagesize(), PROT_READ|PROT_WRITE,
+ MAP_ANONYMOUS|MAP_POPULATE|MAP_PRIVATE, 0, 0);
+ assert(p != MAP_FAILED);
+
+ /*
+ * Now fire up a thread that will wait half a second on cpu 0
+ * and then munmap the mapping.
+ */
+ rc = pthread_create(&thr, NULL, start_munmap, p);
+ assert(rc == 0);
+
+ /* Back to the task-isolation cpu. */
+ set_my_cpu(task_isolation_cpu);
+}
+
+/* Global variable to avoid the compiler outsmarting us. */
+int munmap_spin;
+
+static int do_munmap(void)
+{
+ while (munmap_spin < 1000000000)
+ WRITE_ONCE(munmap_spin, munmap_spin + 1);
+ return KSFT_FAIL;
+}
+#endif
+
+#ifdef __tilegx__
+#define TEST_UNALIGNED
+
+/*
+ * Make an unaligned access (and be killed).
+ * Only for tilegx, since other platforms don't do in-kernel fixups.
+ */
+static int do_unaligned(void)
+{
+ static int buf[2];
+ int *addr = (int *)((char *)buf + 1);
+
+ READ_ONCE(*addr);
+
+ asm("nop");
+ return KSFT_FAIL;
+}
+#endif
+
+/*
+ * Test to make sure that if a process is scheduled to run on the same core
+ * as a task isolation process, the task isolation process will be signalled.
+ */
+
+static void setup_schedule(void)
+{
+ struct timeval start;
+ int rc, child_pid;
+
+ /*
+ * First, go back to cpu 0 to ensure that the child we create here
+ * doesn't race with task isolation by the parent in do_schedule().
+ */
+ set_my_cpu(0);
+
+ /* Fork and fault in all memory in both. */
+ child_pid = fork();
+ assert(child_pid >= 0);
+ rc = mlockall(MCL_CURRENT);
+ assert(rc == 0);
+ if (child_pid != 0) {
+ /* Send parent back to the task isolation cpu. */
+ set_my_cpu(task_isolation_cpu);
+ return;
+ }
+
+ /*
+ * In child. Wait until parent notifies us that it has completed
+ * its prctl, then reschedule onto its cpu.
+ */
+ timeout_init(&start);
+ while (!shared->parent_isolated) {
+ if (timeout(&start, 1)) {
+ printf("child: no parent post-prctl\n");
+ exit(1);
+ }
+ }
+ set_my_cpu(task_isolation_cpu);
+
+ /*
+ * We are now running on the task isolation cpu, which should have
+ * killed the parent by forcing the OS to run the scheduler. No
+ * need to run any code beyond just invoking exit().
+ */
+ exit(0);
+}
+
+static int do_schedule(void)
+{
+ struct timeval start;
+
+ /* Notify the child to switch to our cpu. */
+ shared->parent_isolated = true;
+
+ /*
+ * Wait for child to come disturb us. Note that we require
+ * gettimeofday() to run via vDSO for this waiting idiom to work;
+ * if it were a syscall we would get signalled just for calling it.
+ */
+ timeout_init(&start);
+ while (!timeout(&start, 5))
+ ;
+ printf("parent: no interrupt from scheduler\n");
+ exit(1);
+}
+
+#ifdef __tile__
+#include <arch/spr_def.h>
+#endif
+
+static inline u_int64_t get_cycle_count(void)
+{
+#ifdef __x86_64__
+ unsigned int lower, upper;
+
+ asm volatile("rdtsc" : "=a"(lower), "=d"(upper));
+ return lower | ((unsigned long)upper << 32);
+#elif defined(__tile__)
+ return __insn_mfspr(SPR_CYCLE);
+#elif defined(__aarch64__)
+ unsigned long vtick;
+
+ asm volatile("mrs %0, cntvct_el0" : "=r" (vtick));
+ return vtick;
+#elif defined(__arm__)
+ u_int64_t cval;
+
+ asm volatile("mrrc p15, 1, %Q0, %R0, c14" : "=r" (cval));
+ return cval;
+#else
+#error Unsupported architecture
+#endif
+}
+
+/*
+ * Histogram of cycle counts up to HISTSIZE cycles. Any loop that takes
+ * more than this number of cycles is considered an error, since we likely
+ * entered the kernel (rather than just, say, having to refetch caches
+ * lines or equivalent).
+ */
+#define HISTSIZE 600
+static long hist[HISTSIZE];
+
+/* Information on loss of control of the cpu (more than HISTSIZE cycles). */
+struct jitter_info {
+ unsigned long at; /* cycle of jitter event */
+ long cycles; /* how long we lost the cpu for */
+};
+#define MAX_EVENTS 100
+static struct jitter_info jitter[MAX_EVENTS];
+static unsigned int count; /* index into jitter[] */
+
+static void jitter_summarize(void)
+{
+ unsigned int i;
+
+ printf("INFO: loop times:\n");
+ for (i = 0; i < HISTSIZE; ++i)
+ if (hist[i])
+ printf(" %d cycles (count: %ld)\n", i, hist[i]);
+
+ if (count)
+ printf("ERROR: jitter:\n");
+ for (i = 0; i < count; ++i)
+ printf(" %ld: %ld cycles\n", jitter[i].at, jitter[i].cycles);
+ if (count == ARRAY_SIZE(jitter))
+ printf(" ... more\n");
+}
+
+/*
+ * A DEBUG_TASK_ISOLATION kernel will issue a SIGUSR1 right after this
+ * variable is set, during the prctl_isolation(0). Handle that case
+ * by checking the variable and basically ignoring the signal.
+ */
+static bool jitter_test_complete;
+
+static void jitter_handler(int sig)
+{
+ if (jitter_test_complete)
+ return;
+
+ if (sig == SIGUSR1) {
+ exit_status = KSFT_FAIL;
+ printf("ERROR: Program unexpectedly entered kernel.\n");
+ }
+ jitter_summarize();
+ exit(exit_status);
+}
+
+static void test_jitter(unsigned long waitticks)
+{
+ u_int64_t start, last, elapsed;
+ int rc;
+
+ printf("testing task isolation jitter for %ld ticks\n", waitticks);
+
+ signal(SIGINT, jitter_handler);
+ signal(SIGUSR1, jitter_handler);
+ set_my_cpu(task_isolation_cpu);
+ rc = mlockall(MCL_CURRENT);
+ assert(rc == 0);
+
+ set_task_isolation(PR_TASK_ISOLATION_ENABLE |
+ PR_TASK_ISOLATION_SET_SIG(SIGUSR1));
+
+ last = start = get_cycle_count();
+ do {
+ u_int64_t next = get_cycle_count();
+ u_int64_t delta = next - last;
+
+ elapsed = next - start;
+ if (__builtin_expect(delta > HISTSIZE, 0)) {
+ exit_status = KSFT_FAIL;
+ if (count < ARRAY_SIZE(jitter)) {
+ jitter[count].cycles = delta;
+ jitter[count].at = elapsed;
+ WRITE_ONCE(count, count + 1);
+ }
+ } else {
+ hist[delta]++;
+ }
+ last = next;
+
+ } while (elapsed < waitticks);
+
+ jitter_test_complete = true;
+ prctl_isolation(0);
+ jitter_summarize();
+}
+
+int main(int argc, char **argv)
+{
+ /* How many billion ticks to wait after running the other tests? */
+ unsigned long waitticks;
+ char buf[100];
+ char *result, *end;
+ FILE *f;
+
+ if (argc == 1)
+ waitticks = 10;
+ else if (argc == 2)
+ waitticks = strtol(argv[1], NULL, 10);
+ else {
+ printf("syntax: isolation [gigaticks]\n");
+ ksft_exit_fail();
+ }
+ waitticks *= 1000000000;
+
+ /* Get a core from the /sys nohz_full device. */
+ f = fopen("/sys/devices/system/cpu/nohz_full", "r");
+ if (f == NULL)
+ ksft_exit_skip("/sys nohz_full: SKIP (%s)\n", strerror(errno));
+ result = fgets(buf, sizeof(buf), f);
+ assert(result == buf);
+ fclose(f);
+ if (*buf == '\n')
+ ksft_exit_skip("No nohz_full cores configured.\n");
+ task_isolation_cpu = strtol(buf, &end, 10);
+ assert(end != buf);
+ assert(*end == ',' || *end == '-' || *end == '\n');
+ assert(task_isolation_cpu >= 0);
+
+ /* Make sure it matches the first core from the /sys isolated device. */
+ f = fopen("/sys/devices/system/cpu/isolated", "r");
+ if (f == NULL)
+ ksft_exit_skip("/sys isolated: SKIP (%s)\n", strerror(errno));
+ result = fgets(buf, sizeof(buf), f);
+ assert(result == buf);
+ fclose(f);
+ if (*buf == '\n')
+ ksft_exit_skip("No isolated cores configured.\n");
+ if (task_isolation_cpu != strtol(buf, &end, 10))
+ ksft_exit_skip("Isolated and nohz_full cores don't match.\n");
+ assert(end != buf);
+ assert(*end == ',' || *end == '-' || *end == '\n');
+
+ printf("/sys devices: OK (using task isolation cpu %d)\n",
+ task_isolation_cpu);
+
+ /* Test to see if with no mask set, we fail. */
+ if (prctl_isolation(PR_TASK_ISOLATION_ENABLE) == 0 ||
+ errno != EINVAL) {
+ printf("prctl unaffinitized: FAIL\n");
+ exit_status = KSFT_FAIL;
+ } else {
+ printf("prctl unaffinitized: OK\n");
+ }
+
+ /* Or if affinitized to the wrong cpu. */
+ set_my_cpu(0);
+ if (prctl_isolation(PR_TASK_ISOLATION_ENABLE) == 0 ||
+ errno != EINVAL) {
+ printf("prctl on cpu 0: FAIL\n");
+ exit_status = KSFT_FAIL;
+ } else {
+ printf("prctl on cpu 0: OK\n");
+ }
+
+ /* Allocate some memory to be used for parent/child communication. */
+ shared = mmap(0, getpagesize(), PROT_READ|PROT_WRITE,
+ MAP_ANONYMOUS|MAP_SHARED, 0, 0);
+ assert(shared != MAP_FAILED);
+
+ /* Run the positive tests. */
+#ifndef DEBUG_TASK_ISOLATION
+ test_ok("test_off", NULL, do_syscall_off);
+ test_ok("test_segv", setup_segv, do_segv);
+#endif
+
+ /* Run the negative tests. */
+ test_killed("test_fault", setup_fault, do_fault);
+ test_killed("test_syscall", NULL, do_syscall);
+#ifdef TEST_MUNMAP
+ test_killed("test_munmap", setup_munmap, do_munmap);
+#endif
+#ifdef TEST_UNALIGNED
+ test_killed("test_unaligned", NULL, do_unaligned);
+#endif
+ test_killed("test_schedule", setup_schedule, do_schedule);
+
+ /* Exit failure if any test failed. */
+ if (exit_status != KSFT_PASS) {
+ printf("Skipping jitter testing due to test failures\n");
+ return exit_status;
+ }
+
+ test_jitter(waitticks);
+
+ return exit_status;
+}
--
2.1.2


2017-11-03 17:07:02

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH v16 02/13] vmstat: add vmstat_idle function

This function checks that no vmstat worker is running on the core
and that the vmstat diffs do not require an update. It is called
from the task-isolation code to decide whether we actually need to
do some work to quiet vmstat.
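
A sketch of the intended caller pattern (illustrative only; the actual
caller is the task-isolation core elsewhere in this series):

	/* Quiesce vmstat on this core only if it is not already idle. */
	if (!vmstat_idle())
		quiet_vmstat_sync();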

Acked-by: Christoph Lameter <[email protected]>
Signed-off-by: Chris Metcalf <[email protected]>
---
include/linux/vmstat.h | 2 ++
mm/vmstat.c | 10 ++++++++++
2 files changed, 12 insertions(+)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index e0b504594593..80212a952448 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -265,6 +265,7 @@ extern void __dec_node_state(struct pglist_data *, enum node_stat_item);

void quiet_vmstat(void);
void quiet_vmstat_sync(void);
+bool vmstat_idle(void);
void cpu_vm_stats_fold(int cpu);
void refresh_zone_stat_thresholds(void);

@@ -368,6 +369,7 @@ static inline void refresh_zone_stat_thresholds(void) { }
static inline void cpu_vm_stats_fold(int cpu) { }
static inline void quiet_vmstat(void) { }
static inline void quiet_vmstat_sync(void) { }
+static inline bool vmstat_idle(void) { return true; }

static inline void drain_zonestat(struct zone *zone,
struct per_cpu_pageset *pset) { }
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 8ad1b84ca9cf..8b13a6ca494c 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1846,6 +1846,16 @@ void quiet_vmstat_sync(void)
}

/*
+ * Report on whether vmstat processing is quiesced on the core currently:
+ * no vmstat worker running and no vmstat updates to perform.
+ */
+bool vmstat_idle(void)
+{
+ return !delayed_work_pending(this_cpu_ptr(&vmstat_work)) &&
+ !need_update(smp_processor_id());
+}
+
+/*
* Shepherd worker thread that checks the
* differentials of processors that have their worker
* threads for vm statistics updates disabled because of
--
2.1.2


2017-11-03 17:11:33

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH v16 07/13] Add task isolation hooks to arch-independent code

This commit adds task isolation hooks as follows:

- __handle_domain_irq() generates an isolation warning for the
local task

- irq_work_queue_on() generates an isolation warning for the remote
task being interrupted for irq_work

- generic_exec_single() generates a remote isolation warning for
the remote cpu being IPI'd

- smp_call_function_many() generates a remote isolation warning for
the set of remote cpus being IPI'd

Calls to task_isolation_remote() or task_isolation_interrupt() can
be placed in the platform-independent code like this when doing so
results in fewer lines of code changes, as for example is true of
the users of the arch_send_call_function_*() APIs. Or, they can
be placed in the per-architecture code when there are many callers,
as for example is true of the smp_send_reschedule() call.

A further cleanup might be to create an intermediate layer, so that
for example smp_send_reschedule() is a single generic function that
just calls arch_smp_send_reschedule(), allowing generic code to be
called every time smp_send_reschedule() is invoked. But for now,
we just update either callers or callees as makes most sense.
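
To make that suggested cleanup concrete, it would look roughly like
this (a sketch only; arch_smp_send_reschedule() is hypothetical and
not part of this series):

void smp_send_reschedule(int cpu)
{
	/* Generic layer: warn any isolated task on the target cpu. */
	task_isolation_remote(cpu, "reschedule IPI");

	/* Per-architecture primitive that just sends the IPI. */
	arch_smp_send_reschedule(cpu);
}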

Signed-off-by: Chris Metcalf <[email protected]>
---
kernel/irq/irqdesc.c | 5 +++++
kernel/irq_work.c | 5 ++++-
kernel/smp.c | 6 +++++-
3 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
index 82afb7ed369f..1b114c6b7ab8 100644
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -16,6 +16,7 @@
#include <linux/bitmap.h>
#include <linux/irqdomain.h>
#include <linux/sysfs.h>
+#include <linux/isolation.h>

#include "internals.h"

@@ -633,6 +634,10 @@ int __handle_domain_irq(struct irq_domain *domain, unsigned int hwirq,
irq = irq_find_mapping(domain, hwirq);
#endif

+ task_isolation_interrupt((irq == hwirq) ?
+ "irq %d (%s)" : "irq %d (%s hwirq %d)",
+ irq, domain ? domain->name : "", hwirq);
+
/*
* Some hardware gives randomly wrong interrupts. Rather
* than crashing, do something sensible.
diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index bcf107ce0854..cde49f1f31f7 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -17,6 +17,7 @@
#include <linux/cpu.h>
#include <linux/notifier.h>
#include <linux/smp.h>
+#include <linux/isolation.h>
#include <asm/processor.h>


@@ -75,8 +76,10 @@ bool irq_work_queue_on(struct irq_work *work, int cpu)
if (!irq_work_claim(work))
return false;

- if (llist_add(&work->llnode, &per_cpu(raised_list, cpu)))
+ if (llist_add(&work->llnode, &per_cpu(raised_list, cpu))) {
+ task_isolation_remote(cpu, "irq_work");
arch_send_call_function_single_ipi(cpu);
+ }

return true;
}
diff --git a/kernel/smp.c b/kernel/smp.c
index c94dd85c8d41..44252aa650ac 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -19,6 +19,7 @@
#include <linux/sched.h>
#include <linux/sched/idle.h>
#include <linux/hypervisor.h>
+#include <linux/isolation.h>

#include "smpboot.h"

@@ -175,8 +176,10 @@ static int generic_exec_single(int cpu, call_single_data_t *csd,
* locking and barrier primitives. Generic code isn't really
* equipped to do the right thing...
*/
- if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
+ if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu))) {
+ task_isolation_remote(cpu, "IPI function");
arch_send_call_function_single_ipi(cpu);
+ }

return 0;
}
@@ -458,6 +461,7 @@ void smp_call_function_many(const struct cpumask *mask,
}

/* Send a message to all CPUs in the map */
+ task_isolation_remote_cpumask(cfd->cpumask_ipi, "IPI function");
arch_send_call_function_ipi_mask(cfd->cpumask_ipi);

if (wait) {
--
2.1.2


2017-11-03 17:06:40

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH v16 01/13] vmstat: add quiet_vmstat_sync function

In commit f01f17d3705b ("mm, vmstat: make quiet_vmstat lighter")
the quiet_vmstat() function became asynchronous, in the sense that
the vmstat work was still scheduled to run on the core when the
function returned. For task isolation, we need a synchronous
version of the function that guarantees that the vmstat worker
will not run on the core on return from the function. Add a
quiet_vmstat_sync() function with that semantic.
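
The contrast in sketch form (illustrative only):

	quiet_vmstat();		/* asynchronous: the vmstat worker may
				 * still run once more on this core */
	quiet_vmstat_sync();	/* synchronous: the worker is canceled
				 * and will not run here on return */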

Signed-off-by: Chris Metcalf <[email protected]>
---
include/linux/vmstat.h | 2 ++
mm/vmstat.c | 9 +++++++++
2 files changed, 11 insertions(+)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index ade7cb5f1359..e0b504594593 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -264,6 +264,7 @@ extern void __dec_zone_state(struct zone *, enum zone_stat_item);
extern void __dec_node_state(struct pglist_data *, enum node_stat_item);

void quiet_vmstat(void);
+void quiet_vmstat_sync(void);
void cpu_vm_stats_fold(int cpu);
void refresh_zone_stat_thresholds(void);

@@ -366,6 +367,7 @@ static inline void __dec_node_page_state(struct page *page,
static inline void refresh_zone_stat_thresholds(void) { }
static inline void cpu_vm_stats_fold(int cpu) { }
static inline void quiet_vmstat(void) { }
+static inline void quiet_vmstat_sync(void) { }

static inline void drain_zonestat(struct zone *zone,
struct per_cpu_pageset *pset) { }
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 4bb13e72ac97..8ad1b84ca9cf 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1837,6 +1837,15 @@ void quiet_vmstat(void)
}

/*
+ * Synchronously quiet vmstat so the work is guaranteed not to run on return.
+ */
+void quiet_vmstat_sync(void)
+{
+ cancel_delayed_work_sync(this_cpu_ptr(&vmstat_work));
+ refresh_cpu_vm_stats(false);
+}
+
+/*
* Shepherd worker thread that checks the
* differentials of processors that have their worker
* threads for vm statistics updates disabled because of
--
2.1.2


2017-11-03 17:10:53

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH v16 09/13] arch/arm64: enable task isolation functionality

In do_notify_resume(), call task_isolation_start() for
TIF_TASK_ISOLATION tasks. Add _TIF_TASK_ISOLATION to _TIF_WORK_MASK,
and define a local NOTIFY_RESUME_LOOP_FLAGS to check in the loop,
since we don't clear _TIF_TASK_ISOLATION in the loop.

We tweak syscall_trace_enter() slightly to read the "flags" value
from current_thread_info()->flags once and test the cached copy,
rather than doing a volatile read from memory for each test. This
avoids a small overhead per test, and in particular avoids that
overhead for TIF_NOHZ when TASK_ISOLATION is not enabled.

We instrument the smp_send_reschedule() routine so that it checks for
isolated tasks and generates a suitable warning if needed.

Finally, report on page faults in task-isolation processes in
do_page_fault().

Signed-off-by: Chris Metcalf <[email protected]>
---
arch/arm64/Kconfig | 1 +
arch/arm64/include/asm/thread_info.h | 5 ++++-
arch/arm64/kernel/ptrace.c | 18 +++++++++++++++---
arch/arm64/kernel/signal.c | 10 +++++++++-
arch/arm64/kernel/smp.c | 7 +++++++
arch/arm64/mm/fault.c | 5 +++++
6 files changed, 41 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 0df64a6a56d4..d77ecdb29765 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -73,6 +73,7 @@ config ARM64
select HAVE_ARCH_MMAP_RND_BITS
select HAVE_ARCH_MMAP_RND_COMPAT_BITS if COMPAT
select HAVE_ARCH_SECCOMP_FILTER
+ select HAVE_ARCH_TASK_ISOLATION
select HAVE_ARCH_TRACEHOOK
select HAVE_ARCH_TRANSPARENT_HUGEPAGE
select HAVE_ARCH_VMAP_STACK
diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
index ddded6497a8a..9c749eca7384 100644
--- a/arch/arm64/include/asm/thread_info.h
+++ b/arch/arm64/include/asm/thread_info.h
@@ -82,6 +82,7 @@ void arch_setup_new_exec(void);
#define TIF_FOREIGN_FPSTATE 3 /* CPU's FP state is not current's */
#define TIF_UPROBE 4 /* uprobe breakpoint or singlestep */
#define TIF_FSCHECK 5 /* Check FS is USER_DS on return */
+#define TIF_TASK_ISOLATION 6
#define TIF_NOHZ 7
#define TIF_SYSCALL_TRACE 8
#define TIF_SYSCALL_AUDIT 9
@@ -97,6 +98,7 @@ void arch_setup_new_exec(void);
#define _TIF_NEED_RESCHED (1 << TIF_NEED_RESCHED)
#define _TIF_NOTIFY_RESUME (1 << TIF_NOTIFY_RESUME)
#define _TIF_FOREIGN_FPSTATE (1 << TIF_FOREIGN_FPSTATE)
+#define _TIF_TASK_ISOLATION (1 << TIF_TASK_ISOLATION)
#define _TIF_NOHZ (1 << TIF_NOHZ)
#define _TIF_SYSCALL_TRACE (1 << TIF_SYSCALL_TRACE)
#define _TIF_SYSCALL_AUDIT (1 << TIF_SYSCALL_AUDIT)
@@ -108,7 +110,8 @@ void arch_setup_new_exec(void);

#define _TIF_WORK_MASK (_TIF_NEED_RESCHED | _TIF_SIGPENDING | \
_TIF_NOTIFY_RESUME | _TIF_FOREIGN_FPSTATE | \
- _TIF_UPROBE | _TIF_FSCHECK)
+ _TIF_UPROBE | _TIF_FSCHECK | \
+ _TIF_TASK_ISOLATION)

#define _TIF_SYSCALL_WORK (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | \
_TIF_SYSCALL_TRACEPOINT | _TIF_SECCOMP | \
diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
index 9cbb6123208f..e5c0d7cdaf4e 100644
--- a/arch/arm64/kernel/ptrace.c
+++ b/arch/arm64/kernel/ptrace.c
@@ -38,6 +38,7 @@
#include <linux/regset.h>
#include <linux/tracehook.h>
#include <linux/elf.h>
+#include <linux/isolation.h>

#include <asm/compat.h>
#include <asm/debug-monitors.h>
@@ -1371,14 +1372,25 @@ static void tracehook_report_syscall(struct pt_regs *regs,

asmlinkage int syscall_trace_enter(struct pt_regs *regs)
{
- if (test_thread_flag(TIF_SYSCALL_TRACE))
+ unsigned long work = READ_ONCE(current_thread_info()->flags);
+
+ if (work & _TIF_SYSCALL_TRACE)
tracehook_report_syscall(regs, PTRACE_SYSCALL_ENTER);

- /* Do the secure computing after ptrace; failures should be fast. */
+ /*
+ * In task isolation mode, we may prevent the syscall from
+ * running, and if so we also deliver a signal to the process.
+ */
+ if (work & _TIF_TASK_ISOLATION) {
+ if (task_isolation_syscall(regs->syscallno) == -1)
+ return -1;
+ }
+
+ /* Do the secure computing check early; failures should be fast. */
if (secure_computing(NULL) == -1)
return -1;

- if (test_thread_flag(TIF_SYSCALL_TRACEPOINT))
+ if (work & _TIF_SYSCALL_TRACEPOINT)
trace_sys_enter(regs, regs->syscallno);

audit_syscall_entry(regs->syscallno, regs->orig_x0, regs->regs[1],
diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index 0bdc96c61bc0..d8f4904e992f 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -30,6 +30,7 @@
#include <linux/tracehook.h>
#include <linux/ratelimit.h>
#include <linux/syscalls.h>
+#include <linux/isolation.h>

#include <asm/debug-monitors.h>
#include <asm/elf.h>
@@ -741,6 +742,10 @@ static void do_signal(struct pt_regs *regs)
restore_saved_sigmask();
}

+#define NOTIFY_RESUME_LOOP_FLAGS \
+ (_TIF_NEED_RESCHED | _TIF_SIGPENDING | _TIF_NOTIFY_RESUME | \
+ _TIF_FOREIGN_FPSTATE | _TIF_UPROBE | _TIF_FSCHECK)
+
asmlinkage void do_notify_resume(struct pt_regs *regs,
unsigned int thread_flags)
{
@@ -777,5 +782,8 @@ asmlinkage void do_notify_resume(struct pt_regs *regs,

local_irq_disable();
thread_flags = READ_ONCE(current_thread_info()->flags);
- } while (thread_flags & _TIF_WORK_MASK);
+ } while (thread_flags & NOTIFY_RESUME_LOOP_FLAGS);
+
+ if (thread_flags & _TIF_TASK_ISOLATION)
+ task_isolation_start();
}
diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index 9f7195a5773e..4159c40de3b4 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -40,6 +40,7 @@
#include <linux/of.h>
#include <linux/irq_work.h>
#include <linux/kexec.h>
+#include <linux/isolation.h>

#include <asm/alternative.h>
#include <asm/atomic.h>
@@ -818,6 +819,7 @@ void arch_send_call_function_single_ipi(int cpu)
#ifdef CONFIG_ARM64_ACPI_PARKING_PROTOCOL
void arch_send_wakeup_ipi_mask(const struct cpumask *mask)
{
+ task_isolation_remote_cpumask(mask, "wakeup IPI");
smp_cross_call(mask, IPI_WAKEUP);
}
#endif
@@ -879,6 +881,9 @@ void handle_IPI(int ipinr, struct pt_regs *regs)
__inc_irq_stat(cpu, ipi_irqs[ipinr]);
}

+ task_isolation_interrupt("IPI type %d (%s)", ipinr,
+ ipinr < NR_IPI ? ipi_types[ipinr] : "unknown");
+
switch (ipinr) {
case IPI_RESCHEDULE:
scheduler_ipi();
@@ -941,12 +946,14 @@ void handle_IPI(int ipinr, struct pt_regs *regs)

void smp_send_reschedule(int cpu)
{
+ task_isolation_remote(cpu, "reschedule IPI");
smp_cross_call(cpumask_of(cpu), IPI_RESCHEDULE);
}

#ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST
void tick_broadcast(const struct cpumask *mask)
{
+ task_isolation_remote_cpumask(mask, "timer IPI");
smp_cross_call(mask, IPI_TIMER);
}
#endif
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index b64958b23a7f..bff2f84d5f4e 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -32,6 +32,7 @@
#include <linux/perf_event.h>
#include <linux/preempt.h>
#include <linux/hugetlb.h>
+#include <linux/isolation.h>

#include <asm/bug.h>
#include <asm/cmpxchg.h>
@@ -495,6 +496,10 @@ static int __kprobes do_page_fault(unsigned long addr, unsigned int esr,
*/
if (likely(!(fault & (VM_FAULT_ERROR | VM_FAULT_BADMAP |
VM_FAULT_BADACCESS)))) {
+ /* No signal was generated, but notify task-isolation tasks. */
+ if (user_mode(regs))
+ task_isolation_interrupt("page fault at %#lx", addr);
+
/*
* Major/minor page fault accounting is only done
* once. If we go through a retry, it is extremely
--
2.1.2


2017-11-03 17:08:01

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH v16 12/13] arm, tile: turn off timer tick for oneshot_stopped state

When the scheduler tick is disabled in tick_nohz_stop_sched_tick(),
we call hrtimer_cancel(), which eventually calls down into
__remove_hrtimer() and thus into hrtimer_force_reprogram().
That function's call to tick_program_event() detects that
we are trying to set the expiration to KTIME_MAX and calls
clockevents_switch_state() to set the state to ONESHOT_STOPPED,
and returns. See commit 8fff52fd5093 ("clockevents: Introduce
CLOCK_EVT_STATE_ONESHOT_STOPPED state") for more background.

However, by default the internal __clockevents_switch_state() code
doesn't have a "set_state_oneshot_stopped" function pointer for
the arm_arch_timer or tile clock_event_device structures, so that
code returns -ENOSYS, and we end up not setting the state, and more
importantly, we don't actually turn off the hardware timer.
As a result, the timer tick we were waiting for before is still
queued, and fires shortly afterwards, only to discover there was
nothing for it to do, at which point it quiesces.
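
For reference, the generic dispatch that this fix satisfies looks
roughly like this (a simplified paraphrase of
__clockevents_switch_state() in kernel/time/clockevents.c; the helper
name is invented for the sketch):

static int clockevents_oneshot_stopped(struct clock_event_device *dev)
{
	/* Without the callback we return -ENOSYS and the hardware
	 * timer is left running -- the behavior fixed here. */
	if (dev->set_state_oneshot_stopped)
		return dev->set_state_oneshot_stopped(dev);
	return -ENOSYS;
}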

The fix is to provide that function pointer field, and like the
other function pointers, have it just turn off the timer interrupt.
Any call to set a new timer interval will properly re-enable it.

This fix avoids a small performance hiccup for regular applications,
but for TASK_ISOLATION code, it fixes a potentially serious
kernel timer interruption to the time-sensitive application.

Signed-off-by: Chris Metcalf <[email protected]>
Acked-by: Daniel Lezcano <[email protected]>
---
arch/tile/kernel/time.c | 1 +
drivers/clocksource/arm_arch_timer.c | 2 ++
2 files changed, 3 insertions(+)

diff --git a/arch/tile/kernel/time.c b/arch/tile/kernel/time.c
index f74f10d827fa..afca6fe496c8 100644
--- a/arch/tile/kernel/time.c
+++ b/arch/tile/kernel/time.c
@@ -163,6 +163,7 @@ static DEFINE_PER_CPU(struct clock_event_device, tile_timer) = {
.set_next_event = tile_timer_set_next_event,
.set_state_shutdown = tile_timer_shutdown,
.set_state_oneshot = tile_timer_shutdown,
+ .set_state_oneshot_stopped = tile_timer_shutdown,
.tick_resume = tile_timer_shutdown,
};

diff --git a/drivers/clocksource/arm_arch_timer.c b/drivers/clocksource/arm_arch_timer.c
index fd4b7f684bd0..61ea7f907c56 100644
--- a/drivers/clocksource/arm_arch_timer.c
+++ b/drivers/clocksource/arm_arch_timer.c
@@ -722,6 +722,8 @@ static void __arch_timer_setup(unsigned type,
}
}

+ clk->set_state_oneshot_stopped = clk->set_state_shutdown;
+
clk->set_state_shutdown(clk);

clockevents_config_and_register(clk, arch_timer_rate, 0xf, 0x7fffffff);
--
2.1.2


2017-11-03 17:08:44

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH v16 05/13] Add try_stop_full_tick() API for NO_HZ_FULL

This API checks to see if the scheduler tick can be stopped,
and if so, stops it and returns 0; otherwise it returns an error.
This is intended for use with task isolation, where we will want to
be able to stop the tick synchronously when returning to userspace.
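
A sketch of how a caller is expected to interpret the result
(illustrative only; the wrapper name is invented for the sketch):

static int stop_tick_for_isolation(void)
{
	int ret = try_stop_full_tick();

	switch (ret) {
	case 0:		/* tick stopped; safe to enter isolation */
		break;
	case -EINVAL:	/* permanent failure (unstable clock); give up */
		break;
	case -EAGAIN:	/* transient failure; the caller may retry */
		break;
	}
	return ret;
}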

Signed-off-by: Chris Metcalf <[email protected]>
---
include/linux/tick.h | 1 +
kernel/time/tick-sched.c | 18 ++++++++++++++++++
2 files changed, 19 insertions(+)

diff --git a/include/linux/tick.h b/include/linux/tick.h
index fe01e68bf520..078ff2464b00 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -234,6 +234,7 @@ static inline void tick_dep_clear_signal(struct signal_struct *signal,

extern void tick_nohz_full_kick_cpu(int cpu);
extern void __tick_nohz_task_switch(void);
+extern int try_stop_full_tick(void);
#else
static inline int housekeeping_any_cpu(void)
{
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index c7a899c5ce64..c026145eba2f 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -861,6 +861,24 @@ static void tick_nohz_full_update_tick(struct tick_sched *ts)
#endif
}

+#ifdef CONFIG_TASK_ISOLATION
+int try_stop_full_tick(void)
+{
+ int cpu = smp_processor_id();
+ struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
+
+ /* For an unstable clock, we should return a permanent error code. */
+ if (atomic_read(&tick_dep_mask) & TICK_DEP_MASK_CLOCK_UNSTABLE)
+ return -EINVAL;
+
+ if (!can_stop_full_tick(cpu, ts))
+ return -EAGAIN;
+
+ tick_nohz_stop_sched_tick(ts, ktime_get(), cpu);
+ return 0;
+}
+#endif
+
static bool can_stop_idle_tick(int cpu, struct tick_sched *ts)
{
/*
--
2.1.2

