2024-04-16 01:15:52

by Elizabeth Figura

[permalink] [raw]
Subject: [PATCH v4 00/30] NT synchronization primitive driver

This patch series implements a new char misc driver, /dev/ntsync, which is used
to implement Windows NT synchronization primitives.

NT synchronization primitives are unique in that the wait functions both are
vectored, operate on multiple types of object with different behaviour (mutex,
semaphore, event), and affect the state of the objects they wait on. This model
is not compatible with existing kernel synchronization objects or interfaces,
and therefore the ntsync driver implements its own wait queues and locking.

Hence I would like to request review from someone familiar with locking to make
sure that the usage of low-level kernel primitives is correct and that the wait
queues work as intended, and to that end I've CC'd the locking maintainers.

== Background ==

The Wine project emulates the Windows API in user space. One particular part of
that API, namely the NT synchronization primitives, have historically been
implemented via RPC to a dedicated "kernel" process. However, more recent
applications use these APIs more strenuously, and the overhead of RPC has become
a bottleneck.

The NT synchronization APIs are too complex to implement on top of existing
primitives without sacrificing correctness. Certain operations, such as
NtPulseEvent() or the "wait-for-all" mode of NtWaitForMultipleObjects(), require
direct control over the underlying wait queue, and implementing a wait queue
sufficiently robust for Wine in user space is not possible. This proposed
driver, therefore, implements the problematic interfaces directly in the Linux
kernel.

This driver was presented at Linux Plumbers Conference 2023. For those further
interested in the history of synchronization in Wine and past attempts to solve
this problem in user space, a recording of the presentation can be viewed here:

https://www.youtube.com/watch?v=NjU4nyWyhU8


== Performance ==

The gain in performance varies wildly depending on the application in question
and the user's hardware. For some games NT synchronization is not a bottleneck
and no change can be observed, but for others frame rate improvements of 50 to
150 percent are not atypical. The following table lists frame rate measurements
from a variety of games on a variety of hardware, taken by users Dmitry
Skvortsov, FuzzyQuils, OnMars, and myself:

Game Upstream ntsync improvement
===========================================================================
Anger Foot 69 99 43%
Call of Juarez 99.8 224.1 125%
Dirt 3 110.6 860.7 678%
Forza Horizon 5 108 160 48%
Lara Croft: Temple of Osiris 141 326 131%
Metro 2033 164.4 199.2 21%
Resident Evil 2 26 77 196%
The Crew 26 51 96%
Tiny Tina's Wonderlands 130 360 177%
Total War Saga: Troy 109 146 34%
===========================================================================


== Patches ==

The intended semantics of the patches are broadly intended to match those of the
corresponding Windows functions. For those not already familiar with the Windows
functions (or their undocumented behaviour), patch 27/27 provides a detailed
specification, and individual patches also include a brief description of the
API they are implementing.

The patches making use of this driver in Wine can be retrieved or browsed here:

https://repo.or.cz/wine/zf.git/shortlog/refs/heads/ntsync5


== Implementation ==

Some aspects of the implementation may deserve particular comment:

* In the interest of performance, each object is governed only by a single
spinlock. However, NTSYNC_IOC_WAIT_ALL requires that the state of multiple
objects be changed as a single atomic operation. In order to achieve this, we
first take a device-wide lock ("wait_all_lock") any time we are going to lock
more than one object at a time.

The maximum number of objects that can be used in a vectored wait, and
therefore the maximum that can be locked simultaneously, is 64. This number is
NT's own limit.

The acquisition of multiple spinlocks will degrade performance. This is a
conscious choice, however. Wait-for-all is known to be a very rare operation
in practice, especially with counts that approach the maximum, and it is the
intent of the ntsync driver to optimize wait-for-any at the expense of
wait-for-all as much as possible.

* NT mutexes are tied to their threads on an OS level, and the kernel includes
builtin support for "robust" mutexes. In order to keep the ntsync driver
self-contained and avoid touching more code than necessary, it does not hook
into task exit nor use pids.

Instead, the user space emulator is expected to manage thread IDs and pass
them as an argument to any relevant functions; this is the "owner" field of
ntsync_wait_args and ntsync_mutex_args.

When the emulator detects that a thread dies, it should therefore call
NTSYNC_IOC_MUTEX_KILL on any open mutexes.

* ntsync is module-capable mostly because there was nothing preventing it, and
because it aided development. It is not a hard requirement, though.


== Previous versions ==

Changes from v3:

* Add .gitignore and use KHDR_INCLUDES in selftest build files, per Muhammad
Usama Anjum.

* Try to explain why we are rolling our own primitives a little better, per Greg
Kroah-Hartman.

* Link to v3: https://lore.kernel.org/lkml/[email protected]/
* Link to v2: https://lore.kernel.org/lkml/[email protected]/
* Link to v1: https://lore.kernel.org/lkml/[email protected]/
* Link to RFC v2: https://lore.kernel.org/lkml/[email protected]/
* Link to RFC v1: https://lore.kernel.org/lkml/[email protected]/

Elizabeth Figura (27):
ntsync: Introduce NTSYNC_IOC_WAIT_ANY.
ntsync: Introduce NTSYNC_IOC_WAIT_ALL.
ntsync: Introduce NTSYNC_IOC_CREATE_MUTEX.
ntsync: Introduce NTSYNC_IOC_MUTEX_UNLOCK.
ntsync: Introduce NTSYNC_IOC_MUTEX_KILL.
ntsync: Introduce NTSYNC_IOC_CREATE_EVENT.
ntsync: Introduce NTSYNC_IOC_EVENT_SET.
ntsync: Introduce NTSYNC_IOC_EVENT_RESET.
ntsync: Introduce NTSYNC_IOC_EVENT_PULSE.
ntsync: Introduce NTSYNC_IOC_SEM_READ.
ntsync: Introduce NTSYNC_IOC_MUTEX_READ.
ntsync: Introduce NTSYNC_IOC_EVENT_READ.
ntsync: Introduce alertable waits.
selftests: ntsync: Add some tests for semaphore state.
selftests: ntsync: Add some tests for mutex state.
selftests: ntsync: Add some tests for NTSYNC_IOC_WAIT_ANY.
selftests: ntsync: Add some tests for NTSYNC_IOC_WAIT_ALL.
selftests: ntsync: Add some tests for wakeup signaling with
WINESYNC_IOC_WAIT_ANY.
selftests: ntsync: Add some tests for wakeup signaling with
WINESYNC_IOC_WAIT_ALL.
selftests: ntsync: Add some tests for manual-reset event state.
selftests: ntsync: Add some tests for auto-reset event state.
selftests: ntsync: Add some tests for wakeup signaling with events.
selftests: ntsync: Add tests for alertable waits.
selftests: ntsync: Add some tests for wakeup signaling via alerts.
selftests: ntsync: Add a stress test for contended waits.
maintainers: Add an entry for ntsync.
docs: ntsync: Add documentation for the ntsync uAPI.

Documentation/userspace-api/index.rst | 1 +
Documentation/userspace-api/ntsync.rst | 399 +++++
MAINTAINERS | 9 +
drivers/misc/ntsync.c | 925 ++++++++++-
include/uapi/linux/ntsync.h | 39 +
tools/testing/selftests/Makefile | 1 +
.../selftests/drivers/ntsync/.gitignore | 1 +
.../testing/selftests/drivers/ntsync/Makefile | 7 +
tools/testing/selftests/drivers/ntsync/config | 1 +
.../testing/selftests/drivers/ntsync/ntsync.c | 1407 +++++++++++++++++
10 files changed, 2786 insertions(+), 4 deletions(-)
create mode 100644 Documentation/userspace-api/ntsync.rst
create mode 100644 tools/testing/selftests/drivers/ntsync/.gitignore
create mode 100644 tools/testing/selftests/drivers/ntsync/Makefile
create mode 100644 tools/testing/selftests/drivers/ntsync/config
create mode 100644 tools/testing/selftests/drivers/ntsync/ntsync.c


base-commit: ebbc1a4789c666846b9854ef845a37a64879e5f9
--
2.43.0



2024-04-16 01:16:35

by Elizabeth Figura

[permalink] [raw]
Subject: [PATCH v4 06/27] ntsync: Introduce NTSYNC_IOC_CREATE_EVENT.

This correspond to the NT syscall NtCreateEvent().

An NT event holds a single bit of state denoting whether it is signaled or
unsignaled.

There are two types of events: manual-reset and automatic-reset. When an
automatic-reset event is acquired via a wait function, its state is reset to
unsignaled. Manual-reset events are not affected by wait functions.

Whether the event is manual-reset, and its initial state, are specified at
creation time.

Signed-off-by: Elizabeth Figura <[email protected]>
---
drivers/misc/ntsync.c | 61 +++++++++++++++++++++++++++++++++++++
include/uapi/linux/ntsync.h | 7 +++++
2 files changed, 68 insertions(+)

diff --git a/drivers/misc/ntsync.c b/drivers/misc/ntsync.c
index 1e68f96bc2a6..3e125c805c00 100644
--- a/drivers/misc/ntsync.c
+++ b/drivers/misc/ntsync.c
@@ -25,6 +25,7 @@
enum ntsync_type {
NTSYNC_TYPE_SEM,
NTSYNC_TYPE_MUTEX,
+ NTSYNC_TYPE_EVENT,
};

/*
@@ -59,6 +60,10 @@ struct ntsync_obj {
__u32 owner;
bool ownerdead;
} mutex;
+ struct {
+ bool manual;
+ bool signaled;
+ } event;
} u;

/*
@@ -143,6 +148,8 @@ static bool is_signaled(struct ntsync_obj *obj, __u32 owner)
if (obj->u.mutex.owner && obj->u.mutex.owner != owner)
return false;
return obj->u.mutex.count < UINT_MAX;
+ case NTSYNC_TYPE_EVENT:
+ return obj->u.event.signaled;
}

WARN(1, "bad object type %#x\n", obj->type);
@@ -193,6 +200,10 @@ static void try_wake_all(struct ntsync_device *dev, struct ntsync_q *q,
obj->u.mutex.count++;
obj->u.mutex.owner = q->owner;
break;
+ case NTSYNC_TYPE_EVENT:
+ if (!obj->u.event.manual)
+ obj->u.event.signaled = false;
+ break;
}
}
wake_up_process(q->task);
@@ -261,6 +272,27 @@ static void try_wake_any_mutex(struct ntsync_obj *mutex)
}
}

+static void try_wake_any_event(struct ntsync_obj *event)
+{
+ struct ntsync_q_entry *entry;
+
+ lockdep_assert_held(&event->lock);
+
+ list_for_each_entry(entry, &event->any_waiters, node) {
+ struct ntsync_q *q = entry->q;
+ int signaled = -1;
+
+ if (!event->u.event.signaled)
+ break;
+
+ if (atomic_try_cmpxchg(&q->signaled, &signaled, entry->index)) {
+ if (!event->u.event.manual)
+ event->u.event.signaled = false;
+ wake_up_process(q->task);
+ }
+ }
+}
+
/*
* Actually change the semaphore state, returning -EOVERFLOW if it is made
* invalid.
@@ -569,6 +601,30 @@ static int ntsync_create_mutex(struct ntsync_device *dev, void __user *argp)
return put_user(fd, &user_args->mutex);
}

+static int ntsync_create_event(struct ntsync_device *dev, void __user *argp)
+{
+ struct ntsync_event_args __user *user_args = argp;
+ struct ntsync_event_args args;
+ struct ntsync_obj *event;
+ int fd;
+
+ if (copy_from_user(&args, argp, sizeof(args)))
+ return -EFAULT;
+
+ event = ntsync_alloc_obj(dev, NTSYNC_TYPE_EVENT);
+ if (!event)
+ return -ENOMEM;
+ event->u.event.manual = args.manual;
+ event->u.event.signaled = args.signaled;
+ fd = ntsync_obj_get_fd(event);
+ if (fd < 0) {
+ kfree(event);
+ return fd;
+ }
+
+ return put_user(fd, &user_args->event);
+}
+
static struct ntsync_obj *get_obj(struct ntsync_device *dev, int fd)
{
struct file *file = fget(fd);
@@ -702,6 +758,9 @@ static void try_wake_any_obj(struct ntsync_obj *obj)
case NTSYNC_TYPE_MUTEX:
try_wake_any_mutex(obj);
break;
+ case NTSYNC_TYPE_EVENT:
+ try_wake_any_event(obj);
+ break;
}
}

@@ -890,6 +949,8 @@ static long ntsync_char_ioctl(struct file *file, unsigned int cmd,
void __user *argp = (void __user *)parm;

switch (cmd) {
+ case NTSYNC_IOC_CREATE_EVENT:
+ return ntsync_create_event(dev, argp);
case NTSYNC_IOC_CREATE_MUTEX:
return ntsync_create_mutex(dev, argp);
case NTSYNC_IOC_CREATE_SEM:
diff --git a/include/uapi/linux/ntsync.h b/include/uapi/linux/ntsync.h
index 1bff8f19d6d9..0d133f2eaf0b 100644
--- a/include/uapi/linux/ntsync.h
+++ b/include/uapi/linux/ntsync.h
@@ -22,6 +22,12 @@ struct ntsync_mutex_args {
__u32 count;
};

+struct ntsync_event_args {
+ __u32 event;
+ __u32 manual;
+ __u32 signaled;
+};
+
#define NTSYNC_WAIT_REALTIME 0x1

struct ntsync_wait_args {
@@ -41,6 +47,7 @@ struct ntsync_wait_args {
#define NTSYNC_IOC_WAIT_ANY _IOWR('N', 0x82, struct ntsync_wait_args)
#define NTSYNC_IOC_WAIT_ALL _IOWR('N', 0x83, struct ntsync_wait_args)
#define NTSYNC_IOC_CREATE_MUTEX _IOWR('N', 0x84, struct ntsync_sem_args)
+#define NTSYNC_IOC_CREATE_EVENT _IOWR('N', 0x87, struct ntsync_event_args)

#define NTSYNC_IOC_SEM_POST _IOWR('N', 0x81, __u32)
#define NTSYNC_IOC_MUTEX_UNLOCK _IOWR('N', 0x85, struct ntsync_mutex_args)
--
2.43.0


2024-04-16 01:21:54

by Elizabeth Figura

[permalink] [raw]
Subject: [PATCH v4 07/27] ntsync: Introduce NTSYNC_IOC_EVENT_SET.

This corresponds to the NT syscall NtSetEvent().

This sets the event to the signaled state, and returns its previous state.

Signed-off-by: Elizabeth Figura <[email protected]>
---
drivers/misc/ntsync.c | 37 +++++++++++++++++++++++++++++++++++++
include/uapi/linux/ntsync.h | 1 +
2 files changed, 38 insertions(+)

diff --git a/drivers/misc/ntsync.c b/drivers/misc/ntsync.c
index 3e125c805c00..69f359241cf6 100644
--- a/drivers/misc/ntsync.c
+++ b/drivers/misc/ntsync.c
@@ -473,6 +473,41 @@ static int ntsync_mutex_kill(struct ntsync_obj *mutex, void __user *argp)
return ret;
}

+static int ntsync_event_set(struct ntsync_obj *event, void __user *argp)
+{
+ struct ntsync_device *dev = event->dev;
+ __u32 prev_state;
+
+ if (event->type != NTSYNC_TYPE_EVENT)
+ return -EINVAL;
+
+ if (atomic_read(&event->all_hint) > 0) {
+ spin_lock(&dev->wait_all_lock);
+ spin_lock_nest_lock(&event->lock, &dev->wait_all_lock);
+
+ prev_state = event->u.event.signaled;
+ event->u.event.signaled = true;
+ try_wake_all_obj(dev, event);
+ try_wake_any_event(event);
+
+ spin_unlock(&event->lock);
+ spin_unlock(&dev->wait_all_lock);
+ } else {
+ spin_lock(&event->lock);
+
+ prev_state = event->u.event.signaled;
+ event->u.event.signaled = true;
+ try_wake_any_event(event);
+
+ spin_unlock(&event->lock);
+ }
+
+ if (put_user(prev_state, (__u32 __user *)argp))
+ return -EFAULT;
+
+ return 0;
+}
+
static int ntsync_obj_release(struct inode *inode, struct file *file)
{
struct ntsync_obj *obj = file->private_data;
@@ -496,6 +531,8 @@ static long ntsync_obj_ioctl(struct file *file, unsigned int cmd,
return ntsync_mutex_unlock(obj, argp);
case NTSYNC_IOC_MUTEX_KILL:
return ntsync_mutex_kill(obj, argp);
+ case NTSYNC_IOC_EVENT_SET:
+ return ntsync_event_set(obj, argp);
default:
return -ENOIOCTLCMD;
}
diff --git a/include/uapi/linux/ntsync.h b/include/uapi/linux/ntsync.h
index 0d133f2eaf0b..65329d15a472 100644
--- a/include/uapi/linux/ntsync.h
+++ b/include/uapi/linux/ntsync.h
@@ -52,5 +52,6 @@ struct ntsync_wait_args {
#define NTSYNC_IOC_SEM_POST _IOWR('N', 0x81, __u32)
#define NTSYNC_IOC_MUTEX_UNLOCK _IOWR('N', 0x85, struct ntsync_mutex_args)
#define NTSYNC_IOC_MUTEX_KILL _IOW ('N', 0x86, __u32)
+#define NTSYNC_IOC_EVENT_SET _IOR ('N', 0x88, __u32)

#endif
--
2.43.0


2024-04-16 01:22:24

by Elizabeth Figura

[permalink] [raw]
Subject: [PATCH v4 02/27] ntsync: Introduce NTSYNC_IOC_WAIT_ALL.

This is similar to NTSYNC_IOC_WAIT_ANY, but waits until all of the objects are
simultaneously signaled, and then acquires all of them as a single atomic
operation.

Signed-off-by: Elizabeth Figura <[email protected]>
---
drivers/misc/ntsync.c | 243 ++++++++++++++++++++++++++++++++++--
include/uapi/linux/ntsync.h | 1 +
2 files changed, 236 insertions(+), 8 deletions(-)

diff --git a/drivers/misc/ntsync.c b/drivers/misc/ntsync.c
index c6f84a5fc8c0..e914d626465a 100644
--- a/drivers/misc/ntsync.c
+++ b/drivers/misc/ntsync.c
@@ -55,7 +55,34 @@ struct ntsync_obj {
} sem;
} u;

+ /*
+ * any_waiters is protected by the object lock, but all_waiters is
+ * protected by the device wait_all_lock.
+ */
struct list_head any_waiters;
+ struct list_head all_waiters;
+
+ /*
+ * Hint describing how many tasks are queued on this object in a
+ * wait-all operation.
+ *
+ * Any time we do a wake, we may need to wake "all" waiters as well as
+ * "any" waiters. In order to atomically wake "all" waiters, we must
+ * lock all of the objects, and that means grabbing the wait_all_lock
+ * below (and, due to lock ordering rules, before locking this object).
+ * However, wait-all is a rare operation, and grabbing the wait-all
+ * lock for every wake would create unnecessary contention.
+ * Therefore we first check whether all_hint is zero, and, if it is,
+ * we skip trying to wake "all" waiters.
+ *
+ * This hint isn't protected by any lock. It might change during the
+ * course of a wake, but there's no meaningful race there; it's only a
+ * hint.
+ *
+ * Since wait requests must originate from user-space threads, we're
+ * limited here by PID_MAX_LIMIT, so there's no risk of overflow.
+ */
+ atomic_t all_hint;
};

struct ntsync_q_entry {
@@ -76,14 +103,100 @@ struct ntsync_q {
*/
atomic_t signaled;

+ bool all;
__u32 count;
struct ntsync_q_entry entries[];
};

struct ntsync_device {
+ /*
+ * Wait-all operations must atomically grab all objects, and be totally
+ * ordered with respect to each other and wait-any operations.
+ * If one thread is trying to acquire several objects, another thread
+ * cannot touch the object at the same time.
+ *
+ * We achieve this by grabbing multiple object locks at the same time.
+ * However, this creates a lock ordering problem. To solve that problem,
+ * wait_all_lock is taken first whenever multiple objects must be locked
+ * at the same time.
+ */
+ spinlock_t wait_all_lock;
+
struct file *file;
};

+static bool is_signaled(struct ntsync_obj *obj, __u32 owner)
+{
+ lockdep_assert_held(&obj->lock);
+
+ switch (obj->type) {
+ case NTSYNC_TYPE_SEM:
+ return !!obj->u.sem.count;
+ }
+
+ WARN(1, "bad object type %#x\n", obj->type);
+ return false;
+}
+
+/*
+ * "locked_obj" is an optional pointer to an object which is already locked and
+ * should not be locked again. This is necessary so that changing an object's
+ * state and waking it can be a single atomic operation.
+ */
+static void try_wake_all(struct ntsync_device *dev, struct ntsync_q *q,
+ struct ntsync_obj *locked_obj)
+{
+ __u32 count = q->count;
+ bool can_wake = true;
+ int signaled = -1;
+ __u32 i;
+
+ lockdep_assert_held(&dev->wait_all_lock);
+ if (locked_obj)
+ lockdep_assert_held(&locked_obj->lock);
+
+ for (i = 0; i < count; i++) {
+ if (q->entries[i].obj != locked_obj)
+ spin_lock_nest_lock(&q->entries[i].obj->lock, &dev->wait_all_lock);
+ }
+
+ for (i = 0; i < count; i++) {
+ if (!is_signaled(q->entries[i].obj, q->owner)) {
+ can_wake = false;
+ break;
+ }
+ }
+
+ if (can_wake && atomic_try_cmpxchg(&q->signaled, &signaled, 0)) {
+ for (i = 0; i < count; i++) {
+ struct ntsync_obj *obj = q->entries[i].obj;
+
+ switch (obj->type) {
+ case NTSYNC_TYPE_SEM:
+ obj->u.sem.count--;
+ break;
+ }
+ }
+ wake_up_process(q->task);
+ }
+
+ for (i = 0; i < count; i++) {
+ if (q->entries[i].obj != locked_obj)
+ spin_unlock(&q->entries[i].obj->lock);
+ }
+}
+
+static void try_wake_all_obj(struct ntsync_device *dev, struct ntsync_obj *obj)
+{
+ struct ntsync_q_entry *entry;
+
+ lockdep_assert_held(&dev->wait_all_lock);
+ lockdep_assert_held(&obj->lock);
+
+ list_for_each_entry(entry, &obj->all_waiters, node)
+ try_wake_all(dev, entry->q, obj);
+}
+
static void try_wake_any_sem(struct ntsync_obj *sem)
{
struct ntsync_q_entry *entry;
@@ -124,6 +237,7 @@ static int post_sem_state(struct ntsync_obj *sem, __u32 count)

static int ntsync_sem_post(struct ntsync_obj *sem, void __user *argp)
{
+ struct ntsync_device *dev = sem->dev;
__u32 __user *user_args = argp;
__u32 prev_count;
__u32 args;
@@ -135,14 +249,29 @@ static int ntsync_sem_post(struct ntsync_obj *sem, void __user *argp)
if (sem->type != NTSYNC_TYPE_SEM)
return -EINVAL;

- spin_lock(&sem->lock);
+ if (atomic_read(&sem->all_hint) > 0) {
+ spin_lock(&dev->wait_all_lock);
+ spin_lock_nest_lock(&sem->lock, &dev->wait_all_lock);

- prev_count = sem->u.sem.count;
- ret = post_sem_state(sem, args);
- if (!ret)
- try_wake_any_sem(sem);
+ prev_count = sem->u.sem.count;
+ ret = post_sem_state(sem, args);
+ if (!ret) {
+ try_wake_all_obj(dev, sem);
+ try_wake_any_sem(sem);
+ }

- spin_unlock(&sem->lock);
+ spin_unlock(&sem->lock);
+ spin_unlock(&dev->wait_all_lock);
+ } else {
+ spin_lock(&sem->lock);
+
+ prev_count = sem->u.sem.count;
+ ret = post_sem_state(sem, args);
+ if (!ret)
+ try_wake_any_sem(sem);
+
+ spin_unlock(&sem->lock);
+ }

if (!ret && put_user(prev_count, user_args))
ret = -EFAULT;
@@ -195,6 +324,8 @@ static struct ntsync_obj *ntsync_alloc_obj(struct ntsync_device *dev,
get_file(dev->file);
spin_lock_init(&obj->lock);
INIT_LIST_HEAD(&obj->any_waiters);
+ INIT_LIST_HEAD(&obj->all_waiters);
+ atomic_set(&obj->all_hint, 0);

return obj;
}
@@ -306,7 +437,7 @@ static int ntsync_schedule(const struct ntsync_q *q, const struct ntsync_wait_ar
* Allocate and initialize the ntsync_q structure, but do not queue us yet.
*/
static int setup_wait(struct ntsync_device *dev,
- const struct ntsync_wait_args *args,
+ const struct ntsync_wait_args *args, bool all,
struct ntsync_q **ret_q)
{
const __u32 count = args->count;
@@ -333,6 +464,7 @@ static int setup_wait(struct ntsync_device *dev,
q->task = current;
q->owner = args->owner;
atomic_set(&q->signaled, -1);
+ q->all = all;
q->count = count;

for (i = 0; i < count; i++) {
@@ -342,6 +474,16 @@ static int setup_wait(struct ntsync_device *dev,
if (!obj)
goto err;

+ if (all) {
+ /* Check that the objects are all distinct. */
+ for (j = 0; j < i; j++) {
+ if (obj == q->entries[j].obj) {
+ put_obj(obj);
+ goto err;
+ }
+ }
+ }
+
entry->obj = obj;
entry->q = q;
entry->index = i;
@@ -377,7 +519,7 @@ static int ntsync_wait_any(struct ntsync_device *dev, void __user *argp)
if (copy_from_user(&args, argp, sizeof(args)))
return -EFAULT;

- ret = setup_wait(dev, &args, &q);
+ ret = setup_wait(dev, &args, false, &q);
if (ret < 0)
return ret;

@@ -439,6 +581,87 @@ static int ntsync_wait_any(struct ntsync_device *dev, void __user *argp)
return ret;
}

+static int ntsync_wait_all(struct ntsync_device *dev, void __user *argp)
+{
+ struct ntsync_wait_args args;
+ struct ntsync_q *q;
+ int signaled;
+ __u32 i;
+ int ret;
+
+ if (copy_from_user(&args, argp, sizeof(args)))
+ return -EFAULT;
+
+ ret = setup_wait(dev, &args, true, &q);
+ if (ret < 0)
+ return ret;
+
+ /* queue ourselves */
+
+ spin_lock(&dev->wait_all_lock);
+
+ for (i = 0; i < args.count; i++) {
+ struct ntsync_q_entry *entry = &q->entries[i];
+ struct ntsync_obj *obj = entry->obj;
+
+ atomic_inc(&obj->all_hint);
+
+ /*
+ * obj->all_waiters is protected by dev->wait_all_lock rather
+ * than obj->lock, so there is no need to acquire obj->lock
+ * here.
+ */
+ list_add_tail(&entry->node, &obj->all_waiters);
+ }
+
+ /* check if we are already signaled */
+
+ try_wake_all(dev, q, NULL);
+
+ spin_unlock(&dev->wait_all_lock);
+
+ /* sleep */
+
+ ret = ntsync_schedule(q, &args);
+
+ /* and finally, unqueue */
+
+ spin_lock(&dev->wait_all_lock);
+
+ for (i = 0; i < args.count; i++) {
+ struct ntsync_q_entry *entry = &q->entries[i];
+ struct ntsync_obj *obj = entry->obj;
+
+ /*
+ * obj->all_waiters is protected by dev->wait_all_lock rather
+ * than obj->lock, so there is no need to acquire it here.
+ */
+ list_del(&entry->node);
+
+ atomic_dec(&obj->all_hint);
+
+ put_obj(obj);
+ }
+
+ spin_unlock(&dev->wait_all_lock);
+
+ signaled = atomic_read(&q->signaled);
+ if (signaled != -1) {
+ struct ntsync_wait_args __user *user_args = argp;
+
+ /* even if we caught a signal, we need to communicate success */
+ ret = 0;
+
+ if (put_user(signaled, &user_args->index))
+ ret = -EFAULT;
+ } else if (!ret) {
+ ret = -ETIMEDOUT;
+ }
+
+ kfree(q);
+ return ret;
+}
+
static int ntsync_char_open(struct inode *inode, struct file *file)
{
struct ntsync_device *dev;
@@ -447,6 +670,8 @@ static int ntsync_char_open(struct inode *inode, struct file *file)
if (!dev)
return -ENOMEM;

+ spin_lock_init(&dev->wait_all_lock);
+
file->private_data = dev;
dev->file = file;
return nonseekable_open(inode, file);
@@ -470,6 +695,8 @@ static long ntsync_char_ioctl(struct file *file, unsigned int cmd,
switch (cmd) {
case NTSYNC_IOC_CREATE_SEM:
return ntsync_create_sem(dev, argp);
+ case NTSYNC_IOC_WAIT_ALL:
+ return ntsync_wait_all(dev, argp);
case NTSYNC_IOC_WAIT_ANY:
return ntsync_wait_any(dev, argp);
default:
diff --git a/include/uapi/linux/ntsync.h b/include/uapi/linux/ntsync.h
index 60ad414b5552..83784d4438a1 100644
--- a/include/uapi/linux/ntsync.h
+++ b/include/uapi/linux/ntsync.h
@@ -33,6 +33,7 @@ struct ntsync_wait_args {

#define NTSYNC_IOC_CREATE_SEM _IOWR('N', 0x80, struct ntsync_sem_args)
#define NTSYNC_IOC_WAIT_ANY _IOWR('N', 0x82, struct ntsync_wait_args)
+#define NTSYNC_IOC_WAIT_ALL _IOWR('N', 0x83, struct ntsync_wait_args)

#define NTSYNC_IOC_SEM_POST _IOWR('N', 0x81, __u32)

--
2.43.0


2024-04-16 01:22:28

by Elizabeth Figura

[permalink] [raw]
Subject: [PATCH v4 26/27] maintainers: Add an entry for ntsync.

Add myself as maintainer, supported by CodeWeavers.

Signed-off-by: Elizabeth Figura <[email protected]>
---
MAINTAINERS | 9 +++++++++
1 file changed, 9 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 41a013dfebbc..09ae011a8d91 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -15728,6 +15728,15 @@ T: git https://github.com/Paragon-Software-Group/linux-ntfs3.git
F: Documentation/filesystems/ntfs3.rst
F: fs/ntfs3/

+NTSYNC SYNCHRONIZATION PRIMITIVE DRIVER
+M: Elizabeth Figura <[email protected]>
+L: [email protected]
+S: Supported
+F: Documentation/userspace-api/ntsync.rst
+F: drivers/misc/ntsync.c
+F: include/uapi/linux/ntsync.h
+F: tools/testing/selftests/drivers/ntsync/
+
NUBUS SUBSYSTEM
M: Finn Thain <[email protected]>
L: [email protected]
--
2.43.0


2024-04-16 01:22:43

by Elizabeth Figura

[permalink] [raw]
Subject: [PATCH v4 14/27] selftests: ntsync: Add some tests for semaphore state.

Wine has tests for its synchronization primitives, but these are more accessible
to kernel developers, and also allow us to test some edge cases that Wine does
not care about.

This patch adds tests for semaphore-specific ioctls NTSYNC_IOC_SEM_POST and
NTSYNC_IOC_SEM_READ, and waiting on semaphores.

Signed-off-by: Elizabeth Figura <[email protected]>
---
tools/testing/selftests/Makefile | 1 +
.../selftests/drivers/ntsync/.gitignore | 1 +
.../testing/selftests/drivers/ntsync/Makefile | 7 +
tools/testing/selftests/drivers/ntsync/config | 1 +
.../testing/selftests/drivers/ntsync/ntsync.c | 149 ++++++++++++++++++
5 files changed, 159 insertions(+)
create mode 100644 tools/testing/selftests/drivers/ntsync/.gitignore
create mode 100644 tools/testing/selftests/drivers/ntsync/Makefile
create mode 100644 tools/testing/selftests/drivers/ntsync/config
create mode 100644 tools/testing/selftests/drivers/ntsync/ntsync.c

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index e1504833654d..6f95206325e1 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -16,6 +16,7 @@ TARGETS += damon
TARGETS += devices
TARGETS += dmabuf-heaps
TARGETS += drivers/dma-buf
+TARGETS += drivers/ntsync
TARGETS += drivers/s390x/uvdevice
TARGETS += drivers/net/bonding
TARGETS += drivers/net/team
diff --git a/tools/testing/selftests/drivers/ntsync/.gitignore b/tools/testing/selftests/drivers/ntsync/.gitignore
new file mode 100644
index 000000000000..848573a3d3ea
--- /dev/null
+++ b/tools/testing/selftests/drivers/ntsync/.gitignore
@@ -0,0 +1 @@
+ntsync
diff --git a/tools/testing/selftests/drivers/ntsync/Makefile b/tools/testing/selftests/drivers/ntsync/Makefile
new file mode 100644
index 000000000000..dbf2b055c0b2
--- /dev/null
+++ b/tools/testing/selftests/drivers/ntsync/Makefile
@@ -0,0 +1,7 @@
+# SPDX-LICENSE-IDENTIFIER: GPL-2.0-only
+TEST_GEN_PROGS := ntsync
+
+CFLAGS += $(KHDR_INCLUDES)
+LDLIBS += -lpthread
+
+include ../../lib.mk
diff --git a/tools/testing/selftests/drivers/ntsync/config b/tools/testing/selftests/drivers/ntsync/config
new file mode 100644
index 000000000000..60539c826d06
--- /dev/null
+++ b/tools/testing/selftests/drivers/ntsync/config
@@ -0,0 +1 @@
+CONFIG_WINESYNC=y
diff --git a/tools/testing/selftests/drivers/ntsync/ntsync.c b/tools/testing/selftests/drivers/ntsync/ntsync.c
new file mode 100644
index 000000000000..1e145c6dfded
--- /dev/null
+++ b/tools/testing/selftests/drivers/ntsync/ntsync.c
@@ -0,0 +1,149 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Various unit tests for the "ntsync" synchronization primitive driver.
+ *
+ * Copyright (C) 2021-2022 Elizabeth Figura <[email protected]>
+ */
+
+#define _GNU_SOURCE
+#include <sys/ioctl.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <time.h>
+#include <pthread.h>
+#include <linux/ntsync.h>
+#include "../../kselftest_harness.h"
+
+static int read_sem_state(int sem, __u32 *count, __u32 *max)
+{
+ struct ntsync_sem_args args;
+ int ret;
+
+ memset(&args, 0xcc, sizeof(args));
+ ret = ioctl(sem, NTSYNC_IOC_SEM_READ, &args);
+ *count = args.count;
+ *max = args.max;
+ return ret;
+}
+
+#define check_sem_state(sem, count, max) \
+ ({ \
+ __u32 __count, __max; \
+ int ret = read_sem_state((sem), &__count, &__max); \
+ EXPECT_EQ(0, ret); \
+ EXPECT_EQ((count), __count); \
+ EXPECT_EQ((max), __max); \
+ })
+
+static int post_sem(int sem, __u32 *count)
+{
+ return ioctl(sem, NTSYNC_IOC_SEM_POST, count);
+}
+
+static int wait_any(int fd, __u32 count, const int *objs, __u32 owner, __u32 *index)
+{
+ struct ntsync_wait_args args = {0};
+ struct timespec timeout;
+ int ret;
+
+ clock_gettime(CLOCK_MONOTONIC, &timeout);
+
+ args.timeout = timeout.tv_sec * 1000000000 + timeout.tv_nsec;
+ args.count = count;
+ args.objs = (uintptr_t)objs;
+ args.owner = owner;
+ args.index = 0xdeadbeef;
+ ret = ioctl(fd, NTSYNC_IOC_WAIT_ANY, &args);
+ *index = args.index;
+ return ret;
+}
+
+TEST(semaphore_state)
+{
+ struct ntsync_sem_args sem_args;
+ struct timespec timeout;
+ __u32 count, index;
+ int fd, ret, sem;
+
+ clock_gettime(CLOCK_MONOTONIC, &timeout);
+
+ fd = open("/dev/ntsync", O_CLOEXEC | O_RDONLY);
+ ASSERT_LE(0, fd);
+
+ sem_args.count = 3;
+ sem_args.max = 2;
+ sem_args.sem = 0xdeadbeef;
+ ret = ioctl(fd, NTSYNC_IOC_CREATE_SEM, &sem_args);
+ EXPECT_EQ(-1, ret);
+ EXPECT_EQ(EINVAL, errno);
+
+ sem_args.count = 2;
+ sem_args.max = 2;
+ sem_args.sem = 0xdeadbeef;
+ ret = ioctl(fd, NTSYNC_IOC_CREATE_SEM, &sem_args);
+ EXPECT_EQ(0, ret);
+ EXPECT_NE(0xdeadbeef, sem_args.sem);
+ sem = sem_args.sem;
+ check_sem_state(sem, 2, 2);
+
+ count = 0;
+ ret = post_sem(sem, &count);
+ EXPECT_EQ(0, ret);
+ EXPECT_EQ(2, count);
+ check_sem_state(sem, 2, 2);
+
+ count = 1;
+ ret = post_sem(sem, &count);
+ EXPECT_EQ(-1, ret);
+ EXPECT_EQ(EOVERFLOW, errno);
+ check_sem_state(sem, 2, 2);
+
+ ret = wait_any(fd, 1, &sem, 123, &index);
+ EXPECT_EQ(0, ret);
+ EXPECT_EQ(0, index);
+ check_sem_state(sem, 1, 2);
+
+ ret = wait_any(fd, 1, &sem, 123, &index);
+ EXPECT_EQ(0, ret);
+ EXPECT_EQ(0, index);
+ check_sem_state(sem, 0, 2);
+
+ ret = wait_any(fd, 1, &sem, 123, &index);
+ EXPECT_EQ(-1, ret);
+ EXPECT_EQ(ETIMEDOUT, errno);
+
+ count = 3;
+ ret = post_sem(sem, &count);
+ EXPECT_EQ(-1, ret);
+ EXPECT_EQ(EOVERFLOW, errno);
+ check_sem_state(sem, 0, 2);
+
+ count = 2;
+ ret = post_sem(sem, &count);
+ EXPECT_EQ(0, ret);
+ EXPECT_EQ(0, count);
+ check_sem_state(sem, 2, 2);
+
+ ret = wait_any(fd, 1, &sem, 123, &index);
+ EXPECT_EQ(0, ret);
+ ret = wait_any(fd, 1, &sem, 123, &index);
+ EXPECT_EQ(0, ret);
+
+ count = 1;
+ ret = post_sem(sem, &count);
+ EXPECT_EQ(0, ret);
+ EXPECT_EQ(0, count);
+ check_sem_state(sem, 1, 2);
+
+ count = ~0u;
+ ret = post_sem(sem, &count);
+ EXPECT_EQ(-1, ret);
+ EXPECT_EQ(EOVERFLOW, errno);
+ check_sem_state(sem, 1, 2);
+
+ close(sem);
+
+ close(fd);
+}
+
+TEST_HARNESS_MAIN
--
2.43.0


2024-04-16 01:23:44

by Elizabeth Figura

[permalink] [raw]
Subject: [PATCH v4 27/27] docs: ntsync: Add documentation for the ntsync uAPI.

Add an overall explanation of the driver architecture, and complete and precise
specification for its intended behaviour.

Reviewed-by: Bagas Sanjaya <[email protected]>
Signed-off-by: Elizabeth Figura <[email protected]>
---
Documentation/userspace-api/index.rst | 1 +
Documentation/userspace-api/ntsync.rst | 399 +++++++++++++++++++++++++
2 files changed, 400 insertions(+)
create mode 100644 Documentation/userspace-api/ntsync.rst

diff --git a/Documentation/userspace-api/index.rst b/Documentation/userspace-api/index.rst
index afecfe3cc4a8..d5745a500fa7 100644
--- a/Documentation/userspace-api/index.rst
+++ b/Documentation/userspace-api/index.rst
@@ -62,6 +62,7 @@ Everything else
vduse
futex2
perf_ring_buffer
+ ntsync

.. only:: subproject and html

diff --git a/Documentation/userspace-api/ntsync.rst b/Documentation/userspace-api/ntsync.rst
new file mode 100644
index 000000000000..202c2350d3af
--- /dev/null
+++ b/Documentation/userspace-api/ntsync.rst
@@ -0,0 +1,399 @@
+===================================
+NT synchronization primitive driver
+===================================
+
+This page documents the user-space API for the ntsync driver.
+
+ntsync is a support driver for emulation of NT synchronization
+primitives by user-space NT emulators. It exists because implementation
+in user-space, using existing tools, cannot match Windows performance
+while offering accurate semantics. It is implemented entirely in
+software, and does not drive any hardware device.
+
+This interface is meant as a compatibility tool only, and should not
+be used for general synchronization. Instead use generic, versatile
+interfaces such as futex(2) and poll(2).
+
+Synchronization primitives
+==========================
+
+The ntsync driver exposes three types of synchronization primitives:
+semaphores, mutexes, and events.
+
+A semaphore holds a single volatile 32-bit counter, and a static 32-bit
+integer denoting the maximum value. It is considered signaled when the
+counter is nonzero. The counter is decremented by one when a wait is
+satisfied. Both the initial and maximum count are established when the
+semaphore is created.
+
+A mutex holds a volatile 32-bit recursion count, and a volatile 32-bit
+identifier denoting its owner. A mutex is considered signaled when its
+owner is zero (indicating that it is not owned). The recursion count is
+incremented when a wait is satisfied, and ownership is set to the given
+identifier.
+
+A mutex also holds an internal flag denoting whether its previous owner
+has died; such a mutex is said to be abandoned. Owner death is not
+tracked automatically based on thread death, but rather must be
+communicated using ``NTSYNC_IOC_MUTEX_KILL``. An abandoned mutex is
+inherently considered unowned.
+
+Except for the "unowned" semantics of zero, the actual value of the
+owner identifier is not interpreted by the ntsync driver at all. The
+intended use is to store a thread identifier; however, the ntsync
+driver does not actually validate that a calling thread provides
+consistent or unique identifiers.
+
+An event holds a volatile boolean state denoting whether it is signaled
+or not. There are two types of events, auto-reset and manual-reset. An
+auto-reset event is designaled when a wait is satisfied; a manual-reset
+event is not. The event type is specified when the event is created.
+
+Unless specified otherwise, all operations on an object are atomic and
+totally ordered with respect to other operations on the same object.
+
+Objects are represented by files. When all file descriptors to an
+object are closed, that object is deleted.
+
+Char device
+===========
+
+The ntsync driver creates a single char device /dev/ntsync. Each file
+description opened on the device represents a unique instance intended
+to back an individual NT virtual machine. Objects created by one ntsync
+instance may only be used with other objects created by the same
+instance.
+
+ioctl reference
+===============
+
+All operations on the device are done through ioctls. There are four
+structures used in ioctl calls::
+
+ struct ntsync_sem_args {
+ __u32 sem;
+ __u32 count;
+ __u32 max;
+ };
+
+ struct ntsync_mutex_args {
+ __u32 mutex;
+ __u32 owner;
+ __u32 count;
+ };
+
+ struct ntsync_event_args {
+ __u32 event;
+ __u32 signaled;
+ __u32 manual;
+ };
+
+ struct ntsync_wait_args {
+ __u64 timeout;
+ __u64 objs;
+ __u32 count;
+ __u32 owner;
+ __u32 index;
+ __u32 alert;
+ __u32 flags;
+ __u32 pad;
+ };
+
+Depending on the ioctl, members of the structure may be used as input,
+output, or not at all. All ioctls return 0 on success.
+
+The ioctls on the device file are as follows:
+
+.. c:macro:: NTSYNC_IOC_CREATE_SEM
+
+ Create a semaphore object. Takes a pointer to struct
+ :c:type:`ntsync_sem_args`, which is used as follows:
+
+ .. list-table::
+
+ * - ``sem``
+ - On output, contains a file descriptor to the created semaphore.
+ * - ``count``
+ - Initial count of the semaphore.
+ * - ``max``
+ - Maximum count of the semaphore.
+
+ Fails with ``EINVAL`` if ``count`` is greater than ``max``.
+
+.. c:macro:: NTSYNC_IOC_CREATE_MUTEX
+
+ Create a mutex object. Takes a pointer to struct
+ :c:type:`ntsync_mutex_args`, which is used as follows:
+
+ .. list-table::
+
+ * - ``mutex``
+ - On output, contains a file descriptor to the created mutex.
+ * - ``count``
+ - Initial recursion count of the mutex.
+ * - ``owner``
+ - Initial owner of the mutex.
+
+ If ``owner`` is nonzero and ``count`` is zero, or if ``owner`` is
+ zero and ``count`` is nonzero, the function fails with ``EINVAL``.
+
+.. c:macro:: NTSYNC_IOC_CREATE_EVENT
+
+ Create an event object. Takes a pointer to struct
+ :c:type:`ntsync_event_args`, which is used as follows:
+
+ .. list-table::
+
+ * - ``event``
+ - On output, contains a file descriptor to the created event.
+ * - ``signaled``
+ - If nonzero, the event is initially signaled, otherwise
+ nonsignaled.
+ * - ``manual``
+ - If nonzero, the event is a manual-reset event, otherwise
+ auto-reset.
+
+The ioctls on the individual objects are as follows:
+
+.. c:macro:: NTSYNC_IOC_SEM_POST
+
+ Post to a semaphore object. Takes a pointer to a 32-bit integer,
+ which on input holds the count to be added to the semaphore, and on
+ output contains its previous count.
+
+ If adding to the semaphore's current count would raise the latter
+ past the semaphore's maximum count, the ioctl fails with
+ ``EOVERFLOW`` and the semaphore is not affected. If raising the
+ semaphore's count causes it to become signaled, eligible threads
+ waiting on this semaphore will be woken and the semaphore's count
+ decremented appropriately.
+
+.. c:macro:: NTSYNC_IOC_MUTEX_UNLOCK
+
+ Release a mutex object. Takes a pointer to struct
+ :c:type:`ntsync_mutex_args`, which is used as follows:
+
+ .. list-table::
+
+ * - ``mutex``
+ - Ignored.
+ * - ``owner``
+ - Specifies the owner trying to release this mutex.
+ * - ``count``
+ - On output, contains the previous recursion count.
+
+ If ``owner`` is zero, the ioctl fails with ``EINVAL``. If ``owner``
+ is not the current owner of the mutex, the ioctl fails with
+ ``EPERM``.
+
+ The mutex's count will be decremented by one. If decrementing the
+ mutex's count causes it to become zero, the mutex is marked as
+ unowned and signaled, and eligible threads waiting on it will be
+ woken as appropriate.
+
+.. c:macro:: NTSYNC_IOC_SET_EVENT
+
+ Signal an event object. Takes a pointer to a 32-bit integer, which on
+ output contains the previous state of the event.
+
+ Eligible threads will be woken, and auto-reset events will be
+ designaled appropriately.
+
+.. c:macro:: NTSYNC_IOC_RESET_EVENT
+
+ Designal an event object. Takes a pointer to a 32-bit integer, which
+ on output contains the previous state of the event.
+
+.. c:macro:: NTSYNC_IOC_PULSE_EVENT
+
+ Wake threads waiting on an event object while leaving it in an
+ unsignaled state. Takes a pointer to a 32-bit integer, which on
+ output contains the previous state of the event.
+
+ A pulse operation can be thought of as a set followed by a reset,
+ performed as a single atomic operation. If two threads are waiting on
+ an auto-reset event which is pulsed, only one will be woken. If two
+ threads are waiting a manual-reset event which is pulsed, both will
+ be woken. However, in both cases, the event will be unsignaled
+ afterwards, and a simultaneous read operation will always report the
+ event as unsignaled.
+
+.. c:macro:: NTSYNC_IOC_READ_SEM
+
+ Read the current state of a semaphore object. Takes a pointer to
+ struct :c:type:`ntsync_sem_args`, which is used as follows:
+
+ .. list-table::
+
+ * - ``sem``
+ - Ignored.
+ * - ``count``
+ - On output, contains the current count of the semaphore.
+ * - ``max``
+ - On output, contains the maximum count of the semaphore.
+
+.. c:macro:: NTSYNC_IOC_READ_MUTEX
+
+ Read the current state of a mutex object. Takes a pointer to struct
+ :c:type:`ntsync_mutex_args`, which is used as follows:
+
+ .. list-table::
+
+ * - ``mutex``
+ - Ignored.
+ * - ``owner``
+ - On output, contains the current owner of the mutex, or zero
+ if the mutex is not currently owned.
+ * - ``count``
+ - On output, contains the current recursion count of the mutex.
+
+ If the mutex is marked as abandoned, the function fails with
+ ``EOWNERDEAD``. In this case, ``count`` and ``owner`` are set to
+ zero.
+
+.. c:macro:: NTSYNC_IOC_READ_EVENT
+
+ Read the current state of an event object. Takes a pointer to struct
+ :c:type:`ntsync_event_args`, which is used as follows:
+
+ .. list-table::
+
+ * - ``event``
+ - Ignored.
+ * - ``signaled``
+ - On output, contains the current state of the event.
+ * - ``manual``
+ - On output, contains 1 if the event is a manual-reset event,
+ and 0 otherwise.
+
+.. c:macro:: NTSYNC_IOC_KILL_OWNER
+
+ Mark a mutex as unowned and abandoned if it is owned by the given
+ owner. Takes an input-only pointer to a 32-bit integer denoting the
+ owner. If the owner is zero, the ioctl fails with ``EINVAL``. If the
+ owner does not own the mutex, the function fails with ``EPERM``.
+
+ Eligible threads waiting on the mutex will be woken as appropriate
+ (and such waits will fail with ``EOWNERDEAD``, as described below).
+
+.. c:macro:: NTSYNC_IOC_WAIT_ANY
+
+ Poll on any of a list of objects, atomically acquiring at most one.
+ Takes a pointer to struct :c:type:`ntsync_wait_args`, which is
+ used as follows:
+
+ .. list-table::
+
+ * - ``timeout``
+ - Absolute timeout in nanoseconds. If ``NTSYNC_WAIT_REALTIME``
+ is set, the timeout is measured against the REALTIME clock;
+ otherwise it is measured against the MONOTONIC clock. If the
+ timeout is equal to or earlier than the current time, the
+ function returns immediately without sleeping. If ``timeout``
+ is U64_MAX, the function will sleep until an object is
+ signaled, and will not fail with ``ETIMEDOUT``.
+ * - ``objs``
+ - Pointer to an array of ``count`` file descriptors
+ (specified as an integer so that the structure has the same
+ size regardless of architecture). If any object is
+ invalid, the function fails with ``EINVAL``.
+ * - ``count``
+ - Number of objects specified in the ``objs`` array.
+ If greater than ``NTSYNC_MAX_WAIT_COUNT``, the function fails
+ with ``EINVAL``.
+ * - ``owner``
+ - Mutex owner identifier. If any object in ``objs`` is a mutex,
+ the ioctl will attempt to acquire that mutex on behalf of
+ ``owner``. If ``owner`` is zero, the ioctl fails with
+ ``EINVAL``.
+ * - ``index``
+ - On success, contains the index (into ``objs``) of the object
+ which was signaled. If ``alert`` was signaled instead,
+ this contains ``count``.
+ * - ``alert``
+ - Optional event object file descriptor. If nonzero, this
+ specifies an "alert" event object which, if signaled, will
+ terminate the wait. If nonzero, the identifier must point to a
+ valid event.
+ * - ``flags``
+ - Zero or more flags. Currently the only flag is
+ ``NTSYNC_WAIT_REALTIME``, which causes the timeout to be
+ measured against the REALTIME clock instead of MONOTONIC.
+ * - ``pad``
+ - Unused, must be set to zero.
+
+ This function attempts to acquire one of the given objects. If unable
+ to do so, it sleeps until an object becomes signaled, subsequently
+ acquiring it, or the timeout expires. In the latter case the ioctl
+ fails with ``ETIMEDOUT``. The function only acquires one object, even
+ if multiple objects are signaled.
+
+ A semaphore is considered to be signaled if its count is nonzero, and
+ is acquired by decrementing its count by one. A mutex is considered
+ to be signaled if it is unowned or if its owner matches the ``owner``
+ argument, and is acquired by incrementing its recursion count by one
+ and setting its owner to the ``owner`` argument. An auto-reset event
+ is acquired by designaling it; a manual-reset event is not affected
+ by acquisition.
+
+ Acquisition is atomic and totally ordered with respect to other
+ operations on the same object. If two wait operations (with different
+ ``owner`` identifiers) are queued on the same mutex, only one is
+ signaled. If two wait operations are queued on the same semaphore,
+ and a value of one is posted to it, only one is signaled. The order
+ in which threads are signaled is not specified.
+
+ If an abandoned mutex is acquired, the ioctl fails with
+ ``EOWNERDEAD``. Although this is a failure return, the function may
+ otherwise be considered successful. The mutex is marked as owned by
+ the given owner (with a recursion count of 1) and as no longer
+ abandoned, and ``index`` is still set to the index of the mutex.
+
+ The ``alert`` argument is an "extra" event which can terminate the
+ wait, independently of all other objects. If members of ``objs`` and
+ ``alert`` are both simultaneously signaled, a member of ``objs`` will
+ always be given priority and acquired first.
+
+ It is valid to pass the same object more than once, including by
+ passing the same event in the ``objs`` array and in ``alert``. If a
+ wakeup occurs due to that object being signaled, ``index`` is set to
+ the lowest index corresponding to that object.
+
+ The function may fail with ``EINTR`` if a signal is received.
+
+.. c:macro:: NTSYNC_IOC_WAIT_ALL
+
+ Poll on a list of objects, atomically acquiring all of them. Takes a
+ pointer to struct :c:type:`ntsync_wait_args`, which is used
+ identically to ``NTSYNC_IOC_WAIT_ANY``, except that ``index`` is
+ always filled with zero on success if not woken via alert.
+
+ This function attempts to simultaneously acquire all of the given
+ objects. If unable to do so, it sleeps until all objects become
+ simultaneously signaled, subsequently acquiring them, or the timeout
+ expires. In the latter case the ioctl fails with ``ETIMEDOUT`` and no
+ objects are modified.
+
+ Objects may become signaled and subsequently designaled (through
+ acquisition by other threads) while this thread is sleeping. Only
+ once all objects are simultaneously signaled does the ioctl acquire
+ them and return. The entire acquisition is atomic and totally ordered
+ with respect to other operations on any of the given objects.
+
+ If an abandoned mutex is acquired, the ioctl fails with
+ ``EOWNERDEAD``. Similarly to ``NTSYNC_IOC_WAIT_ANY``, all objects are
+ nevertheless marked as acquired. Note that if multiple mutex objects
+ are specified, there is no way to know which were marked as
+ abandoned.
+
+ As with "any" waits, the ``alert`` argument is an "extra" event which
+ can terminate the wait. Critically, however, an "all" wait will
+ succeed if all members in ``objs`` are signaled, *or* if ``alert`` is
+ signaled. In the latter case ``index`` will be set to ``count``. As
+ with "any" waits, if both conditions are filled, the former takes
+ priority, and objects in ``objs`` will be acquired.
+
+ Unlike ``NTSYNC_IOC_WAIT_ANY``, it is not valid to pass the same
+ object more than once, nor is it valid to pass the same object in
+ ``objs`` and in ``alert``. If this is attempted, the function fails
+ with ``EINVAL``.
--
2.43.0


2024-04-16 01:23:58

by Elizabeth Figura

[permalink] [raw]
Subject: [PATCH v4 16/27] selftests: ntsync: Add some tests for NTSYNC_IOC_WAIT_ANY.

Test basic synchronous functionality of NTSYNC_IOC_WAIT_ANY, when objects are
considered signaled or not signaled, and how they are affected by a successful
wait.

Signed-off-by: Elizabeth Figura <[email protected]>
---
.../testing/selftests/drivers/ntsync/ntsync.c | 119 ++++++++++++++++++
1 file changed, 119 insertions(+)

diff --git a/tools/testing/selftests/drivers/ntsync/ntsync.c b/tools/testing/selftests/drivers/ntsync/ntsync.c
index 7cd0f40594fd..40ad8cbd3138 100644
--- a/tools/testing/selftests/drivers/ntsync/ntsync.c
+++ b/tools/testing/selftests/drivers/ntsync/ntsync.c
@@ -342,4 +342,123 @@ TEST(mutex_state)
close(fd);
}

+TEST(test_wait_any)
+{
+ int objs[NTSYNC_MAX_WAIT_COUNT + 1], fd, ret;
+ struct ntsync_mutex_args mutex_args = {0};
+ struct ntsync_sem_args sem_args = {0};
+ __u32 owner, index, count, i;
+ struct timespec timeout;
+
+ clock_gettime(CLOCK_MONOTONIC, &timeout);
+
+ fd = open("/dev/ntsync", O_CLOEXEC | O_RDONLY);
+ ASSERT_LE(0, fd);
+
+ sem_args.count = 2;
+ sem_args.max = 3;
+ sem_args.sem = 0xdeadbeef;
+ ret = ioctl(fd, NTSYNC_IOC_CREATE_SEM, &sem_args);
+ EXPECT_EQ(0, ret);
+ EXPECT_NE(0xdeadbeef, sem_args.sem);
+
+ mutex_args.owner = 0;
+ mutex_args.count = 0;
+ mutex_args.mutex = 0xdeadbeef;
+ ret = ioctl(fd, NTSYNC_IOC_CREATE_MUTEX, &mutex_args);
+ EXPECT_EQ(0, ret);
+ EXPECT_NE(0xdeadbeef, mutex_args.mutex);
+
+ objs[0] = sem_args.sem;
+ objs[1] = mutex_args.mutex;
+
+ ret = wait_any(fd, 2, objs, 123, &index);
+ EXPECT_EQ(0, ret);
+ EXPECT_EQ(0, index);
+ check_sem_state(sem_args.sem, 1, 3);
+ check_mutex_state(mutex_args.mutex, 0, 0);
+
+ ret = wait_any(fd, 2, objs, 123, &index);
+ EXPECT_EQ(0, ret);
+ EXPECT_EQ(0, index);
+ check_sem_state(sem_args.sem, 0, 3);
+ check_mutex_state(mutex_args.mutex, 0, 0);
+
+ ret = wait_any(fd, 2, objs, 123, &index);
+ EXPECT_EQ(0, ret);
+ EXPECT_EQ(1, index);
+ check_sem_state(sem_args.sem, 0, 3);
+ check_mutex_state(mutex_args.mutex, 1, 123);
+
+ count = 1;
+ ret = post_sem(sem_args.sem, &count);
+ EXPECT_EQ(0, ret);
+ EXPECT_EQ(0, count);
+
+ ret = wait_any(fd, 2, objs, 123, &index);
+ EXPECT_EQ(0, ret);
+ EXPECT_EQ(0, index);
+ check_sem_state(sem_args.sem, 0, 3);
+ check_mutex_state(mutex_args.mutex, 1, 123);
+
+ ret = wait_any(fd, 2, objs, 123, &index);
+ EXPECT_EQ(0, ret);
+ EXPECT_EQ(1, index);
+ check_sem_state(sem_args.sem, 0, 3);
+ check_mutex_state(mutex_args.mutex, 2, 123);
+
+ ret = wait_any(fd, 2, objs, 456, &index);
+ EXPECT_EQ(-1, ret);
+ EXPECT_EQ(ETIMEDOUT, errno);
+
+ owner = 123;
+ ret = ioctl(mutex_args.mutex, NTSYNC_IOC_MUTEX_KILL, &owner);
+ EXPECT_EQ(0, ret);
+
+ ret = wait_any(fd, 2, objs, 456, &index);
+ EXPECT_EQ(-1, ret);
+ EXPECT_EQ(EOWNERDEAD, errno);
+ EXPECT_EQ(1, index);
+
+ ret = wait_any(fd, 2, objs, 456, &index);
+ EXPECT_EQ(0, ret);
+ EXPECT_EQ(1, index);
+
+ /* test waiting on the same object twice */
+ count = 2;
+ ret = post_sem(sem_args.sem, &count);
+ EXPECT_EQ(0, ret);
+ EXPECT_EQ(0, count);
+
+ objs[0] = objs[1] = sem_args.sem;
+ ret = wait_any(fd, 2, objs, 456, &index);
+ EXPECT_EQ(0, ret);
+ EXPECT_EQ(0, index);
+ check_sem_state(sem_args.sem, 1, 3);
+
+ ret = wait_any(fd, 0, NULL, 456, &index);
+ EXPECT_EQ(-1, ret);
+ EXPECT_EQ(ETIMEDOUT, errno);
+
+ for (i = 0; i < NTSYNC_MAX_WAIT_COUNT + 1; ++i)
+ objs[i] = sem_args.sem;
+
+ ret = wait_any(fd, NTSYNC_MAX_WAIT_COUNT, objs, 123, &index);
+ EXPECT_EQ(0, ret);
+ EXPECT_EQ(0, index);
+
+ ret = wait_any(fd, NTSYNC_MAX_WAIT_COUNT + 1, objs, 123, &index);
+ EXPECT_EQ(-1, ret);
+ EXPECT_EQ(EINVAL, errno);
+
+ ret = wait_any(fd, -1, objs, 123, &index);
+ EXPECT_EQ(-1, ret);
+ EXPECT_EQ(EINVAL, errno);
+
+ close(sem_args.sem);
+ close(mutex_args.mutex);
+
+ close(fd);
+}
+
TEST_HARNESS_MAIN
--
2.43.0


2024-04-16 01:33:31

by Elizabeth Figura

[permalink] [raw]
Subject: [PATCH v4 01/27] ntsync: Introduce NTSYNC_IOC_WAIT_ANY.

This corresponds to part of the functionality of the NT syscall
NtWaitForMultipleObjects(). Specifically, it implements the behaviour where
the third argument (wait_any) is TRUE, and it does not handle alertable waits.
Those features have been split out into separate patches to ease review.

This patch therefore implements the wait/wake infrastructure which comprises the
core of ntsync's functionality.

NTSYNC_IOC_WAIT_ANY is a vectored wait function similar to poll(). Unlike
poll(), it "consumes" objects when they are signaled. For semaphores, this means
decreasing one from the internal counter. At most one object can be consumed by
this function.

This wait/wake model is fundamentally different from that used anywhere else in
the kernel, and for that reason ntsync does not use any existing infrastructure,
such as futexes, kernel mutexes or semaphores, or wait_event().

Up to 64 objects can be waited on at once. As soon as one is signaled, the
object with the lowest index is consumed, and that index is returned via the
"index" field.

A timeout is supported. The timeout is passed as a u64 nanosecond value, which
represents absolute time measured against either the MONOTONIC or REALTIME clock
(controlled by the flags argument). If U64_MAX is passed, the ioctl waits
indefinitely.

This ioctl validates that all objects belong to the relevant device. This is not
necessary for any technical reason related to NTSYNC_IOC_WAIT_ANY, but will be
necessary for NTSYNC_IOC_WAIT_ALL introduced in the following patch.

Two u32s of padding are left in the ntsync_wait_args structure; one will be used
by a patch later in the series (which is split out to ease review).

Signed-off-by: Elizabeth Figura <[email protected]>
---
drivers/misc/ntsync.c | 250 ++++++++++++++++++++++++++++++++++++
include/uapi/linux/ntsync.h | 16 +++
2 files changed, 266 insertions(+)

diff --git a/drivers/misc/ntsync.c b/drivers/misc/ntsync.c
index 3c2f743c58b0..c6f84a5fc8c0 100644
--- a/drivers/misc/ntsync.c
+++ b/drivers/misc/ntsync.c
@@ -6,11 +6,16 @@
*/

#include <linux/anon_inodes.h>
+#include <linux/atomic.h>
#include <linux/file.h>
#include <linux/fs.h>
+#include <linux/hrtimer.h>
+#include <linux/ktime.h>
#include <linux/miscdevice.h>
#include <linux/module.h>
#include <linux/overflow.h>
+#include <linux/sched.h>
+#include <linux/sched/signal.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
#include <uapi/linux/ntsync.h>
@@ -30,6 +35,8 @@ enum ntsync_type {
*
* Both rely on struct file for reference counting. Individual
* ntsync_obj objects take a reference to the device when created.
+ * Wait operations take a reference to each object being waited on for
+ * the duration of the wait.
*/

struct ntsync_obj {
@@ -47,12 +54,56 @@ struct ntsync_obj {
__u32 max;
} sem;
} u;
+
+ struct list_head any_waiters;
+};
+
+struct ntsync_q_entry {
+ struct list_head node;
+ struct ntsync_q *q;
+ struct ntsync_obj *obj;
+ __u32 index;
+};
+
+struct ntsync_q {
+ struct task_struct *task;
+ __u32 owner;
+
+ /*
+ * Protected via atomic_try_cmpxchg(). Only the thread that wins the
+ * compare-and-swap may actually change object states and wake this
+ * task.
+ */
+ atomic_t signaled;
+
+ __u32 count;
+ struct ntsync_q_entry entries[];
};

struct ntsync_device {
struct file *file;
};

+static void try_wake_any_sem(struct ntsync_obj *sem)
+{
+ struct ntsync_q_entry *entry;
+
+ lockdep_assert_held(&sem->lock);
+
+ list_for_each_entry(entry, &sem->any_waiters, node) {
+ struct ntsync_q *q = entry->q;
+ int signaled = -1;
+
+ if (!sem->u.sem.count)
+ break;
+
+ if (atomic_try_cmpxchg(&q->signaled, &signaled, entry->index)) {
+ sem->u.sem.count--;
+ wake_up_process(q->task);
+ }
+ }
+}
+
/*
* Actually change the semaphore state, returning -EOVERFLOW if it is made
* invalid.
@@ -88,6 +139,8 @@ static int ntsync_sem_post(struct ntsync_obj *sem, void __user *argp)

prev_count = sem->u.sem.count;
ret = post_sem_state(sem, args);
+ if (!ret)
+ try_wake_any_sem(sem);

spin_unlock(&sem->lock);

@@ -141,6 +194,7 @@ static struct ntsync_obj *ntsync_alloc_obj(struct ntsync_device *dev,
obj->dev = dev;
get_file(dev->file);
spin_lock_init(&obj->lock);
+ INIT_LIST_HEAD(&obj->any_waiters);

return obj;
}
@@ -191,6 +245,200 @@ static int ntsync_create_sem(struct ntsync_device *dev, void __user *argp)
return put_user(fd, &user_args->sem);
}

+static struct ntsync_obj *get_obj(struct ntsync_device *dev, int fd)
+{
+ struct file *file = fget(fd);
+ struct ntsync_obj *obj;
+
+ if (!file)
+ return NULL;
+
+ if (file->f_op != &ntsync_obj_fops) {
+ fput(file);
+ return NULL;
+ }
+
+ obj = file->private_data;
+ if (obj->dev != dev) {
+ fput(file);
+ return NULL;
+ }
+
+ return obj;
+}
+
+static void put_obj(struct ntsync_obj *obj)
+{
+ fput(obj->file);
+}
+
+static int ntsync_schedule(const struct ntsync_q *q, const struct ntsync_wait_args *args)
+{
+ ktime_t timeout = ns_to_ktime(args->timeout);
+ clockid_t clock = CLOCK_MONOTONIC;
+ ktime_t *timeout_ptr;
+ int ret = 0;
+
+ timeout_ptr = (args->timeout == U64_MAX ? NULL : &timeout);
+
+ if (args->flags & NTSYNC_WAIT_REALTIME)
+ clock = CLOCK_REALTIME;
+
+ do {
+ if (signal_pending(current)) {
+ ret = -ERESTARTSYS;
+ break;
+ }
+
+ set_current_state(TASK_INTERRUPTIBLE);
+ if (atomic_read(&q->signaled) != -1) {
+ ret = 0;
+ break;
+ }
+ ret = schedule_hrtimeout_range_clock(timeout_ptr, 0, HRTIMER_MODE_ABS, clock);
+ } while (ret < 0);
+ __set_current_state(TASK_RUNNING);
+
+ return ret;
+}
+
+/*
+ * Allocate and initialize the ntsync_q structure, but do not queue us yet.
+ */
+static int setup_wait(struct ntsync_device *dev,
+ const struct ntsync_wait_args *args,
+ struct ntsync_q **ret_q)
+{
+ const __u32 count = args->count;
+ int fds[NTSYNC_MAX_WAIT_COUNT];
+ struct ntsync_q *q;
+ __u32 i, j;
+
+ if (!args->owner)
+ return -EINVAL;
+
+ if (args->pad || args->pad2 || (args->flags & ~NTSYNC_WAIT_REALTIME))
+ return -EINVAL;
+
+ if (args->count > NTSYNC_MAX_WAIT_COUNT)
+ return -EINVAL;
+
+ if (copy_from_user(fds, u64_to_user_ptr(args->objs),
+ array_size(count, sizeof(*fds))))
+ return -EFAULT;
+
+ q = kmalloc(struct_size(q, entries, count), GFP_KERNEL);
+ if (!q)
+ return -ENOMEM;
+ q->task = current;
+ q->owner = args->owner;
+ atomic_set(&q->signaled, -1);
+ q->count = count;
+
+ for (i = 0; i < count; i++) {
+ struct ntsync_q_entry *entry = &q->entries[i];
+ struct ntsync_obj *obj = get_obj(dev, fds[i]);
+
+ if (!obj)
+ goto err;
+
+ entry->obj = obj;
+ entry->q = q;
+ entry->index = i;
+ }
+
+ *ret_q = q;
+ return 0;
+
+err:
+ for (j = 0; j < i; j++)
+ put_obj(q->entries[j].obj);
+ kfree(q);
+ return -EINVAL;
+}
+
+static void try_wake_any_obj(struct ntsync_obj *obj)
+{
+ switch (obj->type) {
+ case NTSYNC_TYPE_SEM:
+ try_wake_any_sem(obj);
+ break;
+ }
+}
+
+static int ntsync_wait_any(struct ntsync_device *dev, void __user *argp)
+{
+ struct ntsync_wait_args args;
+ struct ntsync_q *q;
+ int signaled;
+ __u32 i;
+ int ret;
+
+ if (copy_from_user(&args, argp, sizeof(args)))
+ return -EFAULT;
+
+ ret = setup_wait(dev, &args, &q);
+ if (ret < 0)
+ return ret;
+
+ /* queue ourselves */
+
+ for (i = 0; i < args.count; i++) {
+ struct ntsync_q_entry *entry = &q->entries[i];
+ struct ntsync_obj *obj = entry->obj;
+
+ spin_lock(&obj->lock);
+ list_add_tail(&entry->node, &obj->any_waiters);
+ spin_unlock(&obj->lock);
+ }
+
+ /* check if we are already signaled */
+
+ for (i = 0; i < args.count; i++) {
+ struct ntsync_obj *obj = q->entries[i].obj;
+
+ if (atomic_read(&q->signaled) != -1)
+ break;
+
+ spin_lock(&obj->lock);
+ try_wake_any_obj(obj);
+ spin_unlock(&obj->lock);
+ }
+
+ /* sleep */
+
+ ret = ntsync_schedule(q, &args);
+
+ /* and finally, unqueue */
+
+ for (i = 0; i < args.count; i++) {
+ struct ntsync_q_entry *entry = &q->entries[i];
+ struct ntsync_obj *obj = entry->obj;
+
+ spin_lock(&obj->lock);
+ list_del(&entry->node);
+ spin_unlock(&obj->lock);
+
+ put_obj(obj);
+ }
+
+ signaled = atomic_read(&q->signaled);
+ if (signaled != -1) {
+ struct ntsync_wait_args __user *user_args = argp;
+
+ /* even if we caught a signal, we need to communicate success */
+ ret = 0;
+
+ if (put_user(signaled, &user_args->index))
+ ret = -EFAULT;
+ } else if (!ret) {
+ ret = -ETIMEDOUT;
+ }
+
+ kfree(q);
+ return ret;
+}
+
static int ntsync_char_open(struct inode *inode, struct file *file)
{
struct ntsync_device *dev;
@@ -222,6 +470,8 @@ static long ntsync_char_ioctl(struct file *file, unsigned int cmd,
switch (cmd) {
case NTSYNC_IOC_CREATE_SEM:
return ntsync_create_sem(dev, argp);
+ case NTSYNC_IOC_WAIT_ANY:
+ return ntsync_wait_any(dev, argp);
default:
return -ENOIOCTLCMD;
}
diff --git a/include/uapi/linux/ntsync.h b/include/uapi/linux/ntsync.h
index dcfa38fdc93c..60ad414b5552 100644
--- a/include/uapi/linux/ntsync.h
+++ b/include/uapi/linux/ntsync.h
@@ -16,7 +16,23 @@ struct ntsync_sem_args {
__u32 max;
};

+#define NTSYNC_WAIT_REALTIME 0x1
+
+struct ntsync_wait_args {
+ __u64 timeout;
+ __u64 objs;
+ __u32 count;
+ __u32 owner;
+ __u32 index;
+ __u32 flags;
+ __u32 pad;
+ __u32 pad2;
+};
+
+#define NTSYNC_MAX_WAIT_COUNT 64
+
#define NTSYNC_IOC_CREATE_SEM _IOWR('N', 0x80, struct ntsync_sem_args)
+#define NTSYNC_IOC_WAIT_ANY _IOWR('N', 0x82, struct ntsync_wait_args)

#define NTSYNC_IOC_SEM_POST _IOWR('N', 0x81, __u32)

--
2.43.0


2024-04-16 01:37:00

by Elizabeth Figura

[permalink] [raw]
Subject: [PATCH v4 17/27] selftests: ntsync: Add some tests for NTSYNC_IOC_WAIT_ALL.

Test basic synchronous functionality of NTSYNC_IOC_WAIT_ALL, and when objects
are considered simultaneously signaled.

Signed-off-by: Elizabeth Figura <[email protected]>
---
.../testing/selftests/drivers/ntsync/ntsync.c | 99 ++++++++++++++++++-
1 file changed, 97 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/drivers/ntsync/ntsync.c b/tools/testing/selftests/drivers/ntsync/ntsync.c
index 40ad8cbd3138..c0f372167557 100644
--- a/tools/testing/selftests/drivers/ntsync/ntsync.c
+++ b/tools/testing/selftests/drivers/ntsync/ntsync.c
@@ -73,7 +73,8 @@ static int unlock_mutex(int mutex, __u32 owner, __u32 *count)
return ret;
}

-static int wait_any(int fd, __u32 count, const int *objs, __u32 owner, __u32 *index)
+static int wait_objs(int fd, unsigned long request, __u32 count,
+ const int *objs, __u32 owner, __u32 *index)
{
struct ntsync_wait_args args = {0};
struct timespec timeout;
@@ -86,11 +87,21 @@ static int wait_any(int fd, __u32 count, const int *objs, __u32 owner, __u32 *in
args.objs = (uintptr_t)objs;
args.owner = owner;
args.index = 0xdeadbeef;
- ret = ioctl(fd, NTSYNC_IOC_WAIT_ANY, &args);
+ ret = ioctl(fd, request, &args);
*index = args.index;
return ret;
}

+static int wait_any(int fd, __u32 count, const int *objs, __u32 owner, __u32 *index)
+{
+ return wait_objs(fd, NTSYNC_IOC_WAIT_ANY, count, objs, owner, index);
+}
+
+static int wait_all(int fd, __u32 count, const int *objs, __u32 owner, __u32 *index)
+{
+ return wait_objs(fd, NTSYNC_IOC_WAIT_ALL, count, objs, owner, index);
+}
+
TEST(semaphore_state)
{
struct ntsync_sem_args sem_args;
@@ -461,4 +472,88 @@ TEST(test_wait_any)
close(fd);
}

+TEST(test_wait_all)
+{
+ struct ntsync_mutex_args mutex_args = {0};
+ struct ntsync_sem_args sem_args = {0};
+ __u32 owner, index, count;
+ int objs[2], fd, ret;
+
+ fd = open("/dev/ntsync", O_CLOEXEC | O_RDONLY);
+ ASSERT_LE(0, fd);
+
+ sem_args.count = 2;
+ sem_args.max = 3;
+ sem_args.sem = 0xdeadbeef;
+ ret = ioctl(fd, NTSYNC_IOC_CREATE_SEM, &sem_args);
+ EXPECT_EQ(0, ret);
+ EXPECT_NE(0xdeadbeef, sem_args.sem);
+
+ mutex_args.owner = 0;
+ mutex_args.count = 0;
+ mutex_args.mutex = 0xdeadbeef;
+ ret = ioctl(fd, NTSYNC_IOC_CREATE_MUTEX, &mutex_args);
+ EXPECT_EQ(0, ret);
+ EXPECT_NE(0xdeadbeef, mutex_args.mutex);
+
+ objs[0] = sem_args.sem;
+ objs[1] = mutex_args.mutex;
+
+ ret = wait_all(fd, 2, objs, 123, &index);
+ EXPECT_EQ(0, ret);
+ EXPECT_EQ(0, index);
+ check_sem_state(sem_args.sem, 1, 3);
+ check_mutex_state(mutex_args.mutex, 1, 123);
+
+ ret = wait_all(fd, 2, objs, 456, &index);
+ EXPECT_EQ(-1, ret);
+ EXPECT_EQ(ETIMEDOUT, errno);
+ check_sem_state(sem_args.sem, 1, 3);
+ check_mutex_state(mutex_args.mutex, 1, 123);
+
+ ret = wait_all(fd, 2, objs, 123, &index);
+ EXPECT_EQ(0, ret);
+ EXPECT_EQ(0, index);
+ check_sem_state(sem_args.sem, 0, 3);
+ check_mutex_state(mutex_args.mutex, 2, 123);
+
+ ret = wait_all(fd, 2, objs, 123, &index);
+ EXPECT_EQ(-1, ret);
+ EXPECT_EQ(ETIMEDOUT, errno);
+ check_sem_state(sem_args.sem, 0, 3);
+ check_mutex_state(mutex_args.mutex, 2, 123);
+
+ count = 3;
+ ret = post_sem(sem_args.sem, &count);
+ EXPECT_EQ(0, ret);
+ EXPECT_EQ(0, count);
+
+ ret = wait_all(fd, 2, objs, 123, &index);
+ EXPECT_EQ(0, ret);
+ EXPECT_EQ(0, index);
+ check_sem_state(sem_args.sem, 2, 3);
+ check_mutex_state(mutex_args.mutex, 3, 123);
+
+ owner = 123;
+ ret = ioctl(mutex_args.mutex, NTSYNC_IOC_MUTEX_KILL, &owner);
+ EXPECT_EQ(0, ret);
+
+ ret = wait_all(fd, 2, objs, 123, &index);
+ EXPECT_EQ(-1, ret);
+ EXPECT_EQ(EOWNERDEAD, errno);
+ check_sem_state(sem_args.sem, 1, 3);
+ check_mutex_state(mutex_args.mutex, 1, 123);
+
+ /* test waiting on the same object twice */
+ objs[0] = objs[1] = sem_args.sem;
+ ret = wait_all(fd, 2, objs, 123, &index);
+ EXPECT_EQ(-1, ret);
+ EXPECT_EQ(EINVAL, errno);
+
+ close(sem_args.sem);
+ close(mutex_args.mutex);
+
+ close(fd);
+}
+
TEST_HARNESS_MAIN
--
2.43.0


2024-04-16 01:37:09

by Elizabeth Figura

[permalink] [raw]
Subject: [PATCH v4 23/27] selftests: ntsync: Add tests for alertable waits.

Test the "alert" functionality of NTSYNC_IOC_WAIT_ALL and NTSYNC_IOC_WAIT_ANY,
when a wait is woken with an alert and when it is woken by an object.

Signed-off-by: Elizabeth Figura <[email protected]>
---
.../testing/selftests/drivers/ntsync/ntsync.c | 179 +++++++++++++++++-
1 file changed, 176 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/drivers/ntsync/ntsync.c b/tools/testing/selftests/drivers/ntsync/ntsync.c
index 5d17eff6a370..5465a16d38b3 100644
--- a/tools/testing/selftests/drivers/ntsync/ntsync.c
+++ b/tools/testing/selftests/drivers/ntsync/ntsync.c
@@ -95,7 +95,7 @@ static int read_event_state(int event, __u32 *signaled, __u32 *manual)
})

static int wait_objs(int fd, unsigned long request, __u32 count,
- const int *objs, __u32 owner, __u32 *index)
+ const int *objs, __u32 owner, int alert, __u32 *index)
{
struct ntsync_wait_args args = {0};
struct timespec timeout;
@@ -108,6 +108,7 @@ static int wait_objs(int fd, unsigned long request, __u32 count,
args.objs = (uintptr_t)objs;
args.owner = owner;
args.index = 0xdeadbeef;
+ args.alert = alert;
ret = ioctl(fd, request, &args);
*index = args.index;
return ret;
@@ -115,12 +116,26 @@ static int wait_objs(int fd, unsigned long request, __u32 count,

static int wait_any(int fd, __u32 count, const int *objs, __u32 owner, __u32 *index)
{
- return wait_objs(fd, NTSYNC_IOC_WAIT_ANY, count, objs, owner, index);
+ return wait_objs(fd, NTSYNC_IOC_WAIT_ANY, count, objs, owner, 0, index);
}

static int wait_all(int fd, __u32 count, const int *objs, __u32 owner, __u32 *index)
{
- return wait_objs(fd, NTSYNC_IOC_WAIT_ALL, count, objs, owner, index);
+ return wait_objs(fd, NTSYNC_IOC_WAIT_ALL, count, objs, owner, 0, index);
+}
+
+static int wait_any_alert(int fd, __u32 count, const int *objs,
+ __u32 owner, int alert, __u32 *index)
+{
+ return wait_objs(fd, NTSYNC_IOC_WAIT_ANY,
+ count, objs, owner, alert, index);
+}
+
+static int wait_all_alert(int fd, __u32 count, const int *objs,
+ __u32 owner, int alert, __u32 *index)
+{
+ return wait_objs(fd, NTSYNC_IOC_WAIT_ALL,
+ count, objs, owner, alert, index);
}

TEST(semaphore_state)
@@ -1095,4 +1110,162 @@ TEST(wake_all)
close(fd);
}

+TEST(alert_any)
+{
+ struct ntsync_event_args event_args = {0};
+ struct ntsync_sem_args sem_args = {0};
+ __u32 index, count, signaled;
+ int objs[2], fd, ret;
+
+ fd = open("/dev/ntsync", O_CLOEXEC | O_RDONLY);
+ ASSERT_LE(0, fd);
+
+ sem_args.count = 0;
+ sem_args.max = 2;
+ sem_args.sem = 0xdeadbeef;
+ ret = ioctl(fd, NTSYNC_IOC_CREATE_SEM, &sem_args);
+ EXPECT_EQ(0, ret);
+ EXPECT_NE(0xdeadbeef, sem_args.sem);
+ objs[0] = sem_args.sem;
+
+ sem_args.count = 1;
+ sem_args.max = 2;
+ sem_args.sem = 0xdeadbeef;
+ ret = ioctl(fd, NTSYNC_IOC_CREATE_SEM, &sem_args);
+ EXPECT_EQ(0, ret);
+ EXPECT_NE(0xdeadbeef, sem_args.sem);
+ objs[1] = sem_args.sem;
+
+ event_args.manual = true;
+ event_args.signaled = true;
+ ret = ioctl(fd, NTSYNC_IOC_CREATE_EVENT, &event_args);
+ EXPECT_EQ(0, ret);
+
+ ret = wait_any_alert(fd, 0, NULL, 123, event_args.event, &index);
+ EXPECT_EQ(0, ret);
+ EXPECT_EQ(0, index);
+
+ ret = ioctl(event_args.event, NTSYNC_IOC_EVENT_RESET, &signaled);
+ EXPECT_EQ(0, ret);
+
+ ret = wait_any_alert(fd, 0, NULL, 123, event_args.event, &index);
+ EXPECT_EQ(-1, ret);
+ EXPECT_EQ(ETIMEDOUT, errno);
+
+ ret = ioctl(event_args.event, NTSYNC_IOC_EVENT_SET, &signaled);
+ EXPECT_EQ(0, ret);
+
+ ret = wait_any_alert(fd, 2, objs, 123, event_args.event, &index);
+ EXPECT_EQ(0, ret);
+ EXPECT_EQ(1, index);
+
+ ret = wait_any_alert(fd, 2, objs, 123, event_args.event, &index);
+ EXPECT_EQ(0, ret);
+ EXPECT_EQ(2, index);
+
+ close(event_args.event);
+
+ /* test with an auto-reset event */
+
+ event_args.manual = false;
+ event_args.signaled = true;
+ ret = ioctl(fd, NTSYNC_IOC_CREATE_EVENT, &event_args);
+ EXPECT_EQ(0, ret);
+
+ count = 1;
+ ret = post_sem(objs[0], &count);
+ EXPECT_EQ(0, ret);
+
+ ret = wait_any_alert(fd, 2, objs, 123, event_args.event, &index);
+ EXPECT_EQ(0, ret);
+ EXPECT_EQ(0, index);
+
+ ret = wait_any_alert(fd, 2, objs, 123, event_args.event, &index);
+ EXPECT_EQ(0, ret);
+ EXPECT_EQ(2, index);
+
+ ret = wait_any_alert(fd, 2, objs, 123, event_args.event, &index);
+ EXPECT_EQ(-1, ret);
+ EXPECT_EQ(ETIMEDOUT, errno);
+
+ close(event_args.event);
+
+ close(objs[0]);
+ close(objs[1]);
+
+ close(fd);
+}
+
+TEST(alert_all)
+{
+ struct ntsync_event_args event_args = {0};
+ struct ntsync_sem_args sem_args = {0};
+ __u32 index, count, signaled;
+ int objs[2], fd, ret;
+
+ fd = open("/dev/ntsync", O_CLOEXEC | O_RDONLY);
+ ASSERT_LE(0, fd);
+
+ sem_args.count = 2;
+ sem_args.max = 2;
+ sem_args.sem = 0xdeadbeef;
+ ret = ioctl(fd, NTSYNC_IOC_CREATE_SEM, &sem_args);
+ EXPECT_EQ(0, ret);
+ EXPECT_NE(0xdeadbeef, sem_args.sem);
+ objs[0] = sem_args.sem;
+
+ sem_args.count = 1;
+ sem_args.max = 2;
+ sem_args.sem = 0xdeadbeef;
+ ret = ioctl(fd, NTSYNC_IOC_CREATE_SEM, &sem_args);
+ EXPECT_EQ(0, ret);
+ EXPECT_NE(0xdeadbeef, sem_args.sem);
+ objs[1] = sem_args.sem;
+
+ event_args.manual = true;
+ event_args.signaled = true;
+ ret = ioctl(fd, NTSYNC_IOC_CREATE_EVENT, &event_args);
+ EXPECT_EQ(0, ret);
+
+ ret = wait_all_alert(fd, 2, objs, 123, event_args.event, &index);
+ EXPECT_EQ(0, ret);
+ EXPECT_EQ(0, index);
+
+ ret = wait_all_alert(fd, 2, objs, 123, event_args.event, &index);
+ EXPECT_EQ(0, ret);
+ EXPECT_EQ(2, index);
+
+ close(event_args.event);
+
+ /* test with an auto-reset event */
+
+ event_args.manual = false;
+ event_args.signaled = true;
+ ret = ioctl(fd, NTSYNC_IOC_CREATE_EVENT, &event_args);
+ EXPECT_EQ(0, ret);
+
+ count = 2;
+ ret = post_sem(objs[1], &count);
+ EXPECT_EQ(0, ret);
+
+ ret = wait_all_alert(fd, 2, objs, 123, event_args.event, &index);
+ EXPECT_EQ(0, ret);
+ EXPECT_EQ(0, index);
+
+ ret = wait_all_alert(fd, 2, objs, 123, event_args.event, &index);
+ EXPECT_EQ(0, ret);
+ EXPECT_EQ(2, index);
+
+ ret = wait_all_alert(fd, 2, objs, 123, event_args.event, &index);
+ EXPECT_EQ(-1, ret);
+ EXPECT_EQ(ETIMEDOUT, errno);
+
+ close(event_args.event);
+
+ close(objs[0]);
+ close(objs[1]);
+
+ close(fd);
+}
+
TEST_HARNESS_MAIN
--
2.43.0


2024-04-16 01:37:20

by Elizabeth Figura

[permalink] [raw]
Subject: [PATCH v4 24/27] selftests: ntsync: Add some tests for wakeup signaling via alerts.

Expand the alert tests to cover alerting a thread mid-wait, to test that the
relevant scheduling logic works correctly.

Signed-off-by: Elizabeth Figura <[email protected]>
---
.../testing/selftests/drivers/ntsync/ntsync.c | 62 +++++++++++++++++++
1 file changed, 62 insertions(+)

diff --git a/tools/testing/selftests/drivers/ntsync/ntsync.c b/tools/testing/selftests/drivers/ntsync/ntsync.c
index 5465a16d38b3..968874d7e325 100644
--- a/tools/testing/selftests/drivers/ntsync/ntsync.c
+++ b/tools/testing/selftests/drivers/ntsync/ntsync.c
@@ -1113,9 +1113,12 @@ TEST(wake_all)
TEST(alert_any)
{
struct ntsync_event_args event_args = {0};
+ struct ntsync_wait_args wait_args = {0};
struct ntsync_sem_args sem_args = {0};
__u32 index, count, signaled;
+ struct wait_args thread_args;
int objs[2], fd, ret;
+ pthread_t thread;

fd = open("/dev/ntsync", O_CLOEXEC | O_RDONLY);
ASSERT_LE(0, fd);
@@ -1163,6 +1166,34 @@ TEST(alert_any)
EXPECT_EQ(0, ret);
EXPECT_EQ(2, index);

+ /* test wakeup via alert */
+
+ ret = ioctl(event_args.event, NTSYNC_IOC_EVENT_RESET, &signaled);
+ EXPECT_EQ(0, ret);
+
+ wait_args.timeout = get_abs_timeout(1000);
+ wait_args.objs = (uintptr_t)objs;
+ wait_args.count = 2;
+ wait_args.owner = 123;
+ wait_args.index = 0xdeadbeef;
+ wait_args.alert = event_args.event;
+ thread_args.fd = fd;
+ thread_args.args = &wait_args;
+ thread_args.request = NTSYNC_IOC_WAIT_ANY;
+ ret = pthread_create(&thread, NULL, wait_thread, &thread_args);
+ EXPECT_EQ(0, ret);
+
+ ret = wait_for_thread(thread, 100);
+ EXPECT_EQ(ETIMEDOUT, ret);
+
+ ret = ioctl(event_args.event, NTSYNC_IOC_EVENT_SET, &signaled);
+ EXPECT_EQ(0, ret);
+
+ ret = wait_for_thread(thread, 100);
+ EXPECT_EQ(0, ret);
+ EXPECT_EQ(0, thread_args.ret);
+ EXPECT_EQ(2, wait_args.index);
+
close(event_args.event);

/* test with an auto-reset event */
@@ -1199,9 +1230,12 @@ TEST(alert_any)
TEST(alert_all)
{
struct ntsync_event_args event_args = {0};
+ struct ntsync_wait_args wait_args = {0};
struct ntsync_sem_args sem_args = {0};
+ struct wait_args thread_args;
__u32 index, count, signaled;
int objs[2], fd, ret;
+ pthread_t thread;

fd = open("/dev/ntsync", O_CLOEXEC | O_RDONLY);
ASSERT_LE(0, fd);
@@ -1235,6 +1269,34 @@ TEST(alert_all)
EXPECT_EQ(0, ret);
EXPECT_EQ(2, index);

+ /* test wakeup via alert */
+
+ ret = ioctl(event_args.event, NTSYNC_IOC_EVENT_RESET, &signaled);
+ EXPECT_EQ(0, ret);
+
+ wait_args.timeout = get_abs_timeout(1000);
+ wait_args.objs = (uintptr_t)objs;
+ wait_args.count = 2;
+ wait_args.owner = 123;
+ wait_args.index = 0xdeadbeef;
+ wait_args.alert = event_args.event;
+ thread_args.fd = fd;
+ thread_args.args = &wait_args;
+ thread_args.request = NTSYNC_IOC_WAIT_ALL;
+ ret = pthread_create(&thread, NULL, wait_thread, &thread_args);
+ EXPECT_EQ(0, ret);
+
+ ret = wait_for_thread(thread, 100);
+ EXPECT_EQ(ETIMEDOUT, ret);
+
+ ret = ioctl(event_args.event, NTSYNC_IOC_EVENT_SET, &signaled);
+ EXPECT_EQ(0, ret);
+
+ ret = wait_for_thread(thread, 100);
+ EXPECT_EQ(0, ret);
+ EXPECT_EQ(0, thread_args.ret);
+ EXPECT_EQ(2, wait_args.index);
+
close(event_args.event);

/* test with an auto-reset event */
--
2.43.0


2024-04-16 02:14:27

by Randy Dunlap

[permalink] [raw]
Subject: Re: [PATCH v4 27/27] docs: ntsync: Add documentation for the ntsync uAPI.



On 4/15/24 6:08 PM, Elizabeth Figura wrote:
> Add an overall explanation of the driver architecture, and complete and precise
> specification for its intended behaviour.
>
> Reviewed-by: Bagas Sanjaya <[email protected]>
> Signed-off-by: Elizabeth Figura <[email protected]>

Tested-by: Randy Dunlap <[email protected]>

Thanks.

> ---
> Documentation/userspace-api/index.rst | 1 +
> Documentation/userspace-api/ntsync.rst | 399 +++++++++++++++++++++++++
> 2 files changed, 400 insertions(+)
> create mode 100644 Documentation/userspace-api/ntsync.rst

--
#Randy
https://people.kernel.org/tglx/notes-about-netiquette
https://subspace.kernel.org/etiquette.html

2024-04-16 08:14:57

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v4 00/30] NT synchronization primitive driver

On Mon, Apr 15, 2024 at 08:08:10PM -0500, Elizabeth Figura wrote:
> This patch series implements a new char misc driver, /dev/ntsync, which is used
> to implement Windows NT synchronization primitives.

This patch series does not apply to anything I have at hand. Nor does it
state anything explicit to put it on top of.


> Hence I would like to request review from someone familiar with locking to make
> sure that the usage of low-level kernel primitives is correct and that the wait
> queues work as intended, and to that end I've CC'd the locking maintainers.

I am sadly very limited atm, but I'll try and read through it. If only I
could apply...



> == Patches ==
>
> The intended semantics of the patches are broadly intended to match those of the
> corresponding Windows functions. For those not already familiar with the Windows
> functions (or their undocumented behaviour), patch 27/27 provides a detailed
> specification, and individual patches also include a brief description of the
> API they are implementing.
>
> The patches making use of this driver in Wine can be retrieved or browsed here:
>
> https://repo.or.cz/wine/zf.git/shortlog/refs/heads/ntsync5

I don't support GE has it in his builds? Last time I tried, building
Wine was a bit of a pain.


> Some aspects of the implementation may deserve particular comment:
>
> * In the interest of performance, each object is governed only by a single
> spinlock. However, NTSYNC_IOC_WAIT_ALL requires that the state of multiple
> objects be changed as a single atomic operation. In order to achieve this, we
> first take a device-wide lock ("wait_all_lock") any time we are going to lock
> more than one object at a time.
>
> The maximum number of objects that can be used in a vectored wait, and
> therefore the maximum that can be locked simultaneously, is 64. This number is
> NT's own limit.
>
> The acquisition of multiple spinlocks will degrade performance. This is a
> conscious choice, however. Wait-for-all is known to be a very rare operation
> in practice, especially with counts that approach the maximum, and it is the
> intent of the ntsync driver to optimize wait-for-any at the expense of
> wait-for-all as much as possible.

Per the example of percpu-rwsem, it would be possible to create a
mutex-spinlock hybrid scheme, where single locks are spinlocks while
held, but can block when the global thing is pending. And the global
lock is always mutex like.

If all that is worth it, I don't know. Nesting 64 spinlocks doesn't give
me warm and fuzzy feelings though.

2024-04-16 09:02:58

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH v4 00/30] NT synchronization primitive driver

On Tue, Apr 16, 2024 at 10:14:21AM +0200, Peter Zijlstra wrote:
> On Mon, Apr 15, 2024 at 08:08:10PM -0500, Elizabeth Figura wrote:
> > This patch series implements a new char misc driver, /dev/ntsync, which is used
> > to implement Windows NT synchronization primitives.
>
> This patch series does not apply to anything I have at hand. Nor does it
> state anything explicit to put it on top of.

Should work on linux-next as I took a few of the original commits from
the last series already.

thanks,

greg k-h

2024-04-16 15:50:45

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v4 00/30] NT synchronization primitive driver

On Tue, Apr 16, 2024 at 10:14:21AM +0200, Peter Zijlstra wrote:

> > Some aspects of the implementation may deserve particular comment:
> >
> > * In the interest of performance, each object is governed only by a single
> > spinlock. However, NTSYNC_IOC_WAIT_ALL requires that the state of multiple
> > objects be changed as a single atomic operation. In order to achieve this, we
> > first take a device-wide lock ("wait_all_lock") any time we are going to lock
> > more than one object at a time.
> >
> > The maximum number of objects that can be used in a vectored wait, and
> > therefore the maximum that can be locked simultaneously, is 64. This number is
> > NT's own limit.

AFAICT:

spin_lock(&dev->wait_all_lock);
list_for_each_entry(entry, &obj->all_waiters, node)
for (i=0; i<count; i++)
spin_lock_nest_lock(q->entries[i].obj->lock, &dev->wait_all_lock);

Where @count <= NTSYNC_MAX_WAIT_COUNT.

So while this nests at most 65 spinlocks, there is no actual bound on
the amount of nested lock sections in total. That is, all_waiters list
can be grown without limits.

Can we pretty please make wait_all_lock a mutex ?

> > The acquisition of multiple spinlocks will degrade performance. This is a
> > conscious choice, however. Wait-for-all is known to be a very rare operation
> > in practice, especially with counts that approach the maximum, and it is the
> > intent of the ntsync driver to optimize wait-for-any at the expense of
> > wait-for-all as much as possible.

Typical sane usage is a good guide for performance, but you must not
forget about malicious userspace and what they can do on purpose to mess
you up.


Anyway, let me stare more at all this....

2024-04-16 16:00:04

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v4 00/30] NT synchronization primitive driver

On Tue, Apr 16, 2024 at 05:50:14PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 16, 2024 at 10:14:21AM +0200, Peter Zijlstra wrote:
>
> > > Some aspects of the implementation may deserve particular comment:
> > >
> > > * In the interest of performance, each object is governed only by a single
> > > spinlock. However, NTSYNC_IOC_WAIT_ALL requires that the state of multiple
> > > objects be changed as a single atomic operation. In order to achieve this, we
> > > first take a device-wide lock ("wait_all_lock") any time we are going to lock
> > > more than one object at a time.
> > >
> > > The maximum number of objects that can be used in a vectored wait, and
> > > therefore the maximum that can be locked simultaneously, is 64. This number is
> > > NT's own limit.
>
> AFAICT:
>
> spin_lock(&dev->wait_all_lock);
> list_for_each_entry(entry, &obj->all_waiters, node)
> for (i=0; i<count; i++)
> spin_lock_nest_lock(q->entries[i].obj->lock, &dev->wait_all_lock);
>
> Where @count <= NTSYNC_MAX_WAIT_COUNT.
>
> So while this nests at most 65 spinlocks, there is no actual bound on
> the amount of nested lock sections in total. That is, all_waiters list
> can be grown without limits.
>
> Can we pretty please make wait_all_lock a mutex ?

Hurmph, it's worse, you do that list walk while holding some obj->lock
spinlokc too. Still need to figure out how all that works....

2024-04-16 16:06:15

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v4 00/30] NT synchronization primitive driver

On Mon, Apr 15, 2024 at 08:08:10PM -0500, Elizabeth Figura wrote:

> The intended semantics of the patches are broadly intended to match those of the
> corresponding Windows functions. For those not already familiar with the Windows
> functions (or their undocumented behaviour), patch 27/27 provides a detailed
> specification, and individual patches also include a brief description of the
> API they are implementing.

You happen to have a readable copy of patch 27 around? RST is utter
garbage to read :/

2024-04-16 17:26:37

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v4 00/30] NT synchronization primitive driver

On Tue, Apr 16, 2024 at 05:53:45PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 16, 2024 at 05:50:14PM +0200, Peter Zijlstra wrote:
> > On Tue, Apr 16, 2024 at 10:14:21AM +0200, Peter Zijlstra wrote:
> >
> > > > Some aspects of the implementation may deserve particular comment:
> > > >
> > > > * In the interest of performance, each object is governed only by a single
> > > > spinlock. However, NTSYNC_IOC_WAIT_ALL requires that the state of multiple
> > > > objects be changed as a single atomic operation. In order to achieve this, we
> > > > first take a device-wide lock ("wait_all_lock") any time we are going to lock
> > > > more than one object at a time.
> > > >
> > > > The maximum number of objects that can be used in a vectored wait, and
> > > > therefore the maximum that can be locked simultaneously, is 64. This number is
> > > > NT's own limit.
> >
> > AFAICT:
> >
> > spin_lock(&dev->wait_all_lock);
> > list_for_each_entry(entry, &obj->all_waiters, node)
> > for (i=0; i<count; i++)
> > spin_lock_nest_lock(q->entries[i].obj->lock, &dev->wait_all_lock);
> >
> > Where @count <= NTSYNC_MAX_WAIT_COUNT.
> >
> > So while this nests at most 65 spinlocks, there is no actual bound on
> > the amount of nested lock sections in total. That is, all_waiters list
> > can be grown without limits.
> >
> > Can we pretty please make wait_all_lock a mutex ?
>
> Hurmph, it's worse, you do that list walk while holding some obj->lock
> spinlokc too. Still need to figure out how all that works....

So the point of having that other lock around is so that things like:

try_wake_all_obj(dev, sem)
try_wake_any_sem(sem)

are done under the same lock?

Where I seem to note that both those functions do that same list
iteration.

Can't you write things like:

static void try_wake_all_obj(struct nysync_device *dev,
struct ntsync_obj *obj,
void (*wake_obj)(struct ntsync_obj *obj))
{
list_for_each_entry(entry, &obj->all_waiters, node) {
spin_lock(&obj->lock);
try_wake_all(dev, event->q, obj);
wake_obj(obj);
spin_unlock(&obj->lock);
}
}

And then instead of the above, write:

try_wake_all_obj(dev, sem, wake_sem);

[[ Also, should not something like try_wake_any_sem -- wake_sem in the
above -- have something like:

WARN_ON_ONCE(sem->type != NTSYNC_TYPE_SEM);
]]



2024-04-16 21:18:38

by Elizabeth Figura

[permalink] [raw]
Subject: Re: [PATCH v4 00/30] NT synchronization primitive driver

On Tuesday, 16 April 2024 11:19:17 CDT Peter Zijlstra wrote:
> On Tue, Apr 16, 2024 at 05:53:45PM +0200, Peter Zijlstra wrote:
> > On Tue, Apr 16, 2024 at 05:50:14PM +0200, Peter Zijlstra wrote:
> > > On Tue, Apr 16, 2024 at 10:14:21AM +0200, Peter Zijlstra wrote:
> > > > > Some aspects of the implementation may deserve particular comment:
> > > > >
> > > > > * In the interest of performance, each object is governed only by a
> > > > > single
> > > > >
> > > > > spinlock. However, NTSYNC_IOC_WAIT_ALL requires that the state of
> > > > > multiple
> > > > > objects be changed as a single atomic operation. In order to
> > > > > achieve this, we first take a device-wide lock ("wait_all_lock")
> > > > > any time we are going to lock more than one object at a time.
> > > > >
> > > > > The maximum number of objects that can be used in a vectored wait,
> > > > > and
> > > > > therefore the maximum that can be locked simultaneously, is 64.
> > > > > This number is NT's own limit.
> > >
> > > AFAICT:
> > > spin_lock(&dev->wait_all_lock);
> > >
> > > list_for_each_entry(entry, &obj->all_waiters, node)
> > >
> > > for (i=0; i<count; i++)
> > >
> > > spin_lock_nest_lock(q->entries[i].obj->lock,
> > > &dev->wait_all_lock);
> > >
> > > Where @count <= NTSYNC_MAX_WAIT_COUNT.
> > >
> > > So while this nests at most 65 spinlocks, there is no actual bound on
> > > the amount of nested lock sections in total. That is, all_waiters list
> > > can be grown without limits.
> > >
> > > Can we pretty please make wait_all_lock a mutex ?

That should be fine, at least.

> > Hurmph, it's worse, you do that list walk while holding some obj->lock
> > spinlokc too. Still need to figure out how all that works....
>
> So the point of having that other lock around is so that things like:
>
> try_wake_all_obj(dev, sem)
> try_wake_any_sem(sem)
>
> are done under the same lock?

The point of having the other lock around is that try_wake_all() needs to lock
multiple objects at the same time. It's a way of avoiding lock inversion.

Consider task A does a wait-for-all on objects X, Y, Z. Then task B signals Y,
so we do try_wake_all_obj() on Y, which does try_wake_all() on A's queue
entry; that needs to check X and Z and consume the state of all three objects
atomically. Another task could be trying to signal Z at the same time and
could hit a task waiting on Z, Y, X, and that causes inversion.

The simple and easy way to implement everything is just to have a global lock
on the whole device, but this is kind of known to be a performance bottleneck
(this was NT's BKL, and they ditched it starting with Vista or 7 or
something).

Instead we use a lock per object, and normally in the wait-for-any case we
only ever need to grab one lock at a time, but when we need to do a wait-for-
all we need to lock multiple objects at once, and we grab the outer lock to
avoid potential lock inversion.

> Where I seem to note that both those functions do that same list
> iteration.

Over different lists. I don't know if there's a better way to name things to
make that clearer.

There's the "any" wait queue, which tasks which do a wait-for-any add
themselves to, and the "all" wait queue, which tasks that do a wait-for-all
add themselves to. Signaling an object could potentially wake up either one,
but checking whether a task is eligible is a different process.



2024-04-16 21:18:48

by Elizabeth Figura

[permalink] [raw]
Subject: Re: [PATCH v4 00/30] NT synchronization primitive driver

On Tuesday, 16 April 2024 11:05:53 CDT Peter Zijlstra wrote:
> On Mon, Apr 15, 2024 at 08:08:10PM -0500, Elizabeth Figura wrote:
> > The intended semantics of the patches are broadly intended to match those
> > of the corresponding Windows functions. For those not already familiar
> > with the Windows functions (or their undocumented behaviour), patch 27/27
> > provides a detailed specification, and individual patches also include a
> > brief description of the API they are implementing.
>
> You happen to have a readable copy of patch 27 around? RST is utter
> garbage to read :/

An HTML copy should be available here now (thanks Arek):

https://ivyl.gg/ntsync/ntsync.html

Let me know if that's good enough or if I should try to render it into plain
text somehow.



2024-04-16 21:18:57

by Elizabeth Figura

[permalink] [raw]
Subject: Re: [PATCH v4 00/30] NT synchronization primitive driver

On Tuesday, 16 April 2024 03:14:21 CDT Peter Zijlstra wrote:
> On Mon, Apr 15, 2024 at 08:08:10PM -0500, Elizabeth Figura wrote:
> > This patch series implements a new char misc driver, /dev/ntsync, which is
> > used to implement Windows NT synchronization primitives.
>
> This patch series does not apply to anything I have at hand. Nor does it
> state anything explicit to put it on top of.

It was written to apply against the 'char-misc-next' branch of gregkh/char-
misc.git. I'll make a note of that next time, sorry for the inconvenience.

> > Hence I would like to request review from someone familiar with locking to
> > make sure that the usage of low-level kernel primitives is correct and
> > that the wait queues work as intended, and to that end I've CC'd the
> > locking maintainers.
> I am sadly very limited atm, but I'll try and read through it. If only I
> could apply...
>
> > == Patches ==
> >
> > The intended semantics of the patches are broadly intended to match those
> > of the corresponding Windows functions. For those not already familiar
> > with the Windows functions (or their undocumented behaviour), patch 27/27
> > provides a detailed specification, and individual patches also include a
> > brief description of the API they are implementing.
> >
> > The patches making use of this driver in Wine can be retrieved or browsed
here:
> > https://repo.or.cz/wine/zf.git/shortlog/refs/heads/ntsync5
>
> I don't support GE has it in his builds? Last time I tried, building
> Wine was a bit of a pain.

It doesn't seem so. I tried to build a GE-compatible ntsync build, uploaded
here (thanks Arek for hosting):

https://f002.backblazeb2.com/file/wine-ntsync/ntsync-wine.tar.xz

> > Some aspects of the implementation may deserve particular comment:
> >
> > * In the interest of performance, each object is governed only by a single
> >
> > spinlock. However, NTSYNC_IOC_WAIT_ALL requires that the state of
> > multiple
> > objects be changed as a single atomic operation. In order to achieve
> > this, we first take a device-wide lock ("wait_all_lock") any time we
> > are going to lock more than one object at a time.
> >
> > The maximum number of objects that can be used in a vectored wait, and
> > therefore the maximum that can be locked simultaneously, is 64. This
> > number is NT's own limit.
> >
> > The acquisition of multiple spinlocks will degrade performance. This is
> > a
> > conscious choice, however. Wait-for-all is known to be a very rare
> > operation in practice, especially with counts that approach the
> > maximum, and it is the intent of the ntsync driver to optimize
> > wait-for-any at the expense of wait-for-all as much as possible.
>
> Per the example of percpu-rwsem, it would be possible to create a
> mutex-spinlock hybrid scheme, where single locks are spinlocks while
> held, but can block when the global thing is pending. And the global
> lock is always mutex like.
>
> If all that is worth it, I don't know. Nesting 64 spinlocks doesn't give
> me warm and fuzzy feelings though.

Is the concern about poor performance when ntsync is in use, or is nesting a
lot of spinlocks like that something that could cause problems for unrelated
tasks? I'm not familiar enough with the scheduler to know if this can be
abused.

I think we don't care about performance problems within Wine, at least. FWIW,
the scheme here is actually similar to what Windows does (as described by one
of their kernel engineers), although slightly different. NT nests spinlocks as
well, but instead of using the outer lock like our "wait_all_lock" to prevent
lock inversion, they instead sort the inner locks (by address, I assume).

If there's deeper problems... I can look into (ab)using a rwlock for this
purpose, at least for now.

In any case making wait_all_lock into a sleeping mutex instead of a spinlock
should be fine. I'll rerun performance tests but I don't expect it to cause
any problems.



2024-04-16 22:20:43

by Elizabeth Figura

[permalink] [raw]
Subject: Re: [PATCH v4 00/30] NT synchronization primitive driver

On Tuesday, 16 April 2024 16:18:24 CDT Elizabeth Figura wrote:
> On Tuesday, 16 April 2024 03:14:21 CDT Peter Zijlstra wrote:
> > I don't support GE has it in his builds? Last time I tried, building
> > Wine was a bit of a pain.
>
> It doesn't seem so. I tried to build a GE-compatible ntsync build, uploaded
> here (thanks Arek for hosting):
>
> https://f002.backblazeb2.com/file/wine-ntsync/ntsync-wine.tar.xz

Oops, the initial version I uploaded had broken paths. Should be fixed now.

(It's also broken on an unpatched kernel unless explicitly disabled with
WINE_DISABLE_FAST_SYNC=1. Not sure what I messed up there—it should fall back
cleanly—but hopefully shouldn't be too important for testing.)



2024-04-17 05:25:06

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v4 00/30] NT synchronization primitive driver

On Tue, Apr 16, 2024 at 04:18:17PM -0500, Elizabeth Figura wrote:

> Over different lists. I don't know if there's a better way to name things to
> make that clearer.

D'oh, reading hard. I'll stare more.

2024-04-17 05:25:17

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v4 00/30] NT synchronization primitive driver

On Tue, Apr 16, 2024 at 04:18:19PM -0500, Elizabeth Figura wrote:

> Let me know if that's good enough or if I should try to render it into plain
> text somehow.

Plain text is much preferred. I'm more of a text editor kinda guy --
being a programmer and all that.
>
>

2024-04-17 05:30:04

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v4 00/30] NT synchronization primitive driver

On Tue, Apr 16, 2024 at 04:18:24PM -0500, Elizabeth Figura wrote:

> Is the concern about poor performance when ntsync is in use, or is nesting a
> lot of spinlocks like that something that could cause problems for unrelated
> tasks? I'm not familiar enough with the scheduler to know if this can be
> abused.

The problem is keeping preemption disabled for potentially a fairly long
time. By doing that you potentially affect the performance of other tasks.

2024-04-17 06:06:09

by Elizabeth Figura

[permalink] [raw]
Subject: Re: [PATCH v4 00/30] NT synchronization primitive driver

On Wednesday, 17 April 2024 00:22:18 CDT Peter Zijlstra wrote:
> On Tue, Apr 16, 2024 at 04:18:19PM -0500, Elizabeth Figura wrote:
> > Let me know if that's good enough or if I should try to render it into
> > plain text somehow.
>
> Plain text is much preferred. I'm more of a text editor kinda guy --
> being a programmer and all that.

I can certainly sympathize with that ;-)

Here's a (slightly ad-hoc) simplification of the patch into text form inlined
into this message; hopefully it's readable enough.


===================================
NT synchronization primitive driver
===================================

This page documents the user-space API for the ntsync driver.

ntsync is a support driver for emulation of NT synchronization
primitives by user-space NT emulators. It exists because implementation
in user-space, using existing tools, cannot match Windows performance
while offering accurate semantics. It is implemented entirely in
software, and does not drive any hardware device.

This interface is meant as a compatibility tool only, and should not
be used for general synchronization. Instead use generic, versatile
interfaces such as futex(2) and poll(2).

Synchronization primitives
==========================

The ntsync driver exposes three types of synchronization primitives:
semaphores, mutexes, and events.

A semaphore holds a single volatile 32-bit counter, and a static 32-bit
integer denoting the maximum value. It is considered signaled when the
counter is nonzero. The counter is decremented by one when a wait is
satisfied. Both the initial and maximum count are established when the
semaphore is created.

A mutex holds a volatile 32-bit recursion count, and a volatile 32-bit
identifier denoting its owner. A mutex is considered signaled when its
owner is zero (indicating that it is not owned). The recursion count is
incremented when a wait is satisfied, and ownership is set to the given
identifier.

A mutex also holds an internal flag denoting whether its previous owner
has died; such a mutex is said to be abandoned. Owner death is not
tracked automatically based on thread death, but rather must be
communicated using NTSYNC_IOC_MUTEX_KILL. An abandoned mutex is
inherently considered unowned.

Except for the "unowned" semantics of zero, the actual value of the
owner identifier is not interpreted by the ntsync driver at all. The
intended use is to store a thread identifier; however, the ntsync
driver does not actually validate that a calling thread provides
consistent or unique identifiers.

An event holds a volatile boolean state denoting whether it is signaled
or not. There are two types of events, auto-reset and manual-reset. An
auto-reset event is designaled when a wait is satisfied; a manual-reset
event is not. The event type is specified when the event is created.

Unless specified otherwise, all operations on an object are atomic and
totally ordered with respect to other operations on the same object.

Objects are represented by files. When all file descriptors to an
object are closed, that object is deleted.

Char device
===========

The ntsync driver creates a single char device /dev/ntsync. Each file
description opened on the device represents a unique instance intended
to back an individual NT virtual machine. Objects created by one ntsync
instance may only be used with other objects created by the same
instance.

ioctl reference
===============

All operations on the device are done through ioctls. There are four
structures used in ioctl calls::

struct ntsync_sem_args {
__u32 sem;
__u32 count;
__u32 max;
};

struct ntsync_mutex_args {
__u32 mutex;
__u32 owner;
__u32 count;
};

struct ntsync_event_args {
__u32 event;
__u32 signaled;
__u32 manual;
};

struct ntsync_wait_args {
__u64 timeout;
__u64 objs;
__u32 count;
__u32 owner;
__u32 index;
__u32 alert;
__u32 flags;
__u32 pad;
};

Depending on the ioctl, members of the structure may be used as input,
output, or not at all. All ioctls return 0 on success.

The ioctls on the device file are as follows:

. NTSYNC_IOC_CREATE_SEM

Create a semaphore object. Takes a pointer to struct ntsync_sem_args,
which is used as follows:

* sem: On output, contains a file descriptor to the created semaphore.
* count: Initial count of the semaphore.
* max: Maximum count of the semaphore.

Fails with EINVAL if `count` is greater than `max`.

. NTSYNC_IOC_CREATE_MUTEX

Create a mutex object. Takes a pointer to struct ntsync_mutex_args,
which is used as follows:

* mutex: On output, contains a file descriptor to the created mutex.
* count: Initial recursion count of the mutex.
* owner: Initial owner of the mutex.

If ``owner`` is nonzero and ``count`` is zero, or if ``owner`` is zero
and ``count`` is nonzero, the function fails with EINVAL.

. NTSYNC_IOC_CREATE_EVENT

Create an event object. Takes a pointer to struct ntsync_event_args,
which is used as follows:

* event: On output, contains a file descriptor to the created event.
* signaled: If nonzero, the event is initially signaled, otherwise
nonsignaled.
* manual: If nonzero, the event is a manual-reset event, otherwise
auto-reset.

The ioctls on the individual objects are as follows:

. NTSYNC_IOC_SEM_POST

Post to a semaphore object. Takes a pointer to a 32-bit integer,
which on input holds the count to be added to the semaphore, and on
output contains its previous count.

If adding to the semaphore's current count would raise the latter
past the semaphore's maximum count, the ioctl fails with
EOVERFLOW and the semaphore is not affected. If raising the
semaphore's count causes it to become signaled, eligible threads
waiting on this semaphore will be woken and the semaphore's count
decremented appropriately.

. NTSYNC_IOC_MUTEX_UNLOCK

Release a mutex object. Takes a pointer to struct ntsync_mutex_args,
which is used as follows:

* mutex: Ignored.
* owner: Specifies the owner trying to release this mutex.
* count: On output, contains the previous recursion count.

If ``owner`` is zero, the ioctl fails with EINVAL. If ``owner``
is not the current owner of the mutex, the ioctl fails with
EPERM.

The mutex's count will be decremented by one. If decrementing the
mutex's count causes it to become zero, the mutex is marked as
unowned and signaled, and eligible threads waiting on it will be
woken as appropriate.

. NTSYNC_IOC_SET_EVENT

Signal an event object. Takes a pointer to a 32-bit integer, which on
output contains the previous state of the event.

Eligible threads will be woken, and auto-reset events will be
designaled appropriately.

. NTSYNC_IOC_RESET_EVENT

Designal an event object. Takes a pointer to a 32-bit integer, which
on output contains the previous state of the event.

. NTSYNC_IOC_PULSE_EVENT

Wake threads waiting on an event object while leaving it in an
unsignaled state. Takes a pointer to a 32-bit integer, which on
output contains the previous state of the event.

A pulse operation can be thought of as a set followed by a reset,
performed as a single atomic operation. If two threads are waiting on
an auto-reset event which is pulsed, only one will be woken. If two
threads are waiting a manual-reset event which is pulsed, both will
be woken. However, in both cases, the event will be unsignaled
afterwards, and a simultaneous read operation will always report the
event as unsignaled.

. NTSYNC_IOC_READ_SEM

Read the current state of a semaphore object. Takes a pointer to
struct ntsync_sem_args, which is used as follows:

* sem: Ignored.
* count: On output, contains the current count of the semaphore.
* max: On output, contains the maximum count of the semaphore.

. NTSYNC_IOC_READ_MUTEX

Read the current state of a mutex object. Takes a pointer to struct
ntsync_mutex_args, which is used as follows:

* mutex: Ignored.
* owner: On output, contains the current owner of the mutex, or zero
if the mutex is not currently owned.
* count: On output, contains the current recursion count of the mutex.

If the mutex is marked as abandoned, the function fails with
EOWNERDEAD. In this case, ``count`` and ``owner`` are set to zero.

. NTSYNC_IOC_READ_EVENT

Read the current state of an event object. Takes a pointer to struct
ntsync_event_args, which is used as follows:

* event: Ignored.
* signaled: On output, contains the current state of the event.
* manual: On output, contains 1 if the event is a manual-reset event,
and 0 otherwise.

. NTSYNC_IOC_KILL_OWNER

Mark a mutex as unowned and abandoned if it is owned by the given
owner. Takes an input-only pointer to a 32-bit integer denoting the
owner. If the owner is zero, the ioctl fails with EINVAL. If the
owner does not own the mutex, the function fails with EPERM.

Eligible threads waiting on the mutex will be woken as appropriate
(and such waits will fail with EOWNERDEAD, as described below).

. NTSYNC_IOC_WAIT_ANY

Poll on any of a list of objects, atomically acquiring at most one.
Takes a pointer to struct ntsync_wait_args, which is used as follows:

* timeout: Absolute timeout in nanoseconds. If NTSYNC_WAIT_REALTIME
is set, the timeout is measured against the REALTIME
clock; otherwise it is measured against the MONOTONIC
clock. If the timeout is equal to or earlier than the
current time, the function returns immediately without
sleeping. If ``timeout`` is U64_MAX, the function will
sleep until an object is signaled, and will not fail
with ETIMEDOUT.

* objs: Pointer to an array of ``count`` file descriptors
(specified as an integer so that the structure has the
same size regardless of architecture). If any object is
invalid, the function fails with EINVAL.

* count: Number of objects specified in the ``objs`` array. If
greater than NTSYNC_MAX_WAIT_COUNT, the function fails
with EINVAL.

* owner: Mutex owner identifier. If any object in ``objs`` is a
mutex, the ioctl will attempt to acquire that mutex on
behalf of ``owner``. If ``owner`` is zero, the ioctl
fails with EINVAL.

* index: On success, contains the index (into ``objs``) of the
object which was signaled. If ``alert`` was signaled
instead, this contains ``count``.

* alert: Optional event object file descriptor. If nonzero, this
specifies an "alert" event object which, if signaled,
will terminate the wait. If nonzero, the identifier must
point to a valid event.

* flags: Zero or more flags. Currently the only flag is
NTSYNC_WAIT_REALTIME, which causes the timeout to be
measured against the REALTIME clock instead of
MONOTONIC.

* pad: Unused, must be set to zero.

This function attempts to acquire one of the given objects. If unable
to do so, it sleeps until an object becomes signaled, subsequently
acquiring it, or the timeout expires. In the latter case the ioctl
fails with ETIMEDOUT. The function only acquires one object, even if
multiple objects are signaled.

A semaphore is considered to be signaled if its count is nonzero, and
is acquired by decrementing its count by one. A mutex is considered
to be signaled if it is unowned or if its owner matches the ``owner``
argument, and is acquired by incrementing its recursion count by one
and setting its owner to the ``owner`` argument. An auto-reset event
is acquired by designaling it; a manual-reset event is not affected
by acquisition.

Acquisition is atomic and totally ordered with respect to other
operations on the same object. If two wait operations (with different
``owner`` identifiers) are queued on the same mutex, only one is
signaled. If two wait operations are queued on the same semaphore,
and a value of one is posted to it, only one is signaled. The order
in which threads are signaled is not specified.

If an abandoned mutex is acquired, the ioctl fails with
EOWNERDEAD. Although this is a failure return, the function may
otherwise be considered successful. The mutex is marked as owned by
the given owner (with a recursion count of 1) and as no longer
abandoned, and ``index`` is still set to the index of the mutex.

The ``alert`` argument is an "extra" event which can terminate the
wait, independently of all other objects. If members of ``objs`` and
``alert`` are both simultaneously signaled, a member of ``objs`` will
always be given priority and acquired first.

It is valid to pass the same object more than once, including by
passing the same event in the ``objs`` array and in ``alert``. If a
wakeup occurs due to that object being signaled, ``index`` is set to
the lowest index corresponding to that object.

The function may fail with EINTR if a signal is received.

. NTSYNC_IOC_WAIT_ALL

Poll on a list of objects, atomically acquiring all of them. Takes a
pointer to struct ntsync_wait_args, which is used identically to
NTSYNC_IOC_WAIT_ANY, except that ``index`` is always filled with zero
on success if not woken via alert.

This function attempts to simultaneously acquire all of the given
objects. If unable to do so, it sleeps until all objects become
simultaneously signaled, subsequently acquiring them, or the timeout
expires. In the latter case the ioctl fails with ETIMEDOUT and no
objects are modified.

Objects may become signaled and subsequently designaled (through
acquisition by other threads) while this thread is sleeping. Only
once all objects are simultaneously signaled does the ioctl acquire
them and return. The entire acquisition is atomic and totally ordered
with respect to other operations on any of the given objects.

If an abandoned mutex is acquired, the ioctl fails with
EOWNERDEAD. Similarly to NTSYNC_IOC_WAIT_ANY, all objects are
nevertheless marked as acquired. Note that if multiple mutex objects
are specified, there is no way to know which were marked as
abandoned.

As with "any" waits, the ``alert`` argument is an "extra" event which
can terminate the wait. Critically, however, an "all" wait will
succeed if all members in ``objs`` are signaled, *or* if ``alert`` is
signaled. In the latter case ``index`` will be set to ``count``. As
with "any" waits, if both conditions are filled, the former takes
priority, and objects in ``objs`` will be acquired.

Unlike NTSYNC_IOC_WAIT_ANY, it is not valid to pass the same
object more than once, nor is it valid to pass the same object in
``objs`` and in ``alert``. If this is attempted, the function fails
with EINVAL.



2024-04-17 10:02:22

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v4 00/30] NT synchronization primitive driver

On Wed, Apr 17, 2024 at 01:05:47AM -0500, Elizabeth Figura wrote:

> Here's a (slightly ad-hoc) simplification of the patch into text form inlined
> into this message; hopefully it's readable enough.

Thanks!

Still needed:

s/\`\`/"/g
s/\.\.\ //g

But then it's readable

>
> ===================================
> NT synchronization primitive driver
> ===================================
>
> This page documents the user-space API for the ntsync driver.
>
> ntsync is a support driver for emulation of NT synchronization
> primitives by user-space NT emulators. It exists because implementation
> in user-space, using existing tools, cannot match Windows performance
> while offering accurate semantics. It is implemented entirely in
> software, and does not drive any hardware device.
>
> This interface is meant as a compatibility tool only, and should not
> be used for general synchronization. Instead use generic, versatile
> interfaces such as futex(2) and poll(2).
>
> Synchronization primitives
> ==========================
>
> The ntsync driver exposes three types of synchronization primitives:
> semaphores, mutexes, and events.
>
> A semaphore holds a single volatile 32-bit counter, and a static 32-bit
> integer denoting the maximum value. It is considered signaled when the
> counter is nonzero. The counter is decremented by one when a wait is
> satisfied. Both the initial and maximum count are established when the
> semaphore is created.
>
> A mutex holds a volatile 32-bit recursion count, and a volatile 32-bit
> identifier denoting its owner. A mutex is considered signaled when its
> owner is zero (indicating that it is not owned). The recursion count is
> incremented when a wait is satisfied, and ownership is set to the given
> identifier.

'signaled' is used twice now but not defined. For both Semaphore and
Mutex this seems to indicate uncontended? Edit: seems to be needs-wakeup
more than uncontended.

> A mutex also holds an internal flag denoting whether its previous owner
> has died; such a mutex is said to be abandoned. Owner death is not
> tracked automatically based on thread death, but rather must be
> communicated using NTSYNC_IOC_MUTEX_KILL. An abandoned mutex is
> inherently considered unowned.
>
> Except for the "unowned" semantics of zero, the actual value of the
> owner identifier is not interpreted by the ntsync driver at all. The
> intended use is to store a thread identifier; however, the ntsync
> driver does not actually validate that a calling thread provides
> consistent or unique identifiers.

Why not verify it? Seems simple enough to put in a TID check, esp. if NT
mandates the same.

> An event holds a volatile boolean state denoting whether it is signaled
> or not. There are two types of events, auto-reset and manual-reset. An
> auto-reset event is designaled when a wait is satisfied; a manual-reset
> event is not. The event type is specified when the event is created.

But what is an event? I'm familiar with semaphores and mutexes, but less
so with events.

> Unless specified otherwise, all operations on an object are atomic and
> totally ordered with respect to other operations on the same object.
>
> Objects are represented by files. When all file descriptors to an
> object are closed, that object is deleted.
>
> Char device
> ===========
>
> The ntsync driver creates a single char device /dev/ntsync. Each file
> description opened on the device represents a unique instance intended
> to back an individual NT virtual machine. Objects created by one ntsync
> instance may only be used with other objects created by the same
> instance.
>
> ioctl reference
> ===============
>
> All operations on the device are done through ioctls. There are four
> structures used in ioctl calls::
>
> struct ntsync_sem_args {
> __u32 sem;
> __u32 count;
> __u32 max;
> };
>
> struct ntsync_mutex_args {
> __u32 mutex;
> __u32 owner;
> __u32 count;
> };
>
> struct ntsync_event_args {
> __u32 event;
> __u32 signaled;
> __u32 manual;
> };
>
> struct ntsync_wait_args {
> __u64 timeout;
> __u64 objs;
> __u32 count;
> __u32 owner;
> __u32 index;
> __u32 alert;
> __u32 flags;
> __u32 pad;
> };
>
> Depending on the ioctl, members of the structure may be used as input,
> output, or not at all. All ioctls return 0 on success.
>
> The ioctls on the device file are as follows:
>
> NTSYNC_IOC_CREATE_SEM
>
> Create a semaphore object. Takes a pointer to struct ntsync_sem_args,
> which is used as follows:
>
> * sem: On output, contains a file descriptor to the created semaphore.
> * count: Initial count of the semaphore.
> * max: Maximum count of the semaphore.
>
> Fails with EINVAL if `count` is greater than `max`.

So the implication is that @count and @max are input argument and as
such should be set before calling the ioctl()?

It would not have been weird to have the ioctl() return the fd on
success I suppose, instead of mixing input and output arguments like
this, but whatever, this works.

> NTSYNC_IOC_CREATE_MUTEX
>
> Create a mutex object. Takes a pointer to struct ntsync_mutex_args,
> which is used as follows:
>
> * mutex: On output, contains a file descriptor to the created mutex.
> * count: Initial recursion count of the mutex.
> * owner: Initial owner of the mutex.
>
> If "owner" is nonzero and "count" is zero, or if "owner" is zero
> and "count" is nonzero, the function fails with EINVAL.
>
> NTSYNC_IOC_CREATE_EVENT
>
> Create an event object. Takes a pointer to struct ntsync_event_args,
> which is used as follows:
>
> * event: On output, contains a file descriptor to the created event.
> * signaled: If nonzero, the event is initially signaled, otherwise
> nonsignaled.
> * manual: If nonzero, the event is a manual-reset event, otherwise
> auto-reset.
>

Still mystified as to what event actually is, perhaps more clues
below...

> The ioctls on the individual objects are as follows:
>
> NTSYNC_IOC_SEM_POST
>
> Post to a semaphore object. Takes a pointer to a 32-bit integer,
> which on input holds the count to be added to the semaphore, and on
> output contains its previous count.
>
> If adding to the semaphore's current count would raise the latter
> past the semaphore's maximum count, the ioctl fails with
> EOVERFLOW and the semaphore is not affected. If raising the
> semaphore's count causes it to become signaled, eligible threads
> waiting on this semaphore will be woken and the semaphore's count
> decremented appropriately.

Urg, so this is the traditional V (vrijgeven per Dijkstra, release in
English), but now 'conveniently' called POST, such that it can be
readily confused with the P operation (passering, or passing) which it
is not.

Glorious :-/

You're of course going to tell me NT did this and you can't help this
naming foible.

> NTSYNC_IOC_MUTEX_UNLOCK
>
> Release a mutex object. Takes a pointer to struct ntsync_mutex_args,
> which is used as follows:
>
> * mutex: Ignored.
> * owner: Specifies the owner trying to release this mutex.
> * count: On output, contains the previous recursion count.
>
> If "owner" is zero, the ioctl fails with EINVAL. If "owner"
> is not the current owner of the mutex, the ioctl fails with
> EPERM.

ISTR you having written elsewhere that NT actually demands mutexes to be
strictly per thread, which for the above would mandate @owner to be
current, no?

> The mutex's count will be decremented by one. If decrementing the
> mutex's count causes it to become zero, the mutex is marked as
> unowned and signaled, and eligible threads waiting on it will be
> woken as appropriate.
>
> NTSYNC_IOC_SET_EVENT
>
> Signal an event object. Takes a pointer to a 32-bit integer, which on
> output contains the previous state of the event.
>
> Eligible threads will be woken, and auto-reset events will be
> designaled appropriately.

Hmm, so the event thing is like a simple wait-wake scheme? Where the
'signaled' bit is used as the wakeup state?

> NTSYNC_IOC_RESET_EVENT
>
> Designal an event object. Takes a pointer to a 32-bit integer, which
> on output contains the previous state of the event.
>
> NTSYNC_IOC_PULSE_EVENT
>
> Wake threads waiting on an event object while leaving it in an
> unsignaled state. Takes a pointer to a 32-bit integer, which on
> output contains the previous state of the event.
>
> A pulse operation can be thought of as a set followed by a reset,
> performed as a single atomic operation. If two threads are waiting on
> an auto-reset event which is pulsed, only one will be woken. If two
> threads are waiting a manual-reset event which is pulsed, both will
> be woken. However, in both cases, the event will be unsignaled
> afterwards, and a simultaneous read operation will always report the
> event as unsignaled.

*groan*

> NTSYNC_IOC_READ_SEM
>
> Read the current state of a semaphore object. Takes a pointer to
> struct ntsync_sem_args, which is used as follows:
>
> * sem: Ignored.
> * count: On output, contains the current count of the semaphore.
> * max: On output, contains the maximum count of the semaphore.

This seems inherently racy -- what is the intended purpose of this
interface?

Specifically the moment a value is returned, either P or V operations
can change it, rendering the (as yet unused) return value incorrect.

> NTSYNC_IOC_READ_MUTEX
>
> Read the current state of a mutex object. Takes a pointer to struct
> ntsync_mutex_args, which is used as follows:
>
> * mutex: Ignored.
> * owner: On output, contains the current owner of the mutex, or zero
> if the mutex is not currently owned.
> * count: On output, contains the current recursion count of the mutex.
>
> If the mutex is marked as abandoned, the function fails with
> EOWNERDEAD. In this case, "count" and "owner" are set to zero.

Another questionable interface. I suspect you're going to be telling me
NT has them so you have to have them, but urgh.

> NTSYNC_IOC_READ_EVENT
>
> Read the current state of an event object. Takes a pointer to struct
> ntsync_event_args, which is used as follows:
>
> * event: Ignored.
> * signaled: On output, contains the current state of the event.
> * manual: On output, contains 1 if the event is a manual-reset event,
> and 0 otherwise.

I can't help but notice all those @sem, @mutex, @event 'output' members
being unused except for create. Seems like a waste to have them.

> NTSYNC_IOC_KILL_OWNER
>
> Mark a mutex as unowned and abandoned if it is owned by the given
> owner. Takes an input-only pointer to a 32-bit integer denoting the
> owner. If the owner is zero, the ioctl fails with EINVAL. If the
> owner does not own the mutex, the function fails with EPERM.
>
> Eligible threads waiting on the mutex will be woken as appropriate
> (and such waits will fail with EOWNERDEAD, as described below).

Wine will use this when it detects a thread exit I suppose.

> NTSYNC_IOC_WAIT_ANY
>
> Poll on any of a list of objects, atomically acquiring at most one.
> Takes a pointer to struct ntsync_wait_args, which is used as follows:
>
> * timeout: Absolute timeout in nanoseconds. If NTSYNC_WAIT_REALTIME
> is set, the timeout is measured against the REALTIME
> clock; otherwise it is measured against the MONOTONIC
> clock. If the timeout is equal to or earlier than the
> current time, the function returns immediately without
> sleeping. If "timeout" is U64_MAX, the function will
> sleep until an object is signaled, and will not fail
> with ETIMEDOUT.
>
> * objs: Pointer to an array of "count" file descriptors
> (specified as an integer so that the structure has the
> same size regardless of architecture). If any object is
> invalid, the function fails with EINVAL.
>
> * count: Number of objects specified in the "objs" array. If
> greater than NTSYNC_MAX_WAIT_COUNT, the function fails
> with EINVAL.
>
> * owner: Mutex owner identifier. If any object in "objs" is a
> mutex, the ioctl will attempt to acquire that mutex on
> behalf of "owner". If "owner" is zero, the ioctl
> fails with EINVAL.

Again, should that not be current? That is, why not maintain the NT
invariant and mandates TIDs and avoid the arguments in both cases?

> * index: On success, contains the index (into "objs") of the
> object which was signaled. If "alert" was signaled
> instead, this contains "count".

Could be the actual return value, no? Edit: no it cannot be because
-EOWNERDEAD case below.

>
> * alert: Optional event object file descriptor. If nonzero, this
> specifies an "alert" event object which, if signaled,
> will terminate the wait. If nonzero, the identifier must
> point to a valid event.
>
> * flags: Zero or more flags. Currently the only flag is
> NTSYNC_WAIT_REALTIME, which causes the timeout to be
> measured against the REALTIME clock instead of
> MONOTONIC.
>
> * pad: Unused, must be set to zero.
>
> This function attempts to acquire one of the given objects. If unable
> to do so, it sleeps until an object becomes signaled, subsequently
> acquiring it, or the timeout expires. In the latter case the ioctl
> fails with ETIMEDOUT. The function only acquires one object, even if
> multiple objects are signaled.

Any guarantee as to which will be acquired in case multiple are
available? [A]

> A semaphore is considered to be signaled if its count is nonzero, and
> is acquired by decrementing its count by one. A mutex is considered
> to be signaled if it is unowned or if its owner matches the "owner"
> argument, and is acquired by incrementing its recursion count by one
> and setting its owner to the "owner" argument. An auto-reset event
> is acquired by designaling it; a manual-reset event is not affected
> by acquisition.
>
> Acquisition is atomic and totally ordered with respect to other
> operations on the same object. If two wait operations (with different
> "owner" identifiers) are queued on the same mutex, only one is
> signaled. If two wait operations are queued on the same semaphore,
> and a value of one is posted to it, only one is signaled. The order
> in which threads are signaled is not specified.

Note that you do list the lack of guarantee here, but not above. I
suspect both cases are similar and guarantee nothing.

> If an abandoned mutex is acquired, the ioctl fails with
> EOWNERDEAD. Although this is a failure return, the function may
> otherwise be considered successful. The mutex is marked as owned by
> the given owner (with a recursion count of 1) and as no longer
> abandoned, and "index" is still set to the index of the mutex.

Aaah, I see, this does indeed preclude @index from being the return
value.

> The "alert" argument is an "extra" event which can terminate the
> wait, independently of all other objects. If members of "objs" and
> "alert" are both simultaneously signaled, a member of "objs" will
> always be given priority and acquired first.
>
> It is valid to pass the same object more than once, including by
> passing the same event in the "objs" array and in "alert". If a
> wakeup occurs due to that object being signaled, "index" is set to
> the lowest index corresponding to that object.

Urgh, is this an actual guarantee? This almost seems to imply that at
[A] above we can indeed guarantee the lowest indexed object is acquired
first.

> The function may fail with EINTR if a signal is received.

In which case @index must be disregarded since nothing will be acquired,
right?

So far nothing really weird, and I'm thinking futexes should be able to
do all this, no?

> NTSYNC_IOC_WAIT_ALL
>
> Poll on a list of objects, atomically acquiring all of them. Takes a
> pointer to struct ntsync_wait_args, which is used identically to
> NTSYNC_IOC_WAIT_ANY, except that "index" is always filled with zero
> on success if not woken via alert.

Whee, and this is the one weird operation that you're all struggling to
emulate, right? The atomic multi-acquire is 'hard' to do with futexes.

> This function attempts to simultaneously acquire all of the given
> objects. If unable to do so, it sleeps until all objects become
> simultaneously signaled, subsequently acquiring them, or the timeout
> expires. In the latter case the ioctl fails with ETIMEDOUT and no
> objects are modified.
>
> Objects may become signaled and subsequently designaled (through
> acquisition by other threads) while this thread is sleeping. Only
> once all objects are simultaneously signaled does the ioctl acquire
> them and return. The entire acquisition is atomic and totally ordered
> with respect to other operations on any of the given objects.
>
> If an abandoned mutex is acquired, the ioctl fails with
> EOWNERDEAD. Similarly to NTSYNC_IOC_WAIT_ANY, all objects are
> nevertheless marked as acquired. Note that if multiple mutex objects
> are specified, there is no way to know which were marked as
> abandoned.
>
> As with "any" waits, the "alert" argument is an "extra" event which
> can terminate the wait. Critically, however, an "all" wait will
> succeed if all members in "objs" are signaled, *or* if "alert" is
> signaled. In the latter case "index" will be set to "count". As
> with "any" waits, if both conditions are filled, the former takes
> priority, and objects in "objs" will be acquired.
>
> Unlike NTSYNC_IOC_WAIT_ANY, it is not valid to pass the same
> object more than once, nor is it valid to pass the same object in
> "objs" and in "alert". If this is attempted, the function fails
> with EINVAL.

OK, this all was helpful, I'll go stare at the code again.

Thanks!

2024-04-17 16:32:06

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v4 02/27] ntsync: Introduce NTSYNC_IOC_WAIT_ALL.

On Mon, Apr 15, 2024 at 08:08:12PM -0500, Elizabeth Figura wrote:
> + if (atomic_read(&sem->all_hint) > 0) {
> + spin_lock(&dev->wait_all_lock);
> + spin_lock_nest_lock(&sem->lock, &dev->wait_all_lock);
>
> + prev_count = sem->u.sem.count;
> + ret = post_sem_state(sem, args);
> + if (!ret) {
> + try_wake_all_obj(dev, sem);
> + try_wake_any_sem(sem);
> + }
>
> + spin_unlock(&sem->lock);
> + spin_unlock(&dev->wait_all_lock);
> + } else {
> + spin_lock(&sem->lock);
> +
> + prev_count = sem->u.sem.count;
> + ret = post_sem_state(sem, args);
> + if (!ret)
> + try_wake_any_sem(sem);
> +
> + spin_unlock(&sem->lock);
> + }
>
> if (!ret && put_user(prev_count, user_args))
> ret = -EFAULT;

vs.

> + /* queue ourselves */
> +
> + spin_lock(&dev->wait_all_lock);
> +
> + for (i = 0; i < args.count; i++) {
> + struct ntsync_q_entry *entry = &q->entries[i];
> + struct ntsync_obj *obj = entry->obj;
> +
> + atomic_inc(&obj->all_hint);
> +
> + /*
> + * obj->all_waiters is protected by dev->wait_all_lock rather
> + * than obj->lock, so there is no need to acquire obj->lock
> + * here.
> + */
> + list_add_tail(&entry->node, &obj->all_waiters);
> + }

This looks racy, consider:

atomic_read(all_hints) /* 0 */

spin_lock(wait_all_lock)
atomic_inc(all_hint) /* 1 */
list_add_tail()

spin_lock(sem->lock)
/* try_wake_all_obj() missing */




I've not yet thought about if this is harmful or not, but if not, it
definitely needs a comment.

Anyway, I need a break, maybe more this evening.



2024-04-17 20:03:49

by Elizabeth Figura

[permalink] [raw]
Subject: Re: [PATCH v4 00/30] NT synchronization primitive driver

On Wednesday, 17 April 2024 05:01:32 CDT Peter Zijlstra wrote:
> >
> > ===================================
> > NT synchronization primitive driver
> > ===================================
> >
> > This page documents the user-space API for the ntsync driver.
> >
> > ntsync is a support driver for emulation of NT synchronization
> > primitives by user-space NT emulators. It exists because implementation
> > in user-space, using existing tools, cannot match Windows performance
> > while offering accurate semantics. It is implemented entirely in
> > software, and does not drive any hardware device.
> >
> > This interface is meant as a compatibility tool only, and should not
> > be used for general synchronization. Instead use generic, versatile
> > interfaces such as futex(2) and poll(2).
> >
> > Synchronization primitives
> > ==========================
> >
> > The ntsync driver exposes three types of synchronization primitives:
> > semaphores, mutexes, and events.
> >
> > A semaphore holds a single volatile 32-bit counter, and a static 32-bit
> > integer denoting the maximum value. It is considered signaled when the
> > counter is nonzero. The counter is decremented by one when a wait is
> > satisfied. Both the initial and maximum count are established when the
> > semaphore is created.
> >
> > A mutex holds a volatile 32-bit recursion count, and a volatile 32-bit
> > identifier denoting its owner. A mutex is considered signaled when its
> > owner is zero (indicating that it is not owned). The recursion count is
> > incremented when a wait is satisfied, and ownership is set to the given
> > identifier.
>
> 'signaled' is used twice now but not defined. For both Semaphore and
> Mutex this seems to indicate uncontended? Edit: seems to be needs-wakeup
> more than uncontended.

Uncontended, yes, or needs-wakeup (I'm not sure what the difference
between the two is?)

> > A mutex also holds an internal flag denoting whether its previous owner
> > has died; such a mutex is said to be abandoned. Owner death is not
> > tracked automatically based on thread death, but rather must be
> > communicated using NTSYNC_IOC_MUTEX_KILL. An abandoned mutex is
> > inherently considered unowned.
> >
> > Except for the "unowned" semantics of zero, the actual value of the
> > owner identifier is not interpreted by the ntsync driver at all. The
> > intended use is to store a thread identifier; however, the ntsync
> > driver does not actually validate that a calling thread provides
> > consistent or unique identifiers.
>
> Why not verify it? Seems simple enough to put in a TID check, esp. if NT
> mandates the same.

I mostly figured it'd be simplest to leave the driver completely
agnostic, but I don't think there's any reason we can't use the real
TID for most calls.

> > An event holds a volatile boolean state denoting whether it is signaled
> > or not. There are two types of events, auto-reset and manual-reset. An
> > auto-reset event is designaled when a wait is satisfied; a manual-reset
> > event is not. The event type is specified when the event is created.
>
> But what is an event? I'm familiar with semaphores and mutexes, but less
> so with events.

It's what I'm trying to define there, a single bit of state that's
either contended or not. It acts broadly like an eventfd, in that you
can wake (write) or wait (read), but without the distinction of having
different nonzero values in the internal counter.

You could also think of it as a semaphore with a maximum count of one.
However, unlike a semaphore it also supports the "pulse" operation, and
you can also have "manual-reset" events that *don't* change state when
you wait on them (no equivalent for regular semaphores).

> > Unless specified otherwise, all operations on an object are atomic and
> > totally ordered with respect to other operations on the same object.
> >
> > Objects are represented by files. When all file descriptors to an
> > object are closed, that object is deleted.
> >
> > Char device
> > ===========
> >
> > The ntsync driver creates a single char device /dev/ntsync. Each file
> > description opened on the device represents a unique instance intended
> > to back an individual NT virtual machine. Objects created by one ntsync
> > instance may only be used with other objects created by the same
> > instance.
> >
> > ioctl reference
> > ===============
> >
> > All operations on the device are done through ioctls. There are four
> > structures used in ioctl calls::
> >
> > struct ntsync_sem_args {
> > __u32 sem;
> > __u32 count;
> > __u32 max;
> > };
> >
> > struct ntsync_mutex_args {
> > __u32 mutex;
> > __u32 owner;
> > __u32 count;
> > };
> >
> > struct ntsync_event_args {
> > __u32 event;
> > __u32 signaled;
> > __u32 manual;
> > };
> >
> > struct ntsync_wait_args {
> > __u64 timeout;
> > __u64 objs;
> > __u32 count;
> > __u32 owner;
> > __u32 index;
> > __u32 alert;
> > __u32 flags;
> > __u32 pad;
> > };
> >
> > Depending on the ioctl, members of the structure may be used as input,
> > output, or not at all. All ioctls return 0 on success.
> >
> > The ioctls on the device file are as follows:
> >
> > NTSYNC_IOC_CREATE_SEM
> >
> > Create a semaphore object. Takes a pointer to struct ntsync_sem_args,
> > which is used as follows:
> >
> > * sem: On output, contains a file descriptor to the created semaphore.
> > * count: Initial count of the semaphore.
> > * max: Maximum count of the semaphore.
> >
> > Fails with EINVAL if `count` is greater than `max`.
>
> So the implication is that @count and @max are input argument and as
> such should be set before calling the ioctl()?
>
> It would not have been weird to have the ioctl() return the fd on
> success I suppose, instead of mixing input and output arguments like
> this, but whatever, this works.

I think that would have been fine, and I could still change it
accordingly. The reason I didn't do that was that [1] advises against
it (although I don't know why).

[1] https://docs.kernel.org/driver-api/ioctl.html#return-code

> > The ioctls on the individual objects are as follows:
> >
> > NTSYNC_IOC_SEM_POST
> >
> > Post to a semaphore object. Takes a pointer to a 32-bit integer,
> > which on input holds the count to be added to the semaphore, and on
> > output contains its previous count.
> >
> > If adding to the semaphore's current count would raise the latter
> > past the semaphore's maximum count, the ioctl fails with
> > EOVERFLOW and the semaphore is not affected. If raising the
> > semaphore's count causes it to become signaled, eligible threads
> > waiting on this semaphore will be woken and the semaphore's count
> > decremented appropriately.
>
> Urg, so this is the traditional V (vrijgeven per Dijkstra, release in
> English), but now 'conveniently' called POST, such that it can be
> readily confused with the P operation (passering, or passing) which it
> is not.
>
> Glorious :-/
>
> You're of course going to tell me NT did this and you can't help this
> naming foible.

No, NT calls it "release" (and the operation on a mutex is also
"release" rather than "unlock".) I called it "post" after POSIX
semaphores, on the idea that it'd be more familiar to a Unix developer
(and shorter). I see I was wrong, so I'll rename it to "release".

> > NTSYNC_IOC_MUTEX_UNLOCK
> >
> > Release a mutex object. Takes a pointer to struct ntsync_mutex_args,
> > which is used as follows:
> >
> > * mutex: Ignored.
> > * owner: Specifies the owner trying to release this mutex.
> > * count: On output, contains the previous recursion count.
> >
> > If "owner" is zero, the ioctl fails with EINVAL. If "owner"
> > is not the current owner of the mutex, the ioctl fails with
> > EPERM.
>
> ISTR you having written elsewhere that NT actually demands mutexes to be
> strictly per thread, which for the above would mandate @owner to be
> current, no?

Right. We could replace owner with current everywhere except for
NTSYNC_IOC_KILL_OWNER.

> > The mutex's count will be decremented by one. If decrementing the
> > mutex's count causes it to become zero, the mutex is marked as
> > unowned and signaled, and eligible threads waiting on it will be
> > woken as appropriate.
> >
> > NTSYNC_IOC_SET_EVENT
> >
> > Signal an event object. Takes a pointer to a 32-bit integer, which on
> > output contains the previous state of the event.
> >
> > Eligible threads will be woken, and auto-reset events will be
> > designaled appropriately.
>
> Hmm, so the event thing is like a simple wait-wake scheme? Where the
> 'signaled' bit is used as the wakeup state?

Yes, exactly.

> > NTSYNC_IOC_RESET_EVENT
> >
> > Designal an event object. Takes a pointer to a 32-bit integer, which
> > on output contains the previous state of the event.
> >
> > NTSYNC_IOC_PULSE_EVENT
> >
> > Wake threads waiting on an event object while leaving it in an
> > unsignaled state. Takes a pointer to a 32-bit integer, which on
> > output contains the previous state of the event.
> >
> > A pulse operation can be thought of as a set followed by a reset,
> > performed as a single atomic operation. If two threads are waiting on
> > an auto-reset event which is pulsed, only one will be woken. If two
> > threads are waiting a manual-reset event which is pulsed, both will
> > be woken. However, in both cases, the event will be unsignaled
> > afterwards, and a simultaneous read operation will always report the
> > event as unsignaled.
>
> *groan*

Yep :D

This one is terrible, and it's the only one that Microsoft has come out
and explicitly said "don't use this". Supposedly their kernel is even
coded such that if a waiting thread gets hit by an interrupt that the
pulse will go unnoticed, although I've tried to reproduce this in
practice and been unsuccessful.

But of course it's terrible regardless, because you never know if your
thread is waiting or not. In practice it seems to usually be used on a
timer, though, so that part doesn't matter as much.

> > NTSYNC_IOC_READ_SEM
> >
> > Read the current state of a semaphore object. Takes a pointer to
> > struct ntsync_sem_args, which is used as follows:
> >
> > * sem: Ignored.
> > * count: On output, contains the current count of the semaphore.
> > * max: On output, contains the maximum count of the semaphore.
>
> This seems inherently racy -- what is the intended purpose of this
> interface?
>
> Specifically the moment a value is returned, either P or V operations
> can change it, rendering the (as yet unused) return value incorrect.

I have no idea what it's intended for. Actually it's not even exposed
as a documented API, only an undocumented one. But it does work, and
applications use it.

> > NTSYNC_IOC_READ_MUTEX
> >
> > Read the current state of a mutex object. Takes a pointer to struct
> > ntsync_mutex_args, which is used as follows:
> >
> > * mutex: Ignored.
> > * owner: On output, contains the current owner of the mutex, or zero
> > if the mutex is not currently owned.
> > * count: On output, contains the current recursion count of the mutex.
> >
> > If the mutex is marked as abandoned, the function fails with
> > EOWNERDEAD. In this case, "count" and "owner" are set to zero.
>
> Another questionable interface. I suspect you're going to be telling me
> NT has them so you have to have them, but urgh.

Unfortunately yes.

> > NTSYNC_IOC_READ_EVENT
> >
> > Read the current state of an event object. Takes a pointer to struct
> > ntsync_event_args, which is used as follows:
> >
> > * event: Ignored.
> > * signaled: On output, contains the current state of the event.
> > * manual: On output, contains 1 if the event is a manual-reset event,
> > and 0 otherwise.
>
> I can't help but notice all those @sem, @mutex, @event 'output' members
> being unused except for create. Seems like a waste to have them.

Yes, mostly so I could reuse the existing structures.

On the other hand if there's no reason not to return fds from the
create ioctls, then we could just remove those members.

> > NTSYNC_IOC_KILL_OWNER
> >
> > Mark a mutex as unowned and abandoned if it is owned by the given
> > owner. Takes an input-only pointer to a 32-bit integer denoting the
> > owner. If the owner is zero, the ioctl fails with EINVAL. If the
> > owner does not own the mutex, the function fails with EPERM.
> >
> > Eligible threads waiting on the mutex will be woken as appropriate
> > (and such waits will fail with EOWNERDEAD, as described below).
>
> Wine will use this when it detects a thread exit I suppose.

Exactly.

> > NTSYNC_IOC_WAIT_ANY
> >
> > Poll on any of a list of objects, atomically acquiring at most one.
> > Takes a pointer to struct ntsync_wait_args, which is used as follows:
> >
> > * timeout: Absolute timeout in nanoseconds. If NTSYNC_WAIT_REALTIME
> > is set, the timeout is measured against the REALTIME
> > clock; otherwise it is measured against the MONOTONIC
> > clock. If the timeout is equal to or earlier than the
> > current time, the function returns immediately without
> > sleeping. If "timeout" is U64_MAX, the function will
> > sleep until an object is signaled, and will not fail
> > with ETIMEDOUT.
> >
> > * objs: Pointer to an array of "count" file descriptors
> > (specified as an integer so that the structure has the
> > same size regardless of architecture). If any object is
> > invalid, the function fails with EINVAL.
> >
> > * count: Number of objects specified in the "objs" array. If
> > greater than NTSYNC_MAX_WAIT_COUNT, the function fails
> > with EINVAL.
> >
> > * owner: Mutex owner identifier. If any object in "objs" is a
> > mutex, the ioctl will attempt to acquire that mutex on
> > behalf of "owner". If "owner" is zero, the ioctl
> > fails with EINVAL.
>
> Again, should that not be current? That is, why not maintain the NT
> invariant and mandates TIDs and avoid the arguments in both cases?

I don't think there's any particular reason.

> > * index: On success, contains the index (into "objs") of the
> > object which was signaled. If "alert" was signaled
> > instead, this contains "count".
>
> Could be the actual return value, no? Edit: no it cannot be because
> -EOWNERDEAD case below.

Yeah. Again the advice about "only return zero from an ioctl", too.

Although we could also use a bit in the return value (which is also
kind of what NT does).

> > A semaphore is considered to be signaled if its count is nonzero, and
> > is acquired by decrementing its count by one. A mutex is considered
> > to be signaled if it is unowned or if its owner matches the "owner"
> > argument, and is acquired by incrementing its recursion count by one
> > and setting its owner to the "owner" argument. An auto-reset event
> > is acquired by designaling it; a manual-reset event is not affected
> > by acquisition.
> >
> > Acquisition is atomic and totally ordered with respect to other
> > operations on the same object. If two wait operations (with different
> > "owner" identifiers) are queued on the same mutex, only one is
> > signaled. If two wait operations are queued on the same semaphore,
> > and a value of one is posted to it, only one is signaled. The order
> > in which threads are signaled is not specified.
>
> Note that you do list the lack of guarantee here, but not above. I
> suspect both cases are similar and guarantee nothing.

There's no documented guarantee in either case, but when testing in
controlled well-ordered environments, NtWaitForMultipleObjects() always
acquires the lowest index first, and I think wakes are FIFO. I'm not
really sure why I specified the guarantee for the former but not the
latter.

> > The "alert" argument is an "extra" event which can terminate the
> > wait, independently of all other objects. If members of "objs" and
> > "alert" are both simultaneously signaled, a member of "objs" will
> > always be given priority and acquired first.
> >
> > It is valid to pass the same object more than once, including by
> > passing the same event in the "objs" array and in "alert". If a
> > wakeup occurs due to that object being signaled, "index" is set to
> > the lowest index corresponding to that object.
>
> Urgh, is this an actual guarantee? This almost seems to imply that at
> [A] above we can indeed guarantee the lowest indexed object is acquired
> first.

It's definitely legal in NT to pass the same object more than once (in
wait-for-any, not wait-for-all though), and it's definitely the case
that (at least in controlled well-ordered environments) the lowest
index is acquired first. I don't know of any application that
definitely depends on either of these, and they're not documented
behaviours, but Wine has implemented those behaviours and it would make
me nervous to break them here.

The part about passing the same event in "alert" and "objs" is not part
of NT exactly (in NT the "alert" isn't even an event; it's a special
bit of thread state; we just use an event for simplicity). I think I
specified it just to avoid coding an extra check (since it should Just
Work), while also making it clear that case was considered.

> > The function may fail with EINTR if a signal is received.
>
> In which case @index must be disregarded since nothing will be acquired,
> right?
>
> So far nothing really weird, and I'm thinking futexes should be able to
> do all this, no?

Even disregarding wait-for-all, futexes aren't really good enough. Wait
operations need to consume state, and while we could put that state in
user space (and in fact we *do* have an out of tree patch set that kind
of does this) that requires all that state to be shared across
processes, which is a problem since we want processes to be isolated
unless they explicitly share objects with each other. There's not
a scalable way to achieve this, especially since you can share objects
lazily.

You also cannot do NtPulseEvent() this way. The aforementioned patch
set badly emulates it and it does in practice break any application
that uses it. Similarly there are some applications that do a weird
"fake pulse" where they set and then immediately reset an event from
the same thread, and expect that to always wake.



2024-04-17 20:04:28

by Elizabeth Figura

[permalink] [raw]
Subject: Re: [PATCH v4 02/27] ntsync: Introduce NTSYNC_IOC_WAIT_ALL.

On Wednesday, 17 April 2024 06:37:03 CDT Peter Zijlstra wrote:
> On Mon, Apr 15, 2024 at 08:08:12PM -0500, Elizabeth Figura wrote:
> > + if (atomic_read(&sem->all_hint) > 0) {
> > + spin_lock(&dev->wait_all_lock);
> > + spin_lock_nest_lock(&sem->lock, &dev->wait_all_lock);
> >
> > + prev_count = sem->u.sem.count;
> > + ret = post_sem_state(sem, args);
> > + if (!ret) {
> > + try_wake_all_obj(dev, sem);
> > + try_wake_any_sem(sem);
> > + }
> >
> > + spin_unlock(&sem->lock);
> > + spin_unlock(&dev->wait_all_lock);
> > + } else {
> > + spin_lock(&sem->lock);
> > +
> > + prev_count = sem->u.sem.count;
> > + ret = post_sem_state(sem, args);
> > + if (!ret)
> > + try_wake_any_sem(sem);
> > +
> > + spin_unlock(&sem->lock);
> > + }
> >
> > if (!ret && put_user(prev_count, user_args))
> > ret = -EFAULT;
>
> vs.
>
> > + /* queue ourselves */
> > +
> > + spin_lock(&dev->wait_all_lock);
> > +
> > + for (i = 0; i < args.count; i++) {
> > + struct ntsync_q_entry *entry = &q->entries[i];
> > + struct ntsync_obj *obj = entry->obj;
> > +
> > + atomic_inc(&obj->all_hint);
> > +
> > + /*
> > + * obj->all_waiters is protected by dev->wait_all_lock rather
> > + * than obj->lock, so there is no need to acquire obj->lock
> > + * here.
> > + */
> > + list_add_tail(&entry->node, &obj->all_waiters);
> > + }
>
> This looks racy, consider:
>
> atomic_read(all_hints) /* 0 */
>
> spin_lock(wait_all_lock)
> atomic_inc(all_hint) /* 1 */
> list_add_tail()
>
> spin_lock(sem->lock)
> /* try_wake_all_obj() missing */
>
>
>
>
> I've not yet thought about if this is harmful or not, but if not, it
> definitely needs a comment.
>
> Anyway, I need a break, maybe more this evening.

Ach. I wrote this with the idea that the race isn't meaningful, but
looking at it again you're right—there is a harmful race here.

I think it should be fixable by moving the atomic_read inside the lock,
though.



2024-04-18 09:35:43

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v4 02/27] ntsync: Introduce NTSYNC_IOC_WAIT_ALL.

On Wed, Apr 17, 2024 at 03:03:05PM -0500, Elizabeth Figura wrote:

> Ach. I wrote this with the idea that the race isn't meaningful, but
> looking at it again you're right—there is a harmful race here.
>
> I think it should be fixable by moving the atomic_read inside the lock,
> though.

Right, I've ended up with the (as yet untested) below. I'll see if I can
find time later to actually test things.

---

--- a/drivers/misc/ntsync.c
+++ b/drivers/misc/ntsync.c
@@ -18,6 +18,7 @@
#include <linux/sched/signal.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
+#include <linux/mutex.h>
#include <uapi/linux/ntsync.h>

#define NTSYNC_NAME "ntsync"
@@ -43,6 +44,7 @@ enum ntsync_type {

struct ntsync_obj {
spinlock_t lock;
+ int dev_locked;

enum ntsync_type type;

@@ -132,7 +134,7 @@ struct ntsync_device {
* wait_all_lock is taken first whenever multiple objects must be locked
* at the same time.
*/
- spinlock_t wait_all_lock;
+ struct mutex wait_all_lock;

struct file *file;
};
@@ -157,6 +159,56 @@ static bool is_signaled(struct ntsync_ob
}

/*
+ * XXX write coherent comment on the locking
+ */
+
+static void dev_lock_obj(struct ntsync_device *dev, struct ntsync_obj *obj)
+{
+ lockdep_assert_held(&dev->wait_all_lock);
+ WARN_ON_ONCE(obj->dev != dev);
+ spin_lock(&obj->lock);
+ obj->dev_locked = 1;
+ spin_unlock(&obj->lock);
+}
+
+static void dev_unlock_obj(struct ntsync_device *dev, struct ntsync_obj *obj)
+{
+ lockdep_assert_held(&dev->wait_all_lock);
+ WARN_ON_ONCE(obj->dev != dev);
+ spin_lock(&obj->lock);
+ obj->dev_locked = 0;
+ spin_unlock(&obj->lock);
+}
+
+static void obj_lock(struct ntsync_obj *obj)
+{
+ struct ntsync_device *dev = obj->dev;
+
+ for (;;) {
+ spin_lock(&obj->lock);
+ if (likely(!obj->dev_locked))
+ break;
+
+ spin_unlock(&obj->lock);
+ mutex_lock(&dev->wait_all_lock);
+ spin_lock(&obj->lock);
+ /*
+ * obj->dev_locked should be set and released under the same
+ * wait_all_lock section, since we now own this lock, it should
+ * be clear.
+ */
+ WARN_ON_ONCE(obj->dev_locked);
+ spin_unlock(&obj->lock);
+ mutex_unlock(&dev->wait_all_lock);
+ }
+}
+
+static void obj_unlock(struct ntsync_obj *obj)
+{
+ spin_unlock(&obj->lock);
+}
+
+/*
* "locked_obj" is an optional pointer to an object which is already locked and
* should not be locked again. This is necessary so that changing an object's
* state and waking it can be a single atomic operation.
@@ -175,7 +227,7 @@ static void try_wake_all(struct ntsync_d

for (i = 0; i < count; i++) {
if (q->entries[i].obj != locked_obj)
- spin_lock_nest_lock(&q->entries[i].obj->lock, &dev->wait_all_lock);
+ dev_lock_obj(dev, q->entries[i].obj);
}

for (i = 0; i < count; i++) {
@@ -211,7 +263,7 @@ static void try_wake_all(struct ntsync_d

for (i = 0; i < count; i++) {
if (q->entries[i].obj != locked_obj)
- spin_unlock(&q->entries[i].obj->lock);
+ dev_unlock_obj(dev, q->entries[i].obj);
}
}

@@ -231,6 +283,7 @@ static void try_wake_any_sem(struct ntsy
struct ntsync_q_entry *entry;

lockdep_assert_held(&sem->lock);
+ WARN_ON_ONCE(sem->type != NTSYNC_TYPE_SEM);

list_for_each_entry(entry, &sem->any_waiters, node) {
struct ntsync_q *q = entry->q;
@@ -251,6 +304,7 @@ static void try_wake_any_mutex(struct nt
struct ntsync_q_entry *entry;

lockdep_assert_held(&mutex->lock);
+ WARN_ON_ONCE(mutex->type != NTSYNC_TYPE_MUTEX);

list_for_each_entry(entry, &mutex->any_waiters, node) {
struct ntsync_q *q = entry->q;
@@ -302,6 +356,7 @@ static int post_sem_state(struct ntsync_
__u32 sum;

lockdep_assert_held(&sem->lock);
+ WARN_ON_ONCE(sem->type != NTSYNC_TYPE_SEM);

if (check_add_overflow(sem->u.sem.count, count, &sum) ||
sum > sem->u.sem.max)
@@ -316,6 +371,7 @@ static int ntsync_sem_post(struct ntsync
struct ntsync_device *dev = sem->dev;
__u32 __user *user_args = argp;
__u32 prev_count;
+ bool all = false;
__u32 args;
int ret;

@@ -325,28 +381,27 @@ static int ntsync_sem_post(struct ntsync
if (sem->type != NTSYNC_TYPE_SEM)
return -EINVAL;

- if (atomic_read(&sem->all_hint) > 0) {
- spin_lock(&dev->wait_all_lock);
- spin_lock_nest_lock(&sem->lock, &dev->wait_all_lock);
-
- prev_count = sem->u.sem.count;
- ret = post_sem_state(sem, args);
- if (!ret) {
+ obj_lock(sem);
+ all = atomic_read(&sem->all_hint);
+ if (unlikely(all)) {
+ obj_unlock(sem);
+ mutex_lock(&dev->wait_all_lock);
+ dev_lock_obj(dev, sem);
+ }
+
+ prev_count = sem->u.sem.count;
+ ret = post_sem_state(sem, args);
+ if (!ret) {
+ if (all)
try_wake_all_obj(dev, sem);
- try_wake_any_sem(sem);
- }
+ try_wake_any_sem(sem);
+ }

- spin_unlock(&sem->lock);
- spin_unlock(&dev->wait_all_lock);
+ if (all) {
+ dev_unlock_obj(dev, sem);
+ mutex_unlock(&dev->wait_all_lock);
} else {
- spin_lock(&sem->lock);
-
- prev_count = sem->u.sem.count;
- ret = post_sem_state(sem, args);
- if (!ret)
- try_wake_any_sem(sem);
-
- spin_unlock(&sem->lock);
+ obj_unlock(sem);
}

if (!ret && put_user(prev_count, user_args))
@@ -376,6 +431,7 @@ static int ntsync_mutex_unlock(struct nt
struct ntsync_mutex_args __user *user_args = argp;
struct ntsync_device *dev = mutex->dev;
struct ntsync_mutex_args args;
+ bool all = false;
__u32 prev_count;
int ret;

@@ -387,28 +443,27 @@ static int ntsync_mutex_unlock(struct nt
if (mutex->type != NTSYNC_TYPE_MUTEX)
return -EINVAL;

- if (atomic_read(&mutex->all_hint) > 0) {
- spin_lock(&dev->wait_all_lock);
- spin_lock_nest_lock(&mutex->lock, &dev->wait_all_lock);
-
- prev_count = mutex->u.mutex.count;
- ret = unlock_mutex_state(mutex, &args);
- if (!ret) {
+ obj_lock(mutex);
+ all = atomic_read(&mutex->all_hint);
+ if (unlikely(all)) {
+ obj_unlock(mutex);
+ mutex_lock(&dev->wait_all_lock);
+ dev_lock_obj(dev, mutex);
+ }
+
+ prev_count = mutex->u.mutex.count;
+ ret = unlock_mutex_state(mutex, &args);
+ if (!ret) {
+ if (all)
try_wake_all_obj(dev, mutex);
- try_wake_any_mutex(mutex);
- }
+ try_wake_any_mutex(mutex);
+ }

- spin_unlock(&mutex->lock);
- spin_unlock(&dev->wait_all_lock);
+ if (all) {
+ dev_unlock_obj(dev, mutex);
+ mutex_unlock(&dev->wait_all_lock);
} else {
- spin_lock(&mutex->lock);
-
- prev_count = mutex->u.mutex.count;
- ret = unlock_mutex_state(mutex, &args);
- if (!ret)
- try_wake_any_mutex(mutex);
-
- spin_unlock(&mutex->lock);
+ obj_unlock(mutex);
}

if (!ret && put_user(prev_count, &user_args->count))
@@ -437,6 +492,7 @@ static int kill_mutex_state(struct ntsyn
static int ntsync_mutex_kill(struct ntsync_obj *mutex, void __user *argp)
{
struct ntsync_device *dev = mutex->dev;
+ bool all = false;
__u32 owner;
int ret;

@@ -448,26 +504,26 @@ static int ntsync_mutex_kill(struct ntsy
if (mutex->type != NTSYNC_TYPE_MUTEX)
return -EINVAL;

- if (atomic_read(&mutex->all_hint) > 0) {
- spin_lock(&dev->wait_all_lock);
- spin_lock_nest_lock(&mutex->lock, &dev->wait_all_lock);
+ obj_lock(mutex);
+ all = atomic_read(&mutex->all_hint);
+ if (unlikely(all)) {
+ obj_unlock(mutex);
+ mutex_lock(&dev->wait_all_lock);
+ dev_lock_obj(dev, mutex);
+ }

- ret = kill_mutex_state(mutex, owner);
- if (!ret) {
+ ret = kill_mutex_state(mutex, owner);
+ if (!ret) {
+ if (all)
try_wake_all_obj(dev, mutex);
- try_wake_any_mutex(mutex);
- }
+ try_wake_any_mutex(mutex);
+ }

- spin_unlock(&mutex->lock);
- spin_unlock(&dev->wait_all_lock);
+ if (all) {
+ dev_unlock_obj(dev, mutex);
+ mutex_unlock(&dev->wait_all_lock);
} else {
- spin_lock(&mutex->lock);
-
- ret = kill_mutex_state(mutex, owner);
- if (!ret)
- try_wake_any_mutex(mutex);
-
- spin_unlock(&mutex->lock);
+ obj_unlock(mutex);
}

return ret;
@@ -477,35 +533,35 @@ static int ntsync_event_set(struct ntsyn
{
struct ntsync_device *dev = event->dev;
__u32 prev_state;
+ bool all = false;

if (event->type != NTSYNC_TYPE_EVENT)
return -EINVAL;

- if (atomic_read(&event->all_hint) > 0) {
- spin_lock(&dev->wait_all_lock);
- spin_lock_nest_lock(&event->lock, &dev->wait_all_lock);
+ obj_lock(event);
+ all = atomic_read(&event->all_hint);
+ if (unlikely(all)) {
+ obj_unlock(event);
+ mutex_lock(&dev->wait_all_lock);
+ dev_lock_obj(dev, event);
+ }

- prev_state = event->u.event.signaled;
- event->u.event.signaled = true;
+ prev_state = event->u.event.signaled;
+ event->u.event.signaled = true;
+ if (all)
try_wake_all_obj(dev, event);
- try_wake_any_event(event);
- if (pulse)
- event->u.event.signaled = false;
-
- spin_unlock(&event->lock);
- spin_unlock(&dev->wait_all_lock);
+ try_wake_any_event(event);
+ if (pulse)
+ event->u.event.signaled = false;
+
+ if (all) {
+ dev_unlock_obj(dev, event);
+ mutex_unlock(&dev->wait_all_lock);
} else {
- spin_lock(&event->lock);
-
- prev_state = event->u.event.signaled;
- event->u.event.signaled = true;
- try_wake_any_event(event);
- if (pulse)
- event->u.event.signaled = false;
-
- spin_unlock(&event->lock);
+ obj_unlock(event);
}

+
if (put_user(prev_state, (__u32 __user *)argp))
return -EFAULT;

@@ -984,7 +1040,7 @@ static int ntsync_wait_all(struct ntsync

/* queue ourselves */

- spin_lock(&dev->wait_all_lock);
+ mutex_lock(&dev->wait_all_lock);

for (i = 0; i < args.count; i++) {
struct ntsync_q_entry *entry = &q->entries[i];
@@ -1004,7 +1060,7 @@ static int ntsync_wait_all(struct ntsync

try_wake_all(dev, q, NULL);

- spin_unlock(&dev->wait_all_lock);
+ mutex_unlock(&dev->wait_all_lock);

/* sleep */

@@ -1012,7 +1068,7 @@ static int ntsync_wait_all(struct ntsync

/* and finally, unqueue */

- spin_lock(&dev->wait_all_lock);
+ mutex_lock(&dev->wait_all_lock);

for (i = 0; i < args.count; i++) {
struct ntsync_q_entry *entry = &q->entries[i];
@@ -1029,7 +1085,7 @@ static int ntsync_wait_all(struct ntsync
put_obj(obj);
}

- spin_unlock(&dev->wait_all_lock);
+ mutex_unlock(&dev->wait_all_lock);

signaled = atomic_read(&q->signaled);
if (signaled != -1) {
@@ -1056,7 +1112,7 @@ static int ntsync_char_open(struct inode
if (!dev)
return -ENOMEM;

- spin_lock_init(&dev->wait_all_lock);
+ mutex_init(&dev->wait_all_lock);

file->private_data = dev;
dev->file = file;

2024-04-19 16:16:47

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v4 00/30] NT synchronization primitive driver

On Tue, Apr 16, 2024 at 05:18:56PM -0500, Elizabeth Figura wrote:
> On Tuesday, 16 April 2024 16:18:24 CDT Elizabeth Figura wrote:
> > On Tuesday, 16 April 2024 03:14:21 CDT Peter Zijlstra wrote:
> > > I don't support GE has it in his builds? Last time I tried, building
> > > Wine was a bit of a pain.
> >
> > It doesn't seem so. I tried to build a GE-compatible ntsync build, uploaded
> > here (thanks Arek for hosting):
> >
> > https://f002.backblazeb2.com/file/wine-ntsync/ntsync-wine.tar.xz
>
> Oops, the initial version I uploaded had broken paths. Should be fixed now.
>
> (It's also broken on an unpatched kernel unless explicitly disabled with
> WINE_DISABLE_FAST_SYNC=1. Not sure what I messed up there—it should fall back
> cleanly—but hopefully shouldn't be too important for testing.)

So I've tried using that wine build with lutris, and I can't get it to
start EGS or anything else.

I even added a printk to the ntsync driver for every open, to see if it
gets that far, but I'm not even getting that :/


2024-04-19 16:28:35

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v4 02/27] ntsync: Introduce NTSYNC_IOC_WAIT_ALL.

On Thu, Apr 18, 2024 at 11:35:11AM +0200, Peter Zijlstra wrote:
> On Wed, Apr 17, 2024 at 03:03:05PM -0500, Elizabeth Figura wrote:
>
> > Ach. I wrote this with the idea that the race isn't meaningful, but
> > looking at it again you're right—there is a harmful race here.
> >
> > I think it should be fixable by moving the atomic_read inside the lock,
> > though.
>
> Right, I've ended up with the (as yet untested) below. I'll see if I can
> find time later to actually test things.

Latest hackery... I tried testing this but I'm not having luck using the
patched wine as per the other email.

---
--- a/drivers/misc/ntsync.c
+++ b/drivers/misc/ntsync.c
@@ -18,6 +18,7 @@
#include <linux/sched/signal.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
+#include <linux/mutex.h>
#include <uapi/linux/ntsync.h>

#define NTSYNC_NAME "ntsync"
@@ -43,6 +44,7 @@ enum ntsync_type {

struct ntsync_obj {
spinlock_t lock;
+ int dev_locked;

enum ntsync_type type;

@@ -132,14 +134,107 @@ struct ntsync_device {
* wait_all_lock is taken first whenever multiple objects must be locked
* at the same time.
*/
- spinlock_t wait_all_lock;
+ struct mutex wait_all_lock;

struct file *file;
};

+/*
+ * Single objects are locked using obj->lock.
+ *
+ * Multiple objects are 'locked' while holding dev->wait_all_lock to avoid lock
+ * order issues. In this case however, single objects are not locked by holding
+ * obj->lock, but by setting obj->dev_locked.
+ *
+ * This means that in order to lock a single object, the sequence is slightly
+ * more complicated than usual. Specifically it needs to check obj->dev_locked
+ * after acquiring obj->lock, if set, it needs to drop the lock and acquire
+ * dev->wait_all_lock in order to serialize against the multi-object operation.
+ */
+
+static void dev_lock_obj(struct ntsync_device *dev, struct ntsync_obj *obj)
+{
+ lockdep_assert_held(&dev->wait_all_lock);
+ lockdep_assert(obj->dev == dev);
+ spin_lock(&obj->lock);
+ /*
+ * By setting obj->dev_locked inside obj->lock, it is ensured that
+ * anyone holding obj->lock must see the value.
+ */
+ obj->dev_locked = 1;
+ spin_unlock(&obj->lock);
+}
+
+static void dev_unlock_obj(struct ntsync_device *dev, struct ntsync_obj *obj)
+{
+ lockdep_assert_held(&dev->wait_all_lock);
+ lockdep_assert(obj->dev == dev);
+ spin_lock(&obj->lock);
+ obj->dev_locked = 0;
+ spin_unlock(&obj->lock);
+}
+
+static void obj_lock(struct ntsync_obj *obj)
+{
+ struct ntsync_device *dev = obj->dev;
+
+ for (;;) {
+ spin_lock(&obj->lock);
+ if (likely(!obj->dev_locked))
+ break;
+
+ spin_unlock(&obj->lock);
+ mutex_lock(&dev->wait_all_lock);
+ spin_lock(&obj->lock);
+ /*
+ * obj->dev_locked should be set and released under the same
+ * wait_all_lock section, since we now own this lock, it should
+ * be clear.
+ */
+ lockdep_assert(!obj->dev_locked);
+ spin_unlock(&obj->lock);
+ mutex_unlock(&dev->wait_all_lock);
+ }
+}
+
+static void obj_unlock(struct ntsync_obj *obj)
+{
+ spin_unlock(&obj->lock);
+}
+
+static bool ntsync_lock_obj(struct ntsync_device *dev, struct ntsync_obj *obj)
+{
+ bool all;
+
+ obj_lock(obj);
+ all = atomic_read(&obj->all_hint);
+ if (unlikely(all)) {
+ obj_unlock(obj);
+ mutex_lock(&dev->wait_all_lock);
+ dev_lock_obj(dev, obj);
+ }
+
+ return all;
+}
+
+static void ntsync_unlock_obj(struct ntsync_device *dev, struct ntsync_obj *obj, bool all)
+{
+ if (all) {
+ dev_unlock_obj(dev, obj);
+ mutex_unlock(&dev->wait_all_lock);
+ } else {
+ obj_unlock(obj);
+ }
+}
+
+#define ntsync_assert_held(obj) \
+ lockdep_assert((lockdep_is_held(&(obj)->lock) != LOCK_STATE_NOT_HELD) || \
+ ((lockdep_is_held(&(obj)->dev->wait_all_lock) != LOCK_STATE_NOT_HELD) && \
+ (obj)->dev_locked))
+
static bool is_signaled(struct ntsync_obj *obj, __u32 owner)
{
- lockdep_assert_held(&obj->lock);
+ ntsync_assert_held(obj);

switch (obj->type) {
case NTSYNC_TYPE_SEM:
@@ -171,11 +266,11 @@ static void try_wake_all(struct ntsync_d

lockdep_assert_held(&dev->wait_all_lock);
if (locked_obj)
- lockdep_assert_held(&locked_obj->lock);
+ lockdep_assert(locked_obj->dev_locked);

for (i = 0; i < count; i++) {
if (q->entries[i].obj != locked_obj)
- spin_lock_nest_lock(&q->entries[i].obj->lock, &dev->wait_all_lock);
+ dev_lock_obj(dev, q->entries[i].obj);
}

for (i = 0; i < count; i++) {
@@ -211,7 +306,7 @@ static void try_wake_all(struct ntsync_d

for (i = 0; i < count; i++) {
if (q->entries[i].obj != locked_obj)
- spin_unlock(&q->entries[i].obj->lock);
+ dev_unlock_obj(dev, q->entries[i].obj);
}
}

@@ -220,7 +315,7 @@ static void try_wake_all_obj(struct ntsy
struct ntsync_q_entry *entry;

lockdep_assert_held(&dev->wait_all_lock);
- lockdep_assert_held(&obj->lock);
+ lockdep_assert(obj->dev_locked);

list_for_each_entry(entry, &obj->all_waiters, node)
try_wake_all(dev, entry->q, obj);
@@ -230,7 +325,8 @@ static void try_wake_any_sem(struct ntsy
{
struct ntsync_q_entry *entry;

- lockdep_assert_held(&sem->lock);
+ ntsync_assert_held(sem);
+ lockdep_assert(sem->type == NTSYNC_TYPE_SEM);

list_for_each_entry(entry, &sem->any_waiters, node) {
struct ntsync_q *q = entry->q;
@@ -250,7 +346,8 @@ static void try_wake_any_mutex(struct nt
{
struct ntsync_q_entry *entry;

- lockdep_assert_held(&mutex->lock);
+ ntsync_assert_held(mutex);
+ lockdep_assert(mutex->type == NTSYNC_TYPE_MUTEX);

list_for_each_entry(entry, &mutex->any_waiters, node) {
struct ntsync_q *q = entry->q;
@@ -276,7 +373,8 @@ static void try_wake_any_event(struct nt
{
struct ntsync_q_entry *entry;

- lockdep_assert_held(&event->lock);
+ ntsync_assert_held(event);
+ lockdep_assert(event->type == NTSYNC_TYPE_EVENT);

list_for_each_entry(entry, &event->any_waiters, node) {
struct ntsync_q *q = entry->q;
@@ -302,6 +400,7 @@ static int post_sem_state(struct ntsync_
__u32 sum;

lockdep_assert_held(&sem->lock);
+ lockdep_assert(sem->type == NTSYNC_TYPE_SEM);

if (check_add_overflow(sem->u.sem.count, count, &sum) ||
sum > sem->u.sem.max)
@@ -317,6 +416,7 @@ static int ntsync_sem_post(struct ntsync
__u32 __user *user_args = argp;
__u32 prev_count;
__u32 args;
+ bool all;
int ret;

if (copy_from_user(&args, argp, sizeof(args)))
@@ -325,30 +425,18 @@ static int ntsync_sem_post(struct ntsync
if (sem->type != NTSYNC_TYPE_SEM)
return -EINVAL;

- if (atomic_read(&sem->all_hint) > 0) {
- spin_lock(&dev->wait_all_lock);
- spin_lock_nest_lock(&sem->lock, &dev->wait_all_lock);
-
- prev_count = sem->u.sem.count;
- ret = post_sem_state(sem, args);
- if (!ret) {
- try_wake_all_obj(dev, sem);
- try_wake_any_sem(sem);
- }
-
- spin_unlock(&sem->lock);
- spin_unlock(&dev->wait_all_lock);
- } else {
- spin_lock(&sem->lock);
+ all = ntsync_lock_obj(dev, sem);

- prev_count = sem->u.sem.count;
- ret = post_sem_state(sem, args);
- if (!ret)
- try_wake_any_sem(sem);
-
- spin_unlock(&sem->lock);
+ prev_count = sem->u.sem.count;
+ ret = post_sem_state(sem, args);
+ if (!ret) {
+ if (all)
+ try_wake_all_obj(dev, sem);
+ try_wake_any_sem(sem);
}

+ ntsync_unlock_obj(dev, sem, all);
+
if (!ret && put_user(prev_count, user_args))
ret = -EFAULT;

@@ -377,6 +465,7 @@ static int ntsync_mutex_unlock(struct nt
struct ntsync_device *dev = mutex->dev;
struct ntsync_mutex_args args;
__u32 prev_count;
+ bool all;
int ret;

if (copy_from_user(&args, argp, sizeof(args)))
@@ -387,30 +476,18 @@ static int ntsync_mutex_unlock(struct nt
if (mutex->type != NTSYNC_TYPE_MUTEX)
return -EINVAL;

- if (atomic_read(&mutex->all_hint) > 0) {
- spin_lock(&dev->wait_all_lock);
- spin_lock_nest_lock(&mutex->lock, &dev->wait_all_lock);
-
- prev_count = mutex->u.mutex.count;
- ret = unlock_mutex_state(mutex, &args);
- if (!ret) {
- try_wake_all_obj(dev, mutex);
- try_wake_any_mutex(mutex);
- }
+ all = ntsync_lock_obj(dev, mutex);

- spin_unlock(&mutex->lock);
- spin_unlock(&dev->wait_all_lock);
- } else {
- spin_lock(&mutex->lock);
-
- prev_count = mutex->u.mutex.count;
- ret = unlock_mutex_state(mutex, &args);
- if (!ret)
- try_wake_any_mutex(mutex);
-
- spin_unlock(&mutex->lock);
+ prev_count = mutex->u.mutex.count;
+ ret = unlock_mutex_state(mutex, &args);
+ if (!ret) {
+ if (all)
+ try_wake_all_obj(dev, mutex);
+ try_wake_any_mutex(mutex);
}

+ ntsync_unlock_obj(dev, mutex, all);
+
if (!ret && put_user(prev_count, &user_args->count))
ret = -EFAULT;

@@ -438,6 +515,7 @@ static int ntsync_mutex_kill(struct ntsy
{
struct ntsync_device *dev = mutex->dev;
__u32 owner;
+ bool all;
int ret;

if (get_user(owner, (__u32 __user *)argp))
@@ -448,28 +526,17 @@ static int ntsync_mutex_kill(struct ntsy
if (mutex->type != NTSYNC_TYPE_MUTEX)
return -EINVAL;

- if (atomic_read(&mutex->all_hint) > 0) {
- spin_lock(&dev->wait_all_lock);
- spin_lock_nest_lock(&mutex->lock, &dev->wait_all_lock);
+ all = ntsync_lock_obj(dev, mutex);

- ret = kill_mutex_state(mutex, owner);
- if (!ret) {
+ ret = kill_mutex_state(mutex, owner);
+ if (!ret) {
+ if (all)
try_wake_all_obj(dev, mutex);
- try_wake_any_mutex(mutex);
- }
-
- spin_unlock(&mutex->lock);
- spin_unlock(&dev->wait_all_lock);
- } else {
- spin_lock(&mutex->lock);
-
- ret = kill_mutex_state(mutex, owner);
- if (!ret)
- try_wake_any_mutex(mutex);
-
- spin_unlock(&mutex->lock);
+ try_wake_any_mutex(mutex);
}

+ ntsync_unlock_obj(dev, mutex, all);
+
return ret;
}

@@ -477,34 +544,22 @@ static int ntsync_event_set(struct ntsyn
{
struct ntsync_device *dev = event->dev;
__u32 prev_state;
+ bool all;

if (event->type != NTSYNC_TYPE_EVENT)
return -EINVAL;

- if (atomic_read(&event->all_hint) > 0) {
- spin_lock(&dev->wait_all_lock);
- spin_lock_nest_lock(&event->lock, &dev->wait_all_lock);
+ all = ntsync_lock_obj(dev, event);

- prev_state = event->u.event.signaled;
- event->u.event.signaled = true;
+ prev_state = event->u.event.signaled;
+ event->u.event.signaled = true;
+ if (all)
try_wake_all_obj(dev, event);
- try_wake_any_event(event);
- if (pulse)
- event->u.event.signaled = false;
+ try_wake_any_event(event);
+ if (pulse)
+ event->u.event.signaled = false;

- spin_unlock(&event->lock);
- spin_unlock(&dev->wait_all_lock);
- } else {
- spin_lock(&event->lock);
-
- prev_state = event->u.event.signaled;
- event->u.event.signaled = true;
- try_wake_any_event(event);
- if (pulse)
- event->u.event.signaled = false;
-
- spin_unlock(&event->lock);
- }
+ ntsync_unlock_obj(dev, event, all);

if (put_user(prev_state, (__u32 __user *)argp))
return -EFAULT;
@@ -984,7 +1039,7 @@ static int ntsync_wait_all(struct ntsync

/* queue ourselves */

- spin_lock(&dev->wait_all_lock);
+ mutex_lock(&dev->wait_all_lock);

for (i = 0; i < args.count; i++) {
struct ntsync_q_entry *entry = &q->entries[i];
@@ -1004,7 +1059,7 @@ static int ntsync_wait_all(struct ntsync

try_wake_all(dev, q, NULL);

- spin_unlock(&dev->wait_all_lock);
+ mutex_unlock(&dev->wait_all_lock);

/* sleep */

@@ -1012,7 +1067,7 @@ static int ntsync_wait_all(struct ntsync

/* and finally, unqueue */

- spin_lock(&dev->wait_all_lock);
+ mutex_lock(&dev->wait_all_lock);

for (i = 0; i < args.count; i++) {
struct ntsync_q_entry *entry = &q->entries[i];
@@ -1029,7 +1084,7 @@ static int ntsync_wait_all(struct ntsync
put_obj(obj);
}

- spin_unlock(&dev->wait_all_lock);
+ mutex_unlock(&dev->wait_all_lock);

signaled = atomic_read(&q->signaled);
if (signaled != -1) {
@@ -1056,7 +1111,7 @@ static int ntsync_char_open(struct inode
if (!dev)
return -ENOMEM;

- spin_lock_init(&dev->wait_all_lock);
+ mutex_init(&dev->wait_all_lock);

file->private_data = dev;
dev->file = file;

2024-04-19 20:46:30

by Elizabeth Figura

[permalink] [raw]
Subject: Re: [PATCH v4 00/30] NT synchronization primitive driver

On Friday, 19 April 2024 11:16:11 CDT Peter Zijlstra wrote:
> On Tue, Apr 16, 2024 at 05:18:56PM -0500, Elizabeth Figura wrote:
> > On Tuesday, 16 April 2024 16:18:24 CDT Elizabeth Figura wrote:
> > > On Tuesday, 16 April 2024 03:14:21 CDT Peter Zijlstra wrote:
> > > > I don't support GE has it in his builds? Last time I tried, building
> > > > Wine was a bit of a pain.
> > >
> > > It doesn't seem so. I tried to build a GE-compatible ntsync build, uploaded
> > > here (thanks Arek for hosting):
> > >
> > > https://f002.backblazeb2.com/file/wine-ntsync/ntsync-wine.tar.xz
> >
> > Oops, the initial version I uploaded had broken paths. Should be fixed now.
> >
> > (It's also broken on an unpatched kernel unless explicitly disabled with
> > WINE_DISABLE_FAST_SYNC=1. Not sure what I messed up there—it should fall back
> > cleanly—but hopefully shouldn't be too important for testing.)
>
> So I've tried using that wine build with lutris, and I can't get it to
> start EGS or anything else.
>
> I even added a printk to the ntsync driver for every open, to see if it
> gets that far, but I'm not even getting that :/

That's odd, it works for me, both as a standalone build and with
lutris...

Does /dev/ntsync exist (module is loaded) and have nonzero permissions?
I forgot to mention that's necessary, sorry.

Otherwise I can try to look at an strace, or a Wine debug log. I don't
think there's an easy way to get the latter with Lutris, but something
like `WINEDEBUG=+all ./wine winecfg 2>log` should work.