2022-03-31 05:05:49

by Avi Kivity

Subject: sched_min_granularity_ns exile into debugfs

Hi Peter,


In 8a99b683 ("sched: Move SCHED_DEBUG sysctl to debugfs"), you moved
sched_min_granularity_ns to debugfs, citing that it is debug-only (true)
and undocumented (it is documented in sched-design-CFS.rst, under
the old name).


This breaks my application, Scylla[1]. We use sched_min_granularity_ns
to reduce the chances that a high networking backlog will starve the
application thread. It is a thread-per-core design, so we can't move the
application to another core: they are all busy (and besides, the
application threads are pinned).


In addition to sched_min_granularity_ns, we also tune a few other
sysctls:


# Prevent auto-scaling from doing anything to our tunables
kernel.sched_tunable_scaling = 0

# Preempt sooner
kernel.sched_min_granularity_ns = 500000

# Don't delay unrelated workloads
kernel.sched_wakeup_granularity_ns = 450000

# Schedule all tasks in this period
kernel.sched_latency_ns = 1000000

# autogroup seems to prevent sched_latency_ns from being respected
kernel.sched_autogroup_enabled = 0

# Disable numa balancing
kernel.numa_balancing = 0


While we can adapt to the move, I would much prefer it if the old location
were restored. I think it even makes sense to make this a non-debug tunable;
it helps the application to be more responsive without using the realtime
class, which is its own can of worms (and will likely result in reduced
throughput).
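
For illustration, adapting to the move could look roughly like the sketch
below: try the old sysctl location first, then fall back to debugfs. The
debugfs directory (/sys/kernel/debug/sched) and the dropped "sched_" prefix
are assumptions about where the files ended up, and write_tunable() is a
hypothetical helper, not something Scylla ships.

```python
import os

# Assumed locations: the debugfs directory and the dropped "sched_" prefix
# are assumptions about the post-move layout, not taken from this message.
PROCFS_DIR = "/proc/sys/kernel"
DEBUGFS_DIR = "/sys/kernel/debug/sched"


def candidate_paths(name, procfs_dir=PROCFS_DIR, debugfs_dir=DEBUGFS_DIR):
    """Return the paths to try for a scheduler tunable, old location first."""
    short = name[len("sched_"):] if name.startswith("sched_") else name
    return [os.path.join(procfs_dir, name), os.path.join(debugfs_dir, short)]


def write_tunable(name, value, procfs_dir=PROCFS_DIR, debugfs_dir=DEBUGFS_DIR):
    """Write value to the first location that exists.

    Returns the path that was written, or None if neither location exists.
    """
    for path in candidate_paths(name, procfs_dir, debugfs_dir):
        if os.path.exists(path):
            with open(path, "w") as f:
                f.write(str(value))
            return path
    return None
```

Calling write_tunable("sched_min_granularity_ns", 500000) as root would then
cover both kernels, provided debugfs is mounted on the newer one. That
dependency on a mounted debugfs is part of why the old location is
preferable.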


[1] https://github.com/scylladb/scylla