This series was never applied[1], and was recently pointed out as
missing[2]. If someone has a tree for this, please take it. Otherwise,
please Ack and I'll send it to Linus.
Thanks!
-Kees
[1] https://lore.kernel.org/lkml/[email protected]/
[2] https://lore.kernel.org/lkml/[email protected]/
Elena Reshetova (3):
nsproxy: convert nsproxy.count to refcount_t
groups: convert group_info.usage to refcount_t
creds: convert cred.usage to refcount_t
include/linux/cred.h | 15 +++++++-------
include/linux/nsproxy.h | 7 +++----
kernel/cred.c | 44 ++++++++++++++++++++---------------------
kernel/groups.c | 2 +-
kernel/nsproxy.c | 6 +++---
net/sunrpc/auth.c | 2 +-
6 files changed, 38 insertions(+), 38 deletions(-)
--
2.25.1
From: Elena Reshetova <[email protected]>
atomic_t variables are currently used to implement reference
counters with the following properties:
- counter is initialized to 1 using atomic_set()
- a resource is freed upon counter reaching zero
- once counter reaches zero, its further
increments aren't allowed
- counter schema uses basic atomic operations
(set, inc, inc_not_zero, dec_and_test, etc.)
Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to use-after-free situation and be exploitable.
The variable cred.usage is used as a pure reference counter.
Convert it to refcount_t and fix up the operations.
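To illustrate the pattern (purely as an example, using a hypothetical
struct foo and helpers rather than the actual cred code), a conversion
of this kind looks roughly like:

#include <linux/refcount.h>
#include <linux/slab.h>

struct foo {
        refcount_t usage;                       /* was: atomic_t usage; */
};

static struct foo *foo_alloc(void)
{
        struct foo *f = kzalloc(sizeof(*f), GFP_KERNEL);

        if (f)
                refcount_set(&f->usage, 1);     /* was: atomic_set(&f->usage, 1); */
        return f;
}

static struct foo *foo_get(struct foo *f)
{
        refcount_inc(&f->usage);                /* was: atomic_inc(); saturates instead of wrapping */
        return f;
}

static void foo_put(struct foo *f)
{
        if (refcount_dec_and_test(&f->usage))   /* was: atomic_dec_and_test() */
                kfree(f);
}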
**Important note for maintainers:
Some functions from refcount_t API defined in lib/refcount.c
have different memory ordering guarantees than their atomic
counterparts. Please check Documentation/core-api/refcount-vs-atomic.rst
for more information.
Normally the differences should not matter since refcount_t provides
enough guarantees to satisfy the refcounting use cases, but in
some rare cases it might matter.
Please double check that you don't have some undocumented
memory guarantees for this variable usage.
For the cred.usage it might make a difference
in the following places:
- get_task_cred(): increment in refcount_inc_not_zero() only
guarantees control dependency on success vs. fully ordered
atomic counterpart
- put_cred(): decrement in refcount_dec_and_test() only
provides RELEASE ordering and ACQUIRE ordering on success
vs. fully ordered atomic counterpart
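As a minimal sketch (hypothetical names, not the actual cred code) of
the RCU-lookup pattern where the control dependency provided by
refcount_inc_not_zero() is what gets relied on:

#include <linux/rcupdate.h>
#include <linux/refcount.h>

struct foo {
        refcount_t usage;
};

static struct foo __rcu *foo_ptr;       /* hypothetical shared pointer */

static struct foo *foo_lookup(void)
{
        struct foo *f;

        rcu_read_lock();
        f = rcu_dereference(foo_ptr);
        /*
         * The increment succeeds only while the count is non-zero, so
         * an object whose last reference has already been dropped is
         * never resurrected; on success we hold a full reference.
         */
        if (f && !refcount_inc_not_zero(&f->usage))
                f = NULL;
        rcu_read_unlock();
        return f;
}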
Suggested-by: Kees Cook <[email protected]>
Signed-off-by: Elena Reshetova <[email protected]>
Reviewed-by: David Windsor <[email protected]>
Reviewed-by: Hans Liljestrand <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Kees Cook <[email protected]>
---
include/linux/cred.h | 8 ++++----
kernel/cred.c | 42 +++++++++++++++++++++---------------------
net/sunrpc/auth.c | 2 +-
3 files changed, 26 insertions(+), 26 deletions(-)
diff --git a/include/linux/cred.h b/include/linux/cred.h
index 32be0daf5a32..d624b91de095 100644
--- a/include/linux/cred.h
+++ b/include/linux/cred.h
@@ -110,7 +110,7 @@ static inline int groups_search(const struct group_info *group_info, kgid_t grp)
* same context as task->real_cred.
*/
struct cred {
- atomic_t usage;
+ refcount_t usage;
#ifdef CONFIG_DEBUG_CREDENTIALS
atomic_t subscribers; /* number of processes subscribed */
void *put_addr;
@@ -228,7 +228,7 @@ static inline bool cap_ambient_invariant_ok(const struct cred *cred)
*/
static inline struct cred *get_new_cred(struct cred *cred)
{
- atomic_inc(&cred->usage);
+ refcount_inc(&cred->usage);
return cred;
}
@@ -260,7 +260,7 @@ static inline const struct cred *get_cred_rcu(const struct cred *cred)
struct cred *nonconst_cred = (struct cred *) cred;
if (!cred)
return NULL;
- if (!atomic_inc_not_zero(&nonconst_cred->usage))
+ if (!refcount_inc_not_zero(&nonconst_cred->usage))
return NULL;
validate_creds(cred);
nonconst_cred->non_rcu = 0;
@@ -284,7 +284,7 @@ static inline void put_cred(const struct cred *_cred)
if (cred) {
validate_creds(cred);
- if (atomic_dec_and_test(&(cred)->usage))
+ if (refcount_dec_and_test(&(cred)->usage))
__put_cred(cred);
}
}
diff --git a/kernel/cred.c b/kernel/cred.c
index ed95117d07c4..efde7ded71d4 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -39,7 +39,7 @@ struct group_info init_groups = { .usage = REFCOUNT_INIT(2) };
* The initial credentials for the initial task
*/
struct cred init_cred = {
- .usage = ATOMIC_INIT(4),
+ .usage = REFCOUNT_INIT(4),
#ifdef CONFIG_DEBUG_CREDENTIALS
.subscribers = ATOMIC_INIT(2),
.magic = CRED_MAGIC,
@@ -98,17 +98,17 @@ static void put_cred_rcu(struct rcu_head *rcu)
#ifdef CONFIG_DEBUG_CREDENTIALS
if (cred->magic != CRED_MAGIC_DEAD ||
- atomic_read(&cred->usage) != 0 ||
+ refcount_read(&cred->usage) != 0 ||
read_cred_subscribers(cred) != 0)
panic("CRED: put_cred_rcu() sees %p with"
" mag %x, put %p, usage %d, subscr %d\n",
cred, cred->magic, cred->put_addr,
- atomic_read(&cred->usage),
+ refcount_read(&cred->usage),
read_cred_subscribers(cred));
#else
- if (atomic_read(&cred->usage) != 0)
+ if (refcount_read(&cred->usage) != 0)
panic("CRED: put_cred_rcu() sees %p with usage %d\n",
- cred, atomic_read(&cred->usage));
+ cred, refcount_read(&cred->usage));
#endif
security_cred_free(cred);
@@ -132,10 +132,10 @@ static void put_cred_rcu(struct rcu_head *rcu)
void __put_cred(struct cred *cred)
{
kdebug("__put_cred(%p{%d,%d})", cred,
- atomic_read(&cred->usage),
+ refcount_read(&cred->usage),
read_cred_subscribers(cred));
- BUG_ON(atomic_read(&cred->usage) != 0);
+ BUG_ON(refcount_read(&cred->usage) != 0);
#ifdef CONFIG_DEBUG_CREDENTIALS
BUG_ON(read_cred_subscribers(cred) != 0);
cred->magic = CRED_MAGIC_DEAD;
@@ -159,7 +159,7 @@ void exit_creds(struct task_struct *tsk)
struct cred *cred;
kdebug("exit_creds(%u,%p,%p,{%d,%d})", tsk->pid, tsk->real_cred, tsk->cred,
- atomic_read(&tsk->cred->usage),
+ refcount_read(&tsk->cred->usage),
read_cred_subscribers(tsk->cred));
cred = (struct cred *) tsk->real_cred;
@@ -218,7 +218,7 @@ struct cred *cred_alloc_blank(void)
if (!new)
return NULL;
- atomic_set(&new->usage, 1);
+ refcount_set(&new->usage, 1);
#ifdef CONFIG_DEBUG_CREDENTIALS
new->magic = CRED_MAGIC;
#endif
@@ -265,7 +265,7 @@ struct cred *prepare_creds(void)
memcpy(new, old, sizeof(struct cred));
new->non_rcu = 0;
- atomic_set(&new->usage, 1);
+ refcount_set(&new->usage, 1);
set_cred_subscribers(new, 0);
get_group_info(new->group_info);
get_uid(new->user);
@@ -349,7 +349,7 @@ int copy_creds(struct task_struct *p, unsigned long clone_flags)
get_cred(p->cred);
alter_cred_subscribers(p->cred, 2);
kdebug("share_creds(%p{%d,%d})",
- p->cred, atomic_read(&p->cred->usage),
+ p->cred, refcount_read(&p->cred->usage),
read_cred_subscribers(p->cred));
atomic_inc(&p->cred->user->processes);
return 0;
@@ -440,7 +440,7 @@ int commit_creds(struct cred *new)
const struct cred *old = task->real_cred;
kdebug("commit_creds(%p{%d,%d})", new,
- atomic_read(&new->usage),
+ refcount_read(&new->usage),
read_cred_subscribers(new));
BUG_ON(task->cred != old);
@@ -449,7 +449,7 @@ int commit_creds(struct cred *new)
validate_creds(old);
validate_creds(new);
#endif
- BUG_ON(atomic_read(&new->usage) < 1);
+ BUG_ON(refcount_read(&new->usage) < 1);
get_cred(new); /* we will require a ref for the subj creds too */
@@ -523,13 +523,13 @@ EXPORT_SYMBOL(commit_creds);
void abort_creds(struct cred *new)
{
kdebug("abort_creds(%p{%d,%d})", new,
- atomic_read(&new->usage),
+ refcount_read(&new->usage),
read_cred_subscribers(new));
#ifdef CONFIG_DEBUG_CREDENTIALS
BUG_ON(read_cred_subscribers(new) != 0);
#endif
- BUG_ON(atomic_read(&new->usage) < 1);
+ BUG_ON(refcount_read(&new->usage) < 1);
put_cred(new);
}
EXPORT_SYMBOL(abort_creds);
@@ -546,7 +546,7 @@ const struct cred *override_creds(const struct cred *new)
const struct cred *old = current->cred;
kdebug("override_creds(%p{%d,%d})", new,
- atomic_read(&new->usage),
+ refcount_read(&new->usage),
read_cred_subscribers(new));
validate_creds(old);
@@ -569,7 +569,7 @@ const struct cred *override_creds(const struct cred *new)
alter_cred_subscribers(old, -1);
kdebug("override_creds() = %p{%d,%d}", old,
- atomic_read(&old->usage),
+ refcount_read(&old->usage),
read_cred_subscribers(old));
return old;
}
@@ -587,7 +587,7 @@ void revert_creds(const struct cred *old)
const struct cred *override = current->cred;
kdebug("revert_creds(%p{%d,%d})", old,
- atomic_read(&old->usage),
+ refcount_read(&old->usage),
read_cred_subscribers(old));
validate_creds(old);
@@ -699,7 +699,7 @@ struct cred *prepare_kernel_cred(struct task_struct *daemon)
*new = *old;
new->non_rcu = 0;
- atomic_set(&new->usage, 1);
+ refcount_set(&new->usage, 1);
set_cred_subscribers(new, 0);
get_uid(new->user);
get_user_ns(new->user_ns);
@@ -810,7 +810,7 @@ static void dump_invalid_creds(const struct cred *cred, const char *label,
printk(KERN_ERR "CRED: ->magic=%x, put_addr=%p\n",
cred->magic, cred->put_addr);
printk(KERN_ERR "CRED: ->usage=%d, subscr=%d\n",
- atomic_read(&cred->usage),
+ refcount_read(&cred->usage),
read_cred_subscribers(cred));
printk(KERN_ERR "CRED: ->*uid = { %d,%d,%d,%d }\n",
from_kuid_munged(&init_user_ns, cred->uid),
@@ -884,7 +884,7 @@ void validate_creds_for_do_exit(struct task_struct *tsk)
{
kdebug("validate_creds_for_do_exit(%p,%p{%d,%d})",
tsk->real_cred, tsk->cred,
- atomic_read(&tsk->cred->usage),
+ refcount_read(&tsk->cred->usage),
read_cred_subscribers(tsk->cred));
__validate_process_creds(tsk, __FILE__, __LINE__);
diff --git a/net/sunrpc/auth.c b/net/sunrpc/auth.c
index a9f0d17fdb0d..b2b7195536ae 100644
--- a/net/sunrpc/auth.c
+++ b/net/sunrpc/auth.c
@@ -39,7 +39,7 @@ static LIST_HEAD(cred_unused);
static unsigned long number_cred_unused;
static struct cred machine_cred = {
- .usage = ATOMIC_INIT(1),
+ .usage = REFCOUNT_INIT(1),
#ifdef CONFIG_DEBUG_CREDENTIALS
.magic = CRED_MAGIC,
#endif
--
2.25.1
From: Elena Reshetova <[email protected]>
atomic_t variables are currently used to implement reference
counters with the following properties:
- counter is initialized to 1 using atomic_set()
- a resource is freed upon counter reaching zero
- once counter reaches zero, its further
increments aren't allowed
- counter schema uses basic atomic operations
(set, inc, inc_not_zero, dec_and_test, etc.)
Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to use-after-free situation and be exploitable.
The variable nsproxy.count is used as a pure reference counter.
Convert it to refcount_t and fix up the operations.
**Important note for maintainers:
Some functions from refcount_t API defined in lib/refcount.c
have different memory ordering guarantees than their atomic
counterparts. Please check Documentation/core-api/refcount-vs-atomic.rst
for more information.
Normally the differences should not matter since refcount_t provides
enough guarantees to satisfy the refcounting use cases, but in
some rare cases it might matter.
Please double check that you don't have some undocumented
memory guarantees for this variable usage.
For the nsproxy.count it might make a difference
in the following places:
- put_nsproxy() and switch_task_namespaces(): decrement in
refcount_dec_and_test() only provides RELEASE ordering
and ACQUIRE ordering on success vs. fully ordered
atomic counterpart
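Purely for illustration (a hypothetical foo_put() mirroring
put_nsproxy(), with kfree() standing in for free_nsproxy()), the
ordering that the refcount_t variant does provide:

#include <linux/refcount.h>
#include <linux/slab.h>

struct foo {
        refcount_t usage;
};

static void foo_put(struct foo *f)
{
        /*
         * RELEASE ordering: this CPU's earlier accesses to *f are
         * ordered before the decrement becomes visible to others.
         */
        if (refcount_dec_and_test(&f->usage)) {
                /*
                 * ACQUIRE ordering on the final decrement: the free
                 * below is ordered after other CPUs' prior accesses to
                 * *f, which is what the freeing path needs; the full
                 * ordering of atomic_dec_and_test() is not relied on.
                 */
                kfree(f);       /* stands in for free_nsproxy() */
        }
}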
Suggested-by: Kees Cook <[email protected]>
Signed-off-by: Elena Reshetova <[email protected]>
Reviewed-by: David Windsor <[email protected]>
Reviewed-by: Hans Liljestrand <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Kees Cook <[email protected]>
---
include/linux/nsproxy.h | 7 +++----
kernel/nsproxy.c | 6 +++---
2 files changed, 6 insertions(+), 7 deletions(-)
diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
index cdb171efc7cb..6bd95f02de05 100644
--- a/include/linux/nsproxy.h
+++ b/include/linux/nsproxy.h
@@ -29,7 +29,7 @@ struct fs_struct;
* nsproxy is copied.
*/
struct nsproxy {
- atomic_t count;
+ refcount_t count;
struct uts_namespace *uts_ns;
struct ipc_namespace *ipc_ns;
struct mnt_namespace *mnt_ns;
@@ -101,14 +101,13 @@ int __init nsproxy_cache_init(void);
static inline void put_nsproxy(struct nsproxy *ns)
{
- if (atomic_dec_and_test(&ns->count)) {
+ if (refcount_dec_and_test(&ns->count))
free_nsproxy(ns);
- }
}
static inline void get_nsproxy(struct nsproxy *ns)
{
- atomic_inc(&ns->count);
+ refcount_inc(&ns->count);
}
#endif
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index b03df67621d0..74b5e6925e0f 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -30,7 +30,7 @@
static struct kmem_cache *nsproxy_cachep;
struct nsproxy init_nsproxy = {
- .count = ATOMIC_INIT(1),
+ .count = REFCOUNT_INIT(1),
.uts_ns = &init_uts_ns,
#if defined(CONFIG_POSIX_MQUEUE) || defined(CONFIG_SYSVIPC)
.ipc_ns = &init_ipc_ns,
@@ -55,7 +55,7 @@ static inline struct nsproxy *create_nsproxy(void)
nsproxy = kmem_cache_alloc(nsproxy_cachep, GFP_KERNEL);
if (nsproxy)
- atomic_set(&nsproxy->count, 1);
+ refcount_set(&nsproxy->count, 1);
return nsproxy;
}
@@ -250,7 +250,7 @@ void switch_task_namespaces(struct task_struct *p, struct nsproxy *new)
p->nsproxy = new;
task_unlock(p);
- if (ns && atomic_dec_and_test(&ns->count))
+ if (ns && refcount_dec_and_test(&ns->count))
free_nsproxy(ns);
}
--
2.25.1
From: Elena Reshetova <[email protected]>
atomic_t variables are currently used to implement reference
counters with the following properties:
- counter is initialized to 1 using atomic_set()
- a resource is freed upon counter reaching zero
- once counter reaches zero, its further
increments aren't allowed
- counter schema uses basic atomic operations
(set, inc, inc_not_zero, dec_and_test, etc.)
Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to use-after-free situation and be exploitable.
The variable group_info.usage is used as a pure reference counter.
Convert it to refcount_t and fix up the operations.
**Important note for maintainers:
Some functions from refcount_t API defined in lib/refcount.c
have different memory ordering guarantees than their atomic
counterparts. Please check Documentation/core-api/refcount-vs-atomic.rst
for more information.
Normally the differences should not matter since refcount_t provides
enough guarantees to satisfy the refcounting use cases, but in
some rare cases it might matter.
Please double check that you don't have some undocumented
memory guarantees for this variable usage.
For the group_info.usage it might make a difference
in the following places:
- put_group_info(): decrement in refcount_dec_and_test() only
provides RELEASE ordering and ACQUIRE ordering on success
vs. fully ordered atomic counterpart
Suggested-by: Kees Cook <[email protected]>
Signed-off-by: Elena Reshetova <[email protected]>
Reviewed-by: David Windsor <[email protected]>
Reviewed-by: Hans Liljestrand <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Kees Cook <[email protected]>
---
include/linux/cred.h | 7 ++++---
kernel/cred.c | 2 +-
kernel/groups.c | 2 +-
3 files changed, 6 insertions(+), 5 deletions(-)
diff --git a/include/linux/cred.h b/include/linux/cred.h
index 18639c069263..32be0daf5a32 100644
--- a/include/linux/cred.h
+++ b/include/linux/cred.h
@@ -12,6 +12,7 @@
#include <linux/init.h>
#include <linux/key.h>
#include <linux/atomic.h>
+#include <linux/refcount.h>
#include <linux/uidgid.h>
#include <linux/sched.h>
#include <linux/sched/user.h>
@@ -23,7 +24,7 @@ struct inode;
* COW Supplementary groups list
*/
struct group_info {
- atomic_t usage;
+ refcount_t usage;
int ngroups;
kgid_t gid[0];
} __randomize_layout;
@@ -39,7 +40,7 @@ struct group_info {
*/
static inline struct group_info *get_group_info(struct group_info *gi)
{
- atomic_inc(&gi->usage);
+ refcount_inc(&gi->usage);
return gi;
}
@@ -49,7 +50,7 @@ static inline struct group_info *get_group_info(struct group_info *gi)
*/
#define put_group_info(group_info) \
do { \
- if (atomic_dec_and_test(&(group_info)->usage)) \
+ if (refcount_dec_and_test(&(group_info)->usage)) \
groups_free(group_info); \
} while (0)
diff --git a/kernel/cred.c b/kernel/cred.c
index 421b1149c651..ed95117d07c4 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -33,7 +33,7 @@ do { \
static struct kmem_cache *cred_jar;
/* init to 2 - one for init_task, one to ensure it is never freed */
-struct group_info init_groups = { .usage = ATOMIC_INIT(2) };
+struct group_info init_groups = { .usage = REFCOUNT_INIT(2) };
/*
* The initial credentials for the initial task
diff --git a/kernel/groups.c b/kernel/groups.c
index 6ee6691f6839..22f2892dbaf3 100644
--- a/kernel/groups.c
+++ b/kernel/groups.c
@@ -24,7 +24,7 @@ struct group_info *groups_alloc(int gidsetsize)
if (!gi)
return NULL;
- atomic_set(&gi->usage, 1);
+ refcount_set(&gi->usage, 1);
gi->ngroups = gidsetsize;
return gi;
}
--
2.25.1
On 2020/6/13 2:34, Kees Cook wrote:
> From: Elena Reshetova <[email protected]>
>
> atomic_t variables are currently used to implement reference
> counters with the following properties:
> - counter is initialized to 1 using atomic_set()
> - a resource is freed upon counter reaching zero
> - once counter reaches zero, its further
> increments aren't allowed
> - counter schema uses basic atomic operations
> (set, inc, inc_not_zero, dec_and_test, etc.)
>
> Such atomic variables should be converted to a newly provided
> refcount_t type and API that prevents accidental counter overflows
> and underflows. This is important since overflows and underflows
> can lead to use-after-free situation and be exploitable.
>
> The variable cred.usage is used as a pure reference counter.
> Convert it to refcount_t and fix up the operations.
>
> **Important note for maintainers:
>
> Some functions from refcount_t API defined in lib/refcount.c
> have different memory ordering guarantees than their atomic
> counterparts. Please check Documentation/core-api/refcount-vs-atomic.rst
> for more information.
>
> Normally the differences should not matter since refcount_t provides
> enough guarantees to satisfy the refcounting use cases, but in
> some rare cases it might matter.
> Please double check that you don't have some undocumented
> memory guarantees for this variable usage.
>
> For the cred.usage it might make a difference
> in the following places:
> - get_task_cred(): increment in refcount_inc_not_zero() only
> guarantees control dependency on success vs. fully ordered
> atomic counterpart
> - put_cred(): decrement in refcount_dec_and_test() only
> provides RELEASE ordering and ACQUIRE ordering on success
> vs. fully ordered atomic counterpart
>
> Suggested-by: Kees Cook <[email protected]>
> Signed-off-by: Elena Reshetova <[email protected]>
> Reviewed-by: David Windsor <[email protected]>
> Reviewed-by: Hans Liljestrand <[email protected]>
> Link: https://lore.kernel.org/r/[email protected]
> Signed-off-by: Kees Cook <[email protected]>
Currently this patch is better than my RFC patch.
Looks good to me.
Thanks
Xiaoming Ni
On 2020/6/13 2:34, Kees Cook wrote:
> This series was never applied[1], and was recently pointed out as
> missing[2]. If someone has a tree for this, please take it. Otherwise,
> please Ack and I'll send it to Linus.
>
> Thanks!
>
> -Kees
>
> [1] https://lore.kernel.org/lkml/[email protected]/
> [2] https://lore.kernel.org/lkml/[email protected]/
>
> Elena Reshetova (3):
> nsproxy: convert nsproxy.count to refcount_t
> groups: convert group_info.usage to refcount_t
> creds: convert cred.usage to refcount_t
>
> include/linux/cred.h | 15 +++++++-------
> include/linux/nsproxy.h | 7 +++----
> kernel/cred.c | 44 ++++++++++++++++++++---------------------
> kernel/groups.c | 2 +-
> kernel/nsproxy.c | 6 +++---
> net/sunrpc/auth.c | 2 +-
> 6 files changed, 38 insertions(+), 38 deletions(-)
>
Should mm->mm_users also be replaced by refcount_t?
In addition, is it better to change all variables that use
atomic_dec_and_test to control the release process to refcount_t?
Thanks
Xiaoming Ni
Greeting,
FYI, we noticed a 4.3% improvement of will-it-scale.per_process_ops due to commit:
commit: 67467ae14130847791f230fbc9f261d0c819b9c3 ("[PATCH 2/3] groups: convert group_info.usage to refcount_t")
url: https://github.com/0day-ci/linux/commits/Kees-Cook/Convert-nsproxy-groups-and-creds-to-refcount_t/20200613-023706
base: git://git.linux-nfs.org/projects/trondmy/linux-nfs.git linux-next
in testcase: will-it-scale
on test machine: 104 threads Skylake with 192G memory
with following parameters:
nr_task: 100%
mode: process
test: poll2
cpufreq_governor: performance
ucode: 0x2000065
test-description: Will It Scale takes a testcase and runs it from 1 through to n parallel copies to see if the testcase will scale. It builds both a process and threads based test in order to see any differences between the two.
test-url: https://github.com/antonblanchard/will-it-scale
Details are as below:
-------------------------------------------------------------------------------------------------->
To reproduce:
git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp install job.yaml # job file is attached in this email
bin/lkp run job.yaml
=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase/ucode:
gcc-9/performance/x86_64-rhel-7.6/process/100%/debian-x86_64-20191114.cgz/lkp-skl-fpga01/poll2/will-it-scale/0x2000065
commit:
bcaef9d22e ("nsproxy: convert nsproxy.count to refcount_t")
67467ae141 ("groups: convert group_info.usage to refcount_t")
bcaef9d22e69accf 67467ae14130847791f230fbc9f
---------------- ---------------------------
%stddev %change %stddev
\ | \
205986 +4.3% 214828 will-it-scale.per_process_ops
21422614 +4.3% 22342250 will-it-scale.workload
6978 ± 50% +99.5% 13922 ± 26% numa-meminfo.node0.Inactive
6819 ± 52% +101.5% 13739 ± 26% numa-meminfo.node0.Inactive(anon)
8206 ± 46% +85.9% 15258 ± 23% numa-meminfo.node0.Shmem
21268 ± 16% -32.4% 14373 ± 25% numa-meminfo.node1.Inactive
21078 ± 16% -32.7% 14195 ± 25% numa-meminfo.node1.Inactive(anon)
1704 ± 52% +101.5% 3434 ± 26% numa-vmstat.node0.nr_inactive_anon
2051 ± 46% +85.9% 3813 ± 23% numa-vmstat.node0.nr_shmem
1704 ± 52% +101.5% 3434 ± 26% numa-vmstat.node0.nr_zone_inactive_anon
5270 ± 16% -32.7% 3549 ± 25% numa-vmstat.node1.nr_inactive_anon
5270 ± 16% -32.7% 3549 ± 25% numa-vmstat.node1.nr_zone_inactive_anon
6.22 ± 2% +11.5% 6.94 ± 6% sched_debug.cfs_rq:/.nr_spread_over.stddev
-359667 -32.2% -243708 sched_debug.cfs_rq:/.spread0.min
739.75 ± 6% +13.1% 836.96 sched_debug.cfs_rq:/.util_avg.min
67.85 -20.0% 54.29 ± 5% sched_debug.cfs_rq:/.util_avg.stddev
0.15 ± 5% +14.7% 0.17 ± 7% sched_debug.cpu.nr_running.stddev
450.25 ± 41% +177.8% 1250 ± 25% interrupts.39:PCI-MSI.67633154-edge.eth0-TxRx-1
876.00 ± 9% +54.8% 1356 ± 21% interrupts.CPU26.RES:Rescheduling_interrupts
450.25 ± 41% +177.8% 1250 ± 25% interrupts.CPU31.39:PCI-MSI.67633154-edge.eth0-TxRx-1
5403 ± 27% +40.9% 7615 ± 5% interrupts.CPU49.NMI:Non-maskable_interrupts
5403 ± 27% +40.9% 7615 ± 5% interrupts.CPU49.PMI:Performance_monitoring_interrupts
6577 ± 11% -35.1% 4267 ± 14% interrupts.CPU54.RES:Rescheduling_interrupts
358.00 ± 20% +91.8% 686.75 ± 70% interrupts.CPU96.RES:Rescheduling_interrupts
4.835e+10 +4.2% 5.04e+10 perf-stat.i.branch-instructions
0.31 -0.0 0.30 perf-stat.i.branch-miss-rate%
1.407e+08 +1.8% 1.432e+08 perf-stat.i.branch-misses
6.49 ± 9% +1.1 7.63 ± 5% perf-stat.i.cache-miss-rate%
397271 ± 5% +24.2% 493463 ± 5% perf-stat.i.cache-misses
1.18 -4.2% 1.13 perf-stat.i.cpi
836738 ± 7% -22.3% 650325 ± 4% perf-stat.i.cycles-between-cache-misses
21389813 +4.3% 22319467 perf-stat.i.dTLB-load-misses
5.514e+10 +4.3% 5.753e+10 perf-stat.i.dTLB-loads
2.535e+10 +4.4% 2.646e+10 perf-stat.i.dTLB-stores
21063107 +4.7% 22042637 perf-stat.i.iTLB-load-misses
2.39e+11 +4.2% 2.49e+11 perf-stat.i.instructions
0.85 +4.3% 0.89 perf-stat.i.ipc
1.19 +2.7% 1.22 perf-stat.i.metric.K/sec
1238 +4.3% 1292 perf-stat.i.metric.M/sec
88617 +4.6% 92673 perf-stat.i.node-load-misses
16016 ± 8% +11.5% 17852 ± 5% perf-stat.i.node-loads
0.29 -0.0 0.28 perf-stat.overall.branch-miss-rate%
6.75 ± 8% +1.1 7.83 ± 5% perf-stat.overall.cache-miss-rate%
1.18 -4.2% 1.13 perf-stat.overall.cpi
708307 ± 5% -19.6% 569690 ± 5% perf-stat.overall.cycles-between-cache-misses
0.85 +4.4% 0.89 perf-stat.overall.ipc
4.819e+10 +4.2% 5.023e+10 perf-stat.ps.branch-instructions
1.402e+08 +1.8% 1.427e+08 perf-stat.ps.branch-misses
397188 ± 5% +24.1% 492884 ± 5% perf-stat.ps.cache-misses
21318083 +4.3% 22244871 perf-stat.ps.dTLB-load-misses
5.495e+10 +4.3% 5.734e+10 perf-stat.ps.dTLB-loads
2.526e+10 +4.4% 2.637e+10 perf-stat.ps.dTLB-stores
20991781 +4.7% 21968503 perf-stat.ps.iTLB-load-misses
2.382e+11 +4.2% 2.482e+11 perf-stat.ps.instructions
88329 +4.6% 92369 perf-stat.ps.node-load-misses
16250 ± 7% +11.0% 18033 ± 5% perf-stat.ps.node-loads
7.197e+13 +4.3% 7.507e+13 perf-stat.total.instructions
18.52 -3.2 15.28 perf-profile.calltrace.cycles-pp.__fget_light.do_sys_poll.__x64_sys_poll.do_syscall_64.entry_SYSCALL_64_after_hwframe
2.75 -0.2 2.57 ± 3% perf-profile.calltrace.cycles-pp._copy_from_user.do_sys_poll.__x64_sys_poll.do_syscall_64.entry_SYSCALL_64_after_hwframe
2.85 -0.2 2.69 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64.poll
1.27 -0.1 1.17 ± 2% perf-profile.calltrace.cycles-pp.__check_object_size.do_sys_poll.__x64_sys_poll.do_syscall_64.entry_SYSCALL_64_after_hwframe
0.81 ± 2% -0.1 0.75 perf-profile.calltrace.cycles-pp.__kmalloc.do_sys_poll.__x64_sys_poll.do_syscall_64.entry_SYSCALL_64_after_hwframe
89.61 +0.2 89.80 perf-profile.calltrace.cycles-pp.__x64_sys_poll.do_syscall_64.entry_SYSCALL_64_after_hwframe.poll
93.97 +0.2 94.16 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.poll
93.70 +0.2 93.94 perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.poll
2.30 +0.5 2.81 perf-profile.calltrace.cycles-pp.__fdget.do_sys_poll.__x64_sys_poll.do_syscall_64.entry_SYSCALL_64_after_hwframe
88.02 +0.8 88.84 perf-profile.calltrace.cycles-pp.do_sys_poll.__x64_sys_poll.do_syscall_64.entry_SYSCALL_64_after_hwframe.poll
17.85 -2.9 14.92 perf-profile.children.cycles-pp.__fget_light
2.79 -0.2 2.60 ± 3% perf-profile.children.cycles-pp._copy_from_user
2.85 -0.2 2.70 perf-profile.children.cycles-pp.entry_SYSCALL_64
1.33 -0.1 1.21 ± 2% perf-profile.children.cycles-pp.__check_object_size
0.58 -0.1 0.51 ± 2% perf-profile.children.cycles-pp.__might_fault
0.87 -0.1 0.81 perf-profile.children.cycles-pp.__kmalloc
0.37 ± 3% -0.1 0.31 ± 3% perf-profile.children.cycles-pp.___might_sleep
0.12 ± 3% -0.0 0.10 perf-profile.children.cycles-pp.check_stack_object
89.63 +0.2 89.81 perf-profile.children.cycles-pp.__x64_sys_poll
94.00 +0.2 94.20 perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
93.75 +0.2 93.98 perf-profile.children.cycles-pp.do_syscall_64
89.12 +0.2 89.36 perf-profile.children.cycles-pp.do_sys_poll
2.29 +0.5 2.75 perf-profile.children.cycles-pp.__fdget
16.65 -3.1 13.54 perf-profile.self.cycles-pp.__fget_light
2.50 -0.2 2.33 perf-profile.self.cycles-pp.entry_SYSCALL_64
0.38 ± 2% -0.1 0.30 ± 2% perf-profile.self.cycles-pp.__check_object_size
0.36 ± 2% -0.1 0.30 ± 3% perf-profile.self.cycles-pp.___might_sleep
0.52 -0.0 0.47 perf-profile.self.cycles-pp.poll
0.44 -0.0 0.40 perf-profile.self.cycles-pp.__kmalloc
0.41 -0.0 0.37 ± 2% perf-profile.self.cycles-pp.__x64_sys_poll
0.26 -0.0 0.22 ± 3% perf-profile.self.cycles-pp.entry_SYSCALL_64_after_hwframe
0.17 ± 4% -0.0 0.15 perf-profile.self.cycles-pp.__might_fault
0.08 ± 5% -0.0 0.07 ± 7% perf-profile.self.cycles-pp.poll_select_set_timeout
0.11 ± 4% -0.0 0.09 ± 4% perf-profile.self.cycles-pp.check_stack_object
0.09 +0.0 0.10 perf-profile.self.cycles-pp.poll_freewait
4.02 +0.1 4.07 perf-profile.self.cycles-pp.do_syscall_64
1.16 +0.2 1.37 perf-profile.self.cycles-pp.__fdget
65.19 +3.5 68.64 perf-profile.self.cycles-pp.do_sys_poll
will-it-scale.per_process_ops
218000 +------------------------------------------------------------------+
| |
216000 |-+ O O O O O O O O O |
| O O O O O O O O O O |
214000 |-+ |
| |
212000 |-+O O O O O O |
| |
210000 |-+ |
| |
208000 |-+ .+..+. .+..+.. +..|
| .+..+. .+.+. +..+ .+.+.. .+.+.. : |
206000 |..+ +. +.+. +.+..+. +. : |
| +..+ |
204000 +------------------------------------------------------------------+
[*] bisect-good sample
[O] bisect-bad sample
Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.
Thanks,
Rong Chen
On Mon, Jun 15, 2020 at 10:10:08AM +0800, Xiaoming Ni wrote:
> On 2020/6/13 2:34, Kees Cook wrote:
> > This series was never applied[1], and was recently pointed out as
> > missing[2]. If someone has a tree for this, please take it. Otherwise,
> > please Ack and I'll send it to Linus.
> >
> > Thanks!
> >
> > -Kees
> >
> > [1] https://lore.kernel.org/lkml/[email protected]/
> > [2] https://lore.kernel.org/lkml/[email protected]/
> >
> > Elena Reshetova (3):
> > nsproxy: convert nsproxy.count to refcount_t
> > groups: convert group_info.usage to refcount_t
> > creds: convert cred.usage to refcount_t
> >
> > include/linux/cred.h | 15 +++++++-------
> > include/linux/nsproxy.h | 7 +++----
> > kernel/cred.c | 44 ++++++++++++++++++++---------------------
> > kernel/groups.c | 2 +-
> > kernel/nsproxy.c | 6 +++---
> > net/sunrpc/auth.c | 2 +-
> > 6 files changed, 38 insertions(+), 38 deletions(-)
> >
>
> Should mm->mm_users also be replaced by refcount_t?
I'll say "yes". :)
https://lore.kernel.org/lkml/[email protected]/
> In addition, is it better to change all variables that use
> atomic_dec_and_test to control the release process to refcount_t?
For the most part, yes. The following may find a lot of them:
scripts/coccinelle/api/atomic_as_refcounter.cocci
If you can go through that and double check for prior series from Elena,
we can get through all the rest of them.
Thanks for bringing this topic back up!
--
Kees Cook
> On Mon, Jun 15, 2020 at 10:10:08AM +0800, Xiaoming Ni wrote:
> > On 2020/6/13 2:34, Kees Cook wrote:
> > > This series was never applied[1], and was recently pointed out as
> > > missing[2]. If someone has a tree for this, please take it. Otherwise,
> > > please Ack and I'll send it to Linus.
> > >
> > > Thanks!
> > >
> > > -Kees
> > >
> > > [1] https://lore.kernel.org/lkml/20190306110549.7628-1-
> [email protected]/
> > > [2] https://lore.kernel.org/lkml/1591957695-118312-1-git-send-email-
> [email protected]/
> > >
> > > Elena Reshetova (3):
> > > nsproxy: convert nsproxy.count to refcount_t
> > > groups: convert group_info.usage to refcount_t
> > > creds: convert cred.usage to refcount_t
> > >
> > > include/linux/cred.h | 15 +++++++-------
> > > include/linux/nsproxy.h | 7 +++----
> > > kernel/cred.c | 44 ++++++++++++++++++++---------------------
> > > kernel/groups.c | 2 +-
> > > kernel/nsproxy.c | 6 +++---
> > > net/sunrpc/auth.c | 2 +-
> > > 6 files changed, 38 insertions(+), 38 deletions(-)
> > >
> >
> > Should mm->mm_users also be replaced by refcount_t?
>
> I'll say "yes". :)
> https://lore.kernel.org/lkml/1487671124-11188-1-git-send-email-
> [email protected]/
>
> > In addition, is it better to change all variables that use
> > atomic_dec_and_test to control the release process to refcount_t?
>
> For the most part, yes. The following may find a lot of them:
> scripts/coccinelle/api/atomic_as_refcounter.cocci
>
> If you can go through that and double check for prior series from Elena,
> we can get through all the rest of them.
I have these ones also here in case anyone is interested:
https://github.com/ereshetova/linux-stable/commits/refcount_t_fs
https://github.com/ereshetova/linux-stable/commits/refcount_t_block
They haven't been rebased for a while, but if there is an interest,
I can certainly do it.
Best Regards,
Elena.
On Mon, Jun 15, 2020 at 05:35:58AM +0000, Reshetova, Elena wrote:
> > On Mon, Jun 15, 2020 at 10:10:08AM +0800, Xiaoming Ni wrote:
> > > On 2020/6/13 2:34, Kees Cook wrote:
> > > > This series was never applied[1], and was recently pointed out as
> > > > missing[2]. If someone has a tree for this, please take it. Otherwise,
> > > > please Ack and I'll send it to Linus.
> > > >
> > > > Thanks!
> > > >
> > > > -Kees
> > > >
> > > > [1] https://lore.kernel.org/lkml/20190306110549.7628-1-
> > [email protected]/
> > > > [2] https://lore.kernel.org/lkml/1591957695-118312-1-git-send-email-
> > [email protected]/
> > > >
> > > > Elena Reshetova (3):
> > > > nsproxy: convert nsproxy.count to refcount_t
> > > > groups: convert group_info.usage to refcount_t
> > > > creds: convert cred.usage to refcount_t
> > > >
> > > > include/linux/cred.h | 15 +++++++-------
> > > > include/linux/nsproxy.h | 7 +++----
> > > > kernel/cred.c | 44 ++++++++++++++++++++---------------------
> > > > kernel/groups.c | 2 +-
> > > > kernel/nsproxy.c | 6 +++---
> > > > net/sunrpc/auth.c | 2 +-
> > > > 6 files changed, 38 insertions(+), 38 deletions(-)
> > > >
> > >
> > > Should mm->mm_users also be replaced by refcount_t?
> >
> > I'll say "yes". :)
> > https://lore.kernel.org/lkml/1487671124-11188-1-git-send-email-
> > [email protected]/
> >
> > > In addition, is it better to change all variables that use
> > > atomic_dec_and_test to control the release process to refcount_t?
> >
> > For the most part, yes. The following may find a lot of them:
> > scripts/coccinelle/api/atomic_as_refcounter.cocci
> >
> > If you can go through that and double check for prior series from Elena,
> > we can get through all the rest of them.
>
> I have these ones also here in case anyone is interested:
> https://github.com/ereshetova/linux-stable/commits/refcount_t_fs
> https://github.com/ereshetova/linux-stable/commits/refcount_t_block
>
> They haven't been rebased for a while, but if there is an interest,
> I can certainly do it.
That would be great; yes!
--
Kees Cook
> > I have these ones also here in case anyone is interested:
> > https://github.com/ereshetova/linux-stable/commits/refcount_t_fs
> > https://github.com/ereshetova/linux-stable/commits/refcount_t_block
> >
> > They haven't been rebased for a while, but if there is an interest,
> > I can certainly do it.
>
> That would be great; yes!
Here are leftover pieces now rebased on top of next/stable:
https://github.com/ereshetova/linux-stable/commits/refcount_t_fs
https://github.com/ereshetova/linux-stable/commits/refcount_t_block
Haven't tested them beyond allyesconfig compile through...
There are also mm pieces (ignore bdi_writeback_congested.refcnt as it is already done)
that were sent originally here:
https://lore.kernel.org/lkml/[email protected]/
But especially mm_struct pieces would need a new look and proper
checks.
Again, happy to do them also on a separate branch, if there is an interest to merge
them (I expect mm_struct to be much more sensitive structure for such changes).
Best Regards,
Elena
Kees Cook <[email protected]> wrote:
> > Should mm->mm_users also be replaced by refcount_t?
>
> I'll say "yes". :)
> https://lore.kernel.org/lkml/[email protected]/
>
> > In addition, is it better to change all variables that use
> atomic_dec_and_test to control the release process to refcount_t?
>
> For the most part, yes. The following may find a lot of them:
> scripts/coccinelle/api/atomic_as_refcounter.cocci
I've been gradually undoing some of the conversions as there's no equivalent
of atomic_add_return() and atomic_dec_return() that allow me to log the
altered refcount through a tracepoint.
David
> https://github.com/ereshetova/linux-stable/commits/refcount_t_fs
Looking at "fs, cachefiles: convert cachefiles_object.usage from atomic_t to
refcount_t", I see:
- u = atomic_inc_return(&object->usage);
+ refcount_inc(&object->usage);
trace_cachefiles_ref(object, _object->cookie,
- (enum cachefiles_obj_ref_trace)why, u);
+ (enum cachefiles_obj_ref_trace)why, refcount_read(&object->usage));
return &object->fscache;
This change is *not* equivalent. There's a reason I'm using
atomic_inc_return() and not atomic_inc(), atomic_read(). Yes, the small window
*does* occasionally produce incorrect tracing, and, yes, when that happens it
does make things confusing.
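Schematically (a made-up object, with trace_printk() standing in for
the real tracepoint), the converted increment path does:

#include <linux/kernel.h>
#include <linux/refcount.h>

struct obj {
        refcount_t usage;
};

static void obj_get_traced(struct obj *o)
{
        unsigned int n;

        refcount_inc(&o->usage);
        /* another CPU can take or drop a reference in this window */
        n = refcount_read(&o->usage);
        /* n is not guaranteed to be the value our increment produced */
        trace_printk("obj %p usage %u\n", o, n);
}

The decrement side below has the same problem.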
- u = atomic_dec_return(&object->usage);
trace_cachefiles_ref(object, _object->cookie,
- (enum cachefiles_obj_ref_trace)why, u);
- ASSERTCMP(u, !=, -1);
- if (u == 0) {
+ (enum cachefiles_obj_ref_trace)why, refcount_read(&object->usage) - 1);
+ if (refcount_dec_and_test(&object->usage)) {
This is also not equivalent. Again, there's a reason I'm using
atomic_dec_return() and not atomic_read(), atomic_dec_and_test().
So, please drop the cachefiles/fscache patches or use refcount_inc_return()
and refcount_dec_return().
Another reason to please drop these patches is that they will cause my
fscache-iter branch to bounce. A lot of this code is deleted/heavily
rewritten:
https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=fscache-iter
Thanks,
David
On Tue, Jul 21, 2020 at 11:51:04AM +0100, David Howells wrote:
> Kees Cook <[email protected]> wrote:
>
> > > Should mm->mm_users also be replaced by refcount_t?
> >
> > I'll say "yes". :)
> > https://lore.kernel.org/lkml/[email protected]/
> >
> > > In addition, is it better to change all variables that use
> > atomic_dec_and_test to control the release process to refcount_t?
> >
> > For the most part, yes. The following may find a lot of them:
> > scripts/coccinelle/api/atomic_as_refcounter.cocci
>
> I've been gradually undoing some of the conversions as there's no equivalent
> of atomic_add_return() and atomic_dec_return() that allow me to log the
> altered refcount through a tracepoint.
Please do not _undo_ the changes; just add the API you need.
--
Kees Cook
On Tue, Jul 21, 2020 at 11:44:53AM -0700, Kees Cook wrote:
> On Tue, Jul 21, 2020 at 11:51:04AM +0100, David Howells wrote:
> > Kees Cook <[email protected]> wrote:
> >
> > > > Should mm->mm_users also be replaced by refcount_t?
> > >
> > > I'll say "yes". :)
> > > https://lore.kernel.org/lkml/[email protected]/
> > >
> > > > In addition, is it better to change all variables that use
> > > > atomic_dec_and_test to control the release process to refcount_t?
> > >
> > > For the most part, yes. The following may find a lot of them:
> > > scripts/coccinelle/api/atomic_as_refcounter.cocci
> >
> > I've been gradually undoing some of the conversions as there's no equivalent
> > of atomic_add_return() and atomic_dec_return() that allow me to log the
> > altered refcount through a tracepoint.
>
> Please do not _undo_ the changes; just add the API you need.
add_return and sub_return are horrible interface for refcount, which is
the problem.
If you meant: refcount_dec(), but want the old value for tracing, you
want a different ordering than if you wanted to do
refcount_dec_and_test(); dec_return can't know this.
David, would something like a __refcount_*() API work where there is a
3rd argument (int *), which, if !NULL, will be assigned the old value?
Peter Zijlstra <[email protected]> wrote:
> > Please do not _undo_ the changes; just add the API you need.
>
> add_return and sub_return are horrible interface for refcount, which is
> the problem.
>
> If you meant: refcount_dec(), but want the old value for tracing, you
> want a different ordering than if you wanted to do
> refcount_dec_and_test(); dec_return can't know this.
>
> David, would something like a __refcount_*() API work where there is a
> 3rd argument (int *), which, if !NULL, will be assigned the old value?
That would be fine, though the number needs to be something I can interpret
easily when looking through the traces. It would also be okay if there was an
interpretation function that I could use in the trace point when setting the
variable.
Say:
void rxrpc_get_call(struct rxrpc_call *call, enum rxrpc_call_trace op)
{
const void *here = __builtin_return_address(0);
unsigned int n;
refcount_inc_return(&call->usage, &n);
trace_rxrpc_call(call->debug_id, op, n, here, NULL);
}
then:
TRACE_EVENT(rxrpc_call,
TP_PROTO(..., int usage, ...),
TP_ARGS(...),
TP_STRUCT__entry(
...
__field(int, usage)
...
),
TP_fast_assign(
...
__entry->usage = refcount_interpret(usage);
...
),
TP_printk("... u=%d ...",
...
__entry->usage,
...)
);
so that it looks like the refcount is 'complete' at 0.
David
Kees Cook <[email protected]> wrote:
> > I've been gradually undoing some of the conversions as there's no equivalent
> > of atomic_add_return() and atomic_dec_return() that allow me to log the
> > altered refcount through a tracepoint.
>
> Please do not _undo_ the changes; just add the API you need.
And _please_ do not convert atomic_inc/dec_return() into refcount_read() +
refcount_xxx(). They are _not_ equivalent.
David
On Tue, Jul 28, 2020 at 11:56:38AM +0100, David Howells wrote:
> Peter Zijlstra <[email protected]> wrote:
>
> > > Please do not _undo_ the changes; just add the API you need.
> >
> > add_return and sub_return are horrible interface for refcount, which is
> > the problem.
> >
> > If you meant: refcount_dec(), but want the old value for tracing, you
> > want a different ordering than if you wanted to do
> > refcount_dec_and_test(); dec_return can't know this.
> >
> > David, would something like a __refcount_*() API work where there is a
> > 3rd argument (int *), which, if !NULL, will be assigned the old value?
>
> That would be fine, though the number needs to be something I can interpret
> easily when looking through the traces. It would also be okay if there was an
> interpretation function that I could use in the trace point when setting the
> variable.
I'm not entirely sure what you mean with interpret, provided you don't
trigger a refcount fail, the number will be just what you expect and
would get from refcount_read(). If you do trigger a fail, you'll get a
negative value.
How's the below? I didn't provide __refcount versions for the external
functions, I suppose that can be done too, but wondered if you really
needed those.
---
Subject: locking/refcount: Provide __refcount API to obtain the old value
From: Peter Zijlstra <[email protected]>
Date: Wed Jul 29 13:00:57 CEST 2020
David requested means to obtain the old/previous value from the
refcount API for tracing purposes.
Duplicate (most of) the API as __refcount*() with an additional
'int *' argument into which, if !NULL, the old value will be stored.
Requested-by: David Howells <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
include/linux/refcount.h | 65 +++++++++++++++++++++++++++++++++++++++++------
1 file changed, 57 insertions(+), 8 deletions(-)
--- a/include/linux/refcount.h
+++ b/include/linux/refcount.h
@@ -165,7 +165,7 @@ static inline unsigned int refcount_read
*
* Return: false if the passed refcount is 0, true otherwise
*/
-static inline __must_check bool refcount_add_not_zero(int i, refcount_t *r)
+static inline __must_check bool __refcount_add_not_zero(int i, refcount_t *r, int *oldp)
{
int old = refcount_read(r);
@@ -174,12 +174,20 @@ static inline __must_check bool refcount
break;
} while (!atomic_try_cmpxchg_relaxed(&r->refs, &old, old + i));
+ if (oldp)
+ *oldp = old;
+
if (unlikely(old < 0 || old + i < 0))
refcount_warn_saturate(r, REFCOUNT_ADD_NOT_ZERO_OVF);
return old;
}
+static inline __must_check bool refcount_add_not_zero(int i, refcount_t *r)
+{
+ return __refcount_add_not_zero(i, r, NULL);
+}
+
/**
* refcount_add - add a value to a refcount
* @i: the value to add to the refcount
@@ -196,16 +204,24 @@ static inline __must_check bool refcount
* cases, refcount_inc(), or one of its variants, should instead be used to
* increment a reference count.
*/
-static inline void refcount_add(int i, refcount_t *r)
+static inline void __refcount_add(int i, refcount_t *r, int *oldp)
{
int old = atomic_fetch_add_relaxed(i, &r->refs);
+ if (oldp)
+ *oldp = old;
+
if (unlikely(!old))
refcount_warn_saturate(r, REFCOUNT_ADD_UAF);
else if (unlikely(old < 0 || old + i < 0))
refcount_warn_saturate(r, REFCOUNT_ADD_OVF);
}
+static inline void refcount_add(int i, refcount_t *r)
+{
+ __refcount_add(i, r, NULL);
+}
+
/**
* refcount_inc_not_zero - increment a refcount unless it is 0
* @r: the refcount to increment
@@ -219,9 +235,14 @@ static inline void refcount_add(int i, r
*
* Return: true if the increment was successful, false otherwise
*/
+static inline __must_check bool __refcount_inc_not_zero(refcount_t *r, int *oldp)
+{
+ return __refcount_add_not_zero(1, r, oldp);
+}
+
static inline __must_check bool refcount_inc_not_zero(refcount_t *r)
{
- return refcount_add_not_zero(1, r);
+ return __refcount_inc_not_zero(r, NULL);
}
/**
@@ -236,9 +257,14 @@ static inline __must_check bool refcount
* Will WARN if the refcount is 0, as this represents a possible use-after-free
* condition.
*/
+static inline void __refcount_inc(refcount_t *r, int *oldp)
+{
+ __refcount_add(1, r, oldp);
+}
+
static inline void refcount_inc(refcount_t *r)
{
- refcount_add(1, r);
+ __refcount_inc(r, NULL);
}
/**
@@ -261,10 +287,13 @@ static inline void refcount_inc(refcount
*
* Return: true if the resulting refcount is 0, false otherwise
*/
-static inline __must_check bool refcount_sub_and_test(int i, refcount_t *r)
+static inline __must_check bool __refcount_sub_and_test(int i, refcount_t *r, int *oldp)
{
int old = atomic_fetch_sub_release(i, &r->refs);
+ if (oldp)
+ *oldp = old;
+
if (old == i) {
smp_acquire__after_ctrl_dep();
return true;
@@ -276,6 +305,11 @@ static inline __must_check bool refcount
return false;
}
+static inline __must_check bool refcount_sub_and_test(int i, refcount_t *r)
+{
+ return __refcount_sub_and_test(i, r, NULL);
+}
+
/**
* refcount_dec_and_test - decrement a refcount and test if it is 0
* @r: the refcount
@@ -289,9 +323,14 @@ static inline __must_check bool refcount
*
* Return: true if the resulting refcount is 0, false otherwise
*/
+static inline __must_check bool __refcount_dec_and_test(refcount_t *r, int *oldp)
+{
+ return __refcount_sub_and_test(1, r, oldp);
+}
+
static inline __must_check bool refcount_dec_and_test(refcount_t *r)
{
- return refcount_sub_and_test(1, r);
+ return __refcount_dec_and_test(r, NULL);
}
/**
@@ -304,12 +343,22 @@ static inline __must_check bool refcount
* Provides release memory ordering, such that prior loads and stores are done
* before.
*/
-static inline void refcount_dec(refcount_t *r)
+static inline void __refcount_dec(refcount_t *r, int *oldp)
{
- if (unlikely(atomic_fetch_sub_release(1, &r->refs) <= 1))
+ int old = atomic_fetch_sub_release(1, &r->refs);
+
+ if (oldp)
+ *oldp = old;
+
+ if (unlikely(old <= 1))
refcount_warn_saturate(r, REFCOUNT_DEC_LEAK);
}
+static inline void refcount_dec(refcount_t *r)
+{
+ __refcount_dec(r, NULL);
+}
+
extern __must_check bool refcount_dec_if_one(refcount_t *r);
extern __must_check bool refcount_dec_not_one(refcount_t *r);
extern __must_check bool refcount_dec_and_mutex_lock(refcount_t *r, struct mutex *lock);
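For completeness, a caller (hypothetical 'obj', with trace_printk()
standing in for a real tracepoint) would then do something like:

#include <linux/refcount.h>
#include <linux/slab.h>

struct obj {
        refcount_t usage;
};

static void obj_get(struct obj *o)
{
        int old;

        __refcount_inc(&o->usage, &old);
        /* 'old' is the pre-increment value; old + 1 is the new count */
        trace_printk("obj %p ref %d -> %d\n", o, old, old + 1);
}

static void obj_put(struct obj *o)
{
        int old;

        if (__refcount_dec_and_test(&o->usage, &old)) {
                trace_printk("obj %p ref %d -> 0\n", o, old);
                kfree(o);
        } else {
                trace_printk("obj %p ref %d -> %d\n", o, old, old - 1);
        }
}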
On Wed, Jul 29, 2020 at 01:11:20PM +0200, [email protected] wrote:
> Subject: locking/refcount: Provide __refcount API to obtain the old value
> From: Peter Zijlstra <[email protected]>
> Date: Wed Jul 29 13:00:57 CEST 2020
>
> David requested means to obtain the old/previous value from the
> refcount API for tracing purposes.
>
> Duplicate (most of) the API as __refcount*() with an additional
> 'int *' argument into which, if !NULL, the old value will be stored.
>
> Requested-by: David Howells <[email protected]>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> ---
> include/linux/refcount.h | 65 +++++++++++++++++++++++++++++++++++++++++------
> 1 file changed, 57 insertions(+), 8 deletions(-)
>
> --- a/include/linux/refcount.h
> +++ b/include/linux/refcount.h
> @@ -165,7 +165,7 @@ static inline unsigned int refcount_read
> *
> * Return: false if the passed refcount is 0, true otherwise
> */
> -static inline __must_check bool refcount_add_not_zero(int i, refcount_t *r)
> +static inline __must_check bool __refcount_add_not_zero(int i, refcount_t *r, int *oldp)
> {
> int old = refcount_read(r);
>
> @@ -174,12 +174,20 @@ static inline __must_check bool refcount
> break;
> } while (!atomic_try_cmpxchg_relaxed(&r->refs, &old, old + i));
>
> + if (oldp)
> + *oldp = old;
> +
> if (unlikely(old < 0 || old + i < 0))
> refcount_warn_saturate(r, REFCOUNT_ADD_NOT_ZERO_OVF);
>
> return old;
> }
>
> +static inline __must_check bool refcount_add_not_zero(int i, refcount_t *r)
> +{
> + return __refcount_add_not_zero(i, r, NULL);
> +}
so, I could also emulate C++'s
bool refcount_add_not_zero(int i, refcount_t *r, int *oldp = NULL)
style by going to town on this with a bunch of CPP magic, but I don't
think that'll actually make things clearer.
It'll look something like:
#define __REF_ARGS(_0, _1, _2, _3, _n, X...) _n
#define REF_ARGS(X...) __REF_ARGS(, ##X, 3, 2, 1, 0)
#define __REF_CONCAT(a, b) a ## b
#define REF_CONCAT(a, b) __REF_CONCAT(a, b)
#define REF_UNARY_2(func, arg1, arg2) func(arg1, arg2)
#define REF_UNARY_1(func, arg1) func(arg1, NULL)
#define REF_UNARY(func, X...) REF_CONCAT(REF_UNARY_, REF_ARGS(X))(func, X)
#define REF_BINARY_3(func, arg1, arg2, arg3) func(arg1, arg2, arg3)
#define REF_BINARY_2(func, arg1, arg2) func(arg1, arg2, NULL)
#define REF_BINARY(func, X...) REF_CONCAT(REF_BINARY_, REF_ARGS(X))(func, X)
#define refcount_add(X...) REF_BINARY(__refcount_add, X)
#define refcount_inc(X...) REF_UNARY(__refcount_inc, X)
Opinions?
On Wed, Jul 29, 2020 at 01:37:31PM +0200, [email protected] wrote:
> On Wed, Jul 29, 2020 at 01:11:20PM +0200, [email protected] wrote:
>
> > Subject: locking/refcount: Provide __refcount API to obtain the old value
> > From: Peter Zijlstra <[email protected]>
> > Date: Wed Jul 29 13:00:57 CEST 2020
> >
> > David requested means to obtain the old/previous value from the
> > refcount API for tracing purposes.
> >
> > Duplicate (most of) the API as __refcount*() with an additional
> > 'int *' argument into which, if !NULL, the old value will be stored.
> >
> > Requested-by: David Howells <[email protected]>
> > Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> > ---
> > include/linux/refcount.h | 65 +++++++++++++++++++++++++++++++++++++++++------
> > 1 file changed, 57 insertions(+), 8 deletions(-)
> >
> > --- a/include/linux/refcount.h
> > +++ b/include/linux/refcount.h
> > @@ -165,7 +165,7 @@ static inline unsigned int refcount_read
> > *
> > * Return: false if the passed refcount is 0, true otherwise
> > */
> > -static inline __must_check bool refcount_add_not_zero(int i, refcount_t *r)
> > +static inline __must_check bool __refcount_add_not_zero(int i, refcount_t *r, int *oldp)
> > {
> > int old = refcount_read(r);
> >
> > @@ -174,12 +174,20 @@ static inline __must_check bool refcount
> > break;
> > } while (!atomic_try_cmpxchg_relaxed(&r->refs, &old, old + i));
> >
> > + if (oldp)
> > + *oldp = old;
> > +
> > if (unlikely(old < 0 || old + i < 0))
> > refcount_warn_saturate(r, REFCOUNT_ADD_NOT_ZERO_OVF);
> >
> > return old;
> > }
> >
> > +static inline __must_check bool refcount_add_not_zero(int i, refcount_t *r)
> > +{
> > + return __refcount_add_not_zero(i, r, NULL);
> > +}
>
> so, I could also emulate C++'s
>
> bool refcount_add_not_zero(int i, refcount_t *r, int *oldp = NULL)
>
> style by going to town on this with a bunch of CPP magic, but I don't
> think that'll actually make things clearer.
Erg. No, I like the __-version better -- it looks more like other kernel
APIs.
--
Kees Cook
On Wed, Jul 29, 2020 at 01:11:20PM +0200, [email protected] wrote:
> Subject: locking/refcount: Provide __refcount API to obtain the old value
> From: Peter Zijlstra <[email protected]>
> Date: Wed Jul 29 13:00:57 CEST 2020
>
> David requested means to obtain the old/previous value from the
> refcount API for tracing purposes.
>
> Duplicate (most of) the API as __refcount*() with an additional
> 'int *' argument into which, if !NULL, the old value will be stored.
>
> Requested-by: David Howells <[email protected]>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
This looks good to me. Thanks for poking at it!
Reviewed-by: Kees Cook <[email protected]>
--
Kees Cook
[email protected] wrote:
> I'm not entirely sure what you mean with interpret, provided you don't
> trigger a refcount fail, the number will be just what you expect and
> would get from refcount_read(). If you do trigger a fail, you'll get a
> negative value.
That's fine. I seem to remember talk about the possibility that the number
wouldn't necessarily bottom out at zero - for instance if it was arranged such
that the overflow flag was set on an overflow or underflow so that it could be
trapped on (using INTO or TRAPV, for example).
David
On Wed, Jul 29, 2020 at 09:41:37PM +0100, David Howells wrote:
> [email protected] wrote:
>
> > I'm not entirely sure what you mean with interpret, provided you don't
> > trigger a refcount fail, the number will be just what you expect and
> > would get from refcount_read(). If you do trigger a fail, you'll get a
> > negative value.
>
> That's fine. I seem to remember talk about the possibility that the number
> wouldn't necessarily bottom out at zero - for instance if it was arranged such
> that the overflow flag was set on an overflow or underflow so that it could be
> trapped on (using INTO or TRAPV, for example).
The trap is an internal detail. The saturation value will be negative,
though.
--
Kees Cook
The following commit has been merged into the locking/core branch of tip:
Commit-ID: a435b9a14356587cf512ea6473368a579373c74c
Gitweb: https://git.kernel.org/tip/a435b9a14356587cf512ea6473368a579373c74c
Author: Peter Zijlstra <[email protected]>
AuthorDate: Wed, 29 Jul 2020 13:00:57 +02:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Wed, 26 Aug 2020 12:42:02 +02:00
locking/refcount: Provide __refcount API to obtain the old value
David requested means to obtain the old/previous value from the
refcount API for tracing purposes.
Duplicate (most of) the API as __refcount*() with an additional
'int *' argument into which, if !NULL, the old value will be stored.
Requested-by: David Howells <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
include/linux/refcount.h | 65 ++++++++++++++++++++++++++++++++++-----
1 file changed, 57 insertions(+), 8 deletions(-)
diff --git a/include/linux/refcount.h b/include/linux/refcount.h
index 0e3ee25..7fabb1a 100644
--- a/include/linux/refcount.h
+++ b/include/linux/refcount.h
@@ -165,7 +165,7 @@ static inline unsigned int refcount_read(const refcount_t *r)
*
* Return: false if the passed refcount is 0, true otherwise
*/
-static inline __must_check bool refcount_add_not_zero(int i, refcount_t *r)
+static inline __must_check bool __refcount_add_not_zero(int i, refcount_t *r, int *oldp)
{
int old = refcount_read(r);
@@ -174,12 +174,20 @@ static inline __must_check bool refcount_add_not_zero(int i, refcount_t *r)
break;
} while (!atomic_try_cmpxchg_relaxed(&r->refs, &old, old + i));
+ if (oldp)
+ *oldp = old;
+
if (unlikely(old < 0 || old + i < 0))
refcount_warn_saturate(r, REFCOUNT_ADD_NOT_ZERO_OVF);
return old;
}
+static inline __must_check bool refcount_add_not_zero(int i, refcount_t *r)
+{
+ return __refcount_add_not_zero(i, r, NULL);
+}
+
/**
* refcount_add - add a value to a refcount
* @i: the value to add to the refcount
@@ -196,16 +204,24 @@ static inline __must_check bool refcount_add_not_zero(int i, refcount_t *r)
* cases, refcount_inc(), or one of its variants, should instead be used to
* increment a reference count.
*/
-static inline void refcount_add(int i, refcount_t *r)
+static inline void __refcount_add(int i, refcount_t *r, int *oldp)
{
int old = atomic_fetch_add_relaxed(i, &r->refs);
+ if (oldp)
+ *oldp = old;
+
if (unlikely(!old))
refcount_warn_saturate(r, REFCOUNT_ADD_UAF);
else if (unlikely(old < 0 || old + i < 0))
refcount_warn_saturate(r, REFCOUNT_ADD_OVF);
}
+static inline void refcount_add(int i, refcount_t *r)
+{
+ __refcount_add(i, r, NULL);
+}
+
/**
* refcount_inc_not_zero - increment a refcount unless it is 0
* @r: the refcount to increment
@@ -219,9 +235,14 @@ static inline void refcount_add(int i, refcount_t *r)
*
* Return: true if the increment was successful, false otherwise
*/
+static inline __must_check bool __refcount_inc_not_zero(refcount_t *r, int *oldp)
+{
+ return __refcount_add_not_zero(1, r, oldp);
+}
+
static inline __must_check bool refcount_inc_not_zero(refcount_t *r)
{
- return refcount_add_not_zero(1, r);
+ return __refcount_inc_not_zero(r, NULL);
}
/**
@@ -236,9 +257,14 @@ static inline __must_check bool refcount_inc_not_zero(refcount_t *r)
* Will WARN if the refcount is 0, as this represents a possible use-after-free
* condition.
*/
+static inline void __refcount_inc(refcount_t *r, int *oldp)
+{
+ __refcount_add(1, r, oldp);
+}
+
static inline void refcount_inc(refcount_t *r)
{
- refcount_add(1, r);
+ __refcount_inc(r, NULL);
}
/**
@@ -261,10 +287,13 @@ static inline void refcount_inc(refcount_t *r)
*
* Return: true if the resulting refcount is 0, false otherwise
*/
-static inline __must_check bool refcount_sub_and_test(int i, refcount_t *r)
+static inline __must_check bool __refcount_sub_and_test(int i, refcount_t *r, int *oldp)
{
int old = atomic_fetch_sub_release(i, &r->refs);
+ if (oldp)
+ *oldp = old;
+
if (old == i) {
smp_acquire__after_ctrl_dep();
return true;
@@ -276,6 +305,11 @@ static inline __must_check bool refcount_sub_and_test(int i, refcount_t *r)
return false;
}
+static inline __must_check bool refcount_sub_and_test(int i, refcount_t *r)
+{
+ return __refcount_sub_and_test(i, r, NULL);
+}
+
/**
* refcount_dec_and_test - decrement a refcount and test if it is 0
* @r: the refcount
@@ -289,9 +323,14 @@ static inline __must_check bool refcount_sub_and_test(int i, refcount_t *r)
*
* Return: true if the resulting refcount is 0, false otherwise
*/
+static inline __must_check bool __refcount_dec_and_test(refcount_t *r, int *oldp)
+{
+ return __refcount_sub_and_test(1, r, oldp);
+}
+
static inline __must_check bool refcount_dec_and_test(refcount_t *r)
{
- return refcount_sub_and_test(1, r);
+ return __refcount_dec_and_test(r, NULL);
}
/**
@@ -304,12 +343,22 @@ static inline __must_check bool refcount_dec_and_test(refcount_t *r)
* Provides release memory ordering, such that prior loads and stores are done
* before.
*/
-static inline void refcount_dec(refcount_t *r)
+static inline void __refcount_dec(refcount_t *r, int *oldp)
{
- if (unlikely(atomic_fetch_sub_release(1, &r->refs) <= 1))
+ int old = atomic_fetch_sub_release(1, &r->refs);
+
+ if (oldp)
+ *oldp = old;
+
+ if (unlikely(old <= 1))
refcount_warn_saturate(r, REFCOUNT_DEC_LEAK);
}
+static inline void refcount_dec(refcount_t *r)
+{
+ __refcount_dec(r, NULL);
+}
+
extern __must_check bool refcount_dec_if_one(refcount_t *r);
extern __must_check bool refcount_dec_not_one(refcount_t *r);
extern __must_check bool refcount_dec_and_mutex_lock(refcount_t *r, struct mutex *lock);
On Wed, Jul 29, 2020 at 01:11:20PM +0200, [email protected] wrote:
> On Tue, Jul 28, 2020 at 11:56:38AM +0100, David Howells wrote:
> > Peter Zijlstra <[email protected]> wrote:
> >
> > > > Please do not _undo_ the changes; just add the API you need.
> > >
> > > add_return and sub_return are horrible interface for refcount, which is
> > > the problem.
> > >
> > > If you meant: refcount_dec(), but want the old value for tracing, you
> > > want a different ordering than if you wanted to do
> > > refcount_dec_and_test(); dec_return can't know this.
> > >
> > > David, would something like a __refcount_*() API work where there is a
> > > 3rd argument (int *), which, if !NULL, will be assigned the old value?
> >
> > That would be fine, though the number needs to be something I can interpret
> > easily when looking through the traces. It would also be okay if there was an
> > interpretation function that I could use in the trace point when setting the
> > variable.
>
> I'm not entirely sure what you mean with interpret, provided you don't
> trigger a refcount fail, the number will be just what you expect and
> would get from refcount_read(). If you do trigger a fail, you'll get a
> negative value.
>
> How's the below? I didn't provide __refcount versions for the external
> functions, I suppose that can be done too, but wondered if you really
> needed those.
It looks like this didn't go anywhere, but I'm supportive of the general
idea if it's useful to somebody. The only part I find a bit grotty is that
we end up introducing a bunch of redundancy in some of the new functions,
e.g.:
> @@ -219,9 +235,14 @@ static inline void refcount_add(int i, r
> *
> * Return: true if the increment was successful, false otherwise
> */
> +static inline __must_check bool __refcount_inc_not_zero(refcount_t *r, int *oldp)
> +{
> + return __refcount_add_not_zero(1, r, oldp);
> +}
Where returning both a bool to indicate whether the old value was zero
and also the old value itself is a bit OTT.
Will
Will Deacon <[email protected]> wrote:
> > @@ -219,9 +235,14 @@ static inline void refcount_add(int i, r
> > *
> > * Return: true if the increment was successful, false otherwise
> > */
> > +static inline __must_check bool __refcount_inc_not_zero(refcount_t *r, int *oldp)
> > +{
> > + return __refcount_add_not_zero(1, r, oldp);
> > +}
>
> Where returning both a bool to indicate whether the old value was zero
> and also the old value itself is a bit OTT.
Actually, with the i386 cmpxchg, that makes sense. You can use the Z flag to
give you the bool, saving on checking the old value.
David