2021-09-15 19:55:14

by Kalesh Singh

[permalink] [raw]
Subject: [PATCH 0/5] tracing: Extend histogram triggers expression parsing

The frequency of the rss_stat trace event is known to be of the same
magnitude as that of the sched_switch event on Android devices. This can
cause flooding of the trace buffer with rss_stat traces leading to a
decreased trace buffer capacity and loss of data.

If it is not necessary to monitor very small changes in rss (as is the
case in Android) then the rss_stat tracepoint can be throttled to only
emit the event once there is a large enough change in the rss size.
The original patch that introduced the rss_stat tracepoint also proposed
a fixed throttling mechanism that only emits the rss_stat event
when the rss size crosses a 512KB boundary. It was concluded that more
generic support for this type of filtering/throttling was need, so that
it can be applied to any trace event. [1]

From the discussion in [1], histogram triggers seemed the most likely
candidate to support this type of throttling. For instance to achieve the
same throttling as was proposed in [1]:

(1) Create a histogram variable to save the 512KB bucket of the rss size
(2) Use the onchange handler to generate a synthetic event when the
rss size bucket changes.

The only missing pieces to support such a hist trigger are:
(1) Support for setting a hist variable to a specific value -- to set
the bucket size / granularity.
(2) Support for division arithmetic operation -- to determine the
corresponding bucket for an rss size.

This series extends histogram trigger expressions to:
(1) Allow assigning numeric literals to hist variable (eg. x=1234)
and using literals directly in expressions (eg. x=size/1234)
(2) Support division and multiplication in hist expressions.
(eg. a=$x/$y*z); and
(3) Fixes expression parsing for non-associative operators: subtraction
and division. (eg. 8-4-2 should be 2 not 6)

The rss_stat event can then be throttled using histogram triggers as
below:

# Create a synthetic event to monitor instead of the high frequency
# rss_stat event
echo 'rss_stat_throttled unsigned int mm_id; unsigned int curr;
int member; long size' >> tracing/synthetic_events

# Create a hist trigger that emits the synthetic rss_stat_throttled
# event only when the rss size crosses a 512KB boundary.
echo 'hist:keys=common_pid:bucket=size/0x80000:onchange($bucket)
.rss_stat_throttled(mm_id,curr,member,size)'
>> events/kmem/rss_stat/trigger

------ Test Results ------
Histograms can also be used to evaluate the effectiveness of this
throttling by noting the Total Hits on each trigger:

echo 'hist:keys=common_pid' >> events/sched/sched_switch/trigger
echo 'hist:keys=common_pid' >> events/kmem/rss_stat/trigger
echo 'hist:keys=common_pid'
>> events/synthetic/rss_stat_throttled/trigger

Allowing the above example (512KB granularity) run for 5 minutes on
an arm64 device with 5.10 kernel:

sched_switch : total hits = 147153
rss_stat : total hits = 38863
rss_stat_throttled: total hits = 2409

The synthetic rss_stat_throttled event is ~16x less frequent than the
rss_stat event when using a 512KB granularity.


The results are more pronounced when rss size is changing at a higher
rate in small increments. For instance the following results were obtained
by recording the hits on the above events for a run of Android's
lmkd_unit_test [2], which continually forks processes that map anonymous
memory until there is an oom kill:

sched_switch : total hits = 148832
rss_stat : total hits = 4754802
rss_stat_throttled: total hits = 96214

In this stress this, the synthetic rss_stat_throttled event is ~50x less
frequent than the rss_stat event when using a 512KB granularity.


[1] https://lore.kernel.org/lkml/[email protected]/
[2] https://cs.android.com/android/platform/superproject/+/master:system/memory/lmkd/tests/lmkd_test.cpp

Signed-off-by: Kalesh Singh <[email protected]>

Kalesh Singh (5):
tracing: Add support for creating hist trigger variables from literal
tracing: Add division and multiplication support for hist triggers
tracing: Fix operator precedence for hist triggers expression
tracing/selftests: Add tests for hist trigger expression parsing
tracing/histogram: Document expression arithmetic and constants

Documentation/trace/histogram.rst | 14 +
kernel/trace/trace_events_hist.c | 318 +++++++++++++++---
.../testing/selftests/ftrace/test.d/functions | 4 +-
.../trigger/trigger-hist-expressions.tc | 73 ++++
4 files changed, 357 insertions(+), 52 deletions(-)
create mode 100644 tools/testing/selftests/ftrace/test.d/trigger/trigger-hist-expressions.tc


base-commit: 3ca706c189db861b2ca2019a0901b94050ca49d8
--
2.33.0.309.g3052b89438-goog


2021-09-15 19:55:44

by Kalesh Singh

[permalink] [raw]
Subject: [PATCH 1/5] tracing: Add support for creating hist trigger variables from literal

Currently hist trigger expressions don't support the use of numeric
literals:
e.g. echo 'hist:keys=common_pid:x=$y-1234'
--> is not valid expression syntax

Having the ability to use numeric constants in hist triggers supports
a wider range of expressions for creating variables.

Add support for creating trace event histogram variables from numeric
literals.

e.g. echo 'hist:keys=common_pid:x=1234,y=size-1024' >> event/trigger

A negative numeric constant is created, using unary minus operator
(parentheses are required).

e.g. echo 'hist:keys=common_pid:z=-(2)' >> event/trigger

Signed-off-by: Kalesh Singh <[email protected]>
---
kernel/trace/trace_events_hist.c | 80 +++++++++++++++++++++++++++++++-
1 file changed, 79 insertions(+), 1 deletion(-)

diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c
index a6061a69aa84..2802b211ccf3 100644
--- a/kernel/trace/trace_events_hist.c
+++ b/kernel/trace/trace_events_hist.c
@@ -66,7 +66,9 @@
C(EMPTY_SORT_FIELD, "Empty sort field"), \
C(TOO_MANY_SORT_FIELDS, "Too many sort fields (Max = 2)"), \
C(INVALID_SORT_FIELD, "Sort field must be a key or a val"), \
- C(INVALID_STR_OPERAND, "String type can not be an operand in expression"),
+ C(INVALID_STR_OPERAND, "String type can not be an operand in expression"), \
+ C(TOO_MANY_CONSTS, "Too many constants defined"), \
+ C(EXPECT_NUMBER, "Expecting numeric literal"),

#undef C
#define C(a, b) HIST_ERR_##a
@@ -89,6 +91,8 @@ typedef u64 (*hist_field_fn_t) (struct hist_field *field,
#define HIST_FIELD_OPERANDS_MAX 2
#define HIST_FIELDS_MAX (TRACING_MAP_FIELDS_MAX + TRACING_MAP_VARS_MAX)
#define HIST_ACTIONS_MAX 8
+#define HIST_CONST_MAX 4
+#define HIST_CONST_DIGITS_MAX 21

enum field_op_id {
FIELD_OP_NONE,
@@ -152,6 +156,9 @@ struct hist_field {
bool read_once;

unsigned int var_str_idx;
+
+ /* Numeric literals are represented as u64 */
+ u64 constant;
};

static u64 hist_field_none(struct hist_field *field,
@@ -163,6 +170,15 @@ static u64 hist_field_none(struct hist_field *field,
return 0;
}

+static u64 hist_field_const(struct hist_field *field,
+ struct tracing_map_elt *elt,
+ struct trace_buffer *buffer,
+ struct ring_buffer_event *rbe,
+ void *event)
+{
+ return field->constant;
+}
+
static u64 hist_field_counter(struct hist_field *field,
struct tracing_map_elt *elt,
struct trace_buffer *buffer,
@@ -341,6 +357,7 @@ enum hist_field_flags {
HIST_FIELD_FL_CPU = 1 << 15,
HIST_FIELD_FL_ALIAS = 1 << 16,
HIST_FIELD_FL_BUCKET = 1 << 17,
+ HIST_FIELD_FL_CONST = 1 << 18,
};

struct var_defs {
@@ -410,6 +427,7 @@ struct hist_trigger_data {
struct field_var *save_vars[SYNTH_FIELDS_MAX];
unsigned int n_save_vars;
unsigned int n_save_var_str;
+ unsigned int n_constants;
};

struct action_data;
@@ -1516,6 +1534,12 @@ static void expr_field_str(struct hist_field *field, char *expr)
{
if (field->flags & HIST_FIELD_FL_VAR_REF)
strcat(expr, "$");
+ else if (field->flags & HIST_FIELD_FL_CONST) {
+ char str[HIST_CONST_DIGITS_MAX];
+
+ snprintf(str, HIST_CONST_DIGITS_MAX, "%llu", field->constant);
+ strcat(expr, str);
+ }

strcat(expr, hist_field_name(field, 0));

@@ -1689,6 +1713,15 @@ static struct hist_field *create_hist_field(struct hist_trigger_data *hist_data,
goto out;
}

+ if (flags & HIST_FIELD_FL_CONST) {
+ hist_field->fn = hist_field_const;
+ hist_field->size = sizeof(u64);
+ hist_field->type = kstrdup("u64", GFP_KERNEL);
+ if (!hist_field->type)
+ goto free;
+ goto out;
+ }
+
if (flags & HIST_FIELD_FL_STACKTRACE) {
hist_field->fn = hist_field_none;
goto out;
@@ -2090,6 +2123,35 @@ static struct hist_field *create_alias(struct hist_trigger_data *hist_data,
return alias;
}

+static struct hist_field *parse_const(struct hist_trigger_data *hist_data,
+ char *str, char *var_name,
+ unsigned long *flags)
+{
+ struct trace_array *tr = hist_data->event_file->tr;
+ struct hist_field *field = NULL;
+ u64 constant;
+
+ if (hist_data->n_constants >= HIST_CONST_MAX) {
+ hist_err(tr, HIST_ERR_TOO_MANY_CONSTS, errpos(str));
+ return NULL;
+ }
+
+ if (kstrtoull(str, 0, &constant)) {
+ hist_err(tr, HIST_ERR_EXPECT_NUMBER, errpos(str));
+ return NULL;
+ }
+
+ *flags |= HIST_FIELD_FL_CONST;
+ field = create_hist_field(hist_data, NULL, *flags, var_name);
+ if (!field)
+ return NULL;
+
+ field->constant = constant;
+ hist_data->n_constants++;
+
+ return field;
+}
+
static struct hist_field *parse_atom(struct hist_trigger_data *hist_data,
struct trace_event_file *file, char *str,
unsigned long *flags, char *var_name)
@@ -2100,6 +2162,15 @@ static struct hist_field *parse_atom(struct hist_trigger_data *hist_data,
unsigned long buckets = 0;
int ret = 0;

+ if (isdigit(str[0])) {
+ hist_field = parse_const(hist_data, str, var_name, flags);
+ if (!hist_field) {
+ ret = -EINVAL;
+ goto out;
+ }
+ return hist_field;
+ }
+
s = strchr(str, '.');
if (s) {
s = strchr(++s, '.');
@@ -4950,6 +5021,8 @@ static void hist_field_debug_show_flags(struct seq_file *m,

if (flags & HIST_FIELD_FL_ALIAS)
seq_puts(m, " HIST_FIELD_FL_ALIAS\n");
+ else if (flags & HIST_FIELD_FL_CONST)
+ seq_puts(m, " HIST_FIELD_FL_CONST\n");
}

static int hist_field_debug_show(struct seq_file *m,
@@ -4971,6 +5044,9 @@ static int hist_field_debug_show(struct seq_file *m,
field->var.idx);
}

+ if (field->flags & HIST_FIELD_FL_CONST)
+ seq_printf(m, " constant: %llu\n", field->constant);
+
if (field->flags & HIST_FIELD_FL_ALIAS)
seq_printf(m, " var_ref_idx (into hist_data->var_refs[]): %u\n",
field->var_ref_idx);
@@ -5213,6 +5289,8 @@ static void hist_field_print(struct seq_file *m, struct hist_field *hist_field)

if (hist_field->flags & HIST_FIELD_FL_CPU)
seq_puts(m, "common_cpu");
+ else if (hist_field->flags & HIST_FIELD_FL_CONST)
+ seq_printf(m, "%llu", hist_field->constant);
else if (field_name) {
if (hist_field->flags & HIST_FIELD_FL_VAR_REF ||
hist_field->flags & HIST_FIELD_FL_ALIAS)
--
2.33.0.309.g3052b89438-goog

2021-09-15 19:56:20

by Kalesh Singh

[permalink] [raw]
Subject: [PATCH 3/5] tracing: Fix operator precedence for hist triggers expression

The current histogram expression evaluation logic evaluates the
expression from right to left. This can lead to incorrect results
if the operations are not associative (as is the case for subtraction
and, the now added, division operators).
e.g. 16-8-4-2 should be 2 not 10 --> 16-8-4-2 = ((16-8)-4)-2
64/8/4/2 should be 1 not 16 --> 64/8/4/2 = ((64/8)/4)/2

Division and multiplication are currently limited to single operation
expression due to operator precedence support not yet implemented.

Rework the expression parsing to support the correct evaluation of
expressions containing operators of different precedences; and fix
the associativity error by evaluating expressions with operators of
the same precedence from left to right.

Examples:
(1) echo 'hist:keys=common_pid:a=8,b=4,c=2,d=1,w=$a-$b-$c-$d' \
>> event/trigger
(2) echo 'hist:keys=common_pid:x=$a/$b/3/2' >> event/trigger
(3) echo 'hist:keys=common_pid:y=$a+10/$c*1024' >> event/trigger
(4) echo 'hist:keys=common_pid:z=$a/$b+$c*$d' >> event/trigger

Signed-off-by: Kalesh Singh <[email protected]>
---
kernel/trace/trace_events_hist.c | 210 ++++++++++++++++++++-----------
1 file changed, 140 insertions(+), 70 deletions(-)

diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c
index dc28eb6fc73a..994caa629e2c 100644
--- a/kernel/trace/trace_events_hist.c
+++ b/kernel/trace/trace_events_hist.c
@@ -68,7 +68,9 @@
C(INVALID_SORT_FIELD, "Sort field must be a key or a val"), \
C(INVALID_STR_OPERAND, "String type can not be an operand in expression"), \
C(TOO_MANY_CONSTS, "Too many constants defined"), \
- C(EXPECT_NUMBER, "Expecting numeric literal"),
+ C(EXPECT_NUMBER, "Expecting numeric literal"), \
+ C(UNARY_MINUS_SUBEXPR, "Unary minus not supported in sub-expressions"), \
+ C(SYM_OFFSET_SUBEXPR, ".sym-offset not supported in sub-expressions"),

#undef C
#define C(a, b) HIST_ERR_##a
@@ -1647,40 +1649,96 @@ static char *expr_str(struct hist_field *field, unsigned int level)
return expr;
}

-static int contains_operator(char *str)
+/*
+ * If field_op != FIELD_OP_NONE, *sep points to the root operator
+ * of the expression tree to be evaluated.
+ */
+static int contains_operator(char *str, char **sep)
{
enum field_op_id field_op = FIELD_OP_NONE;
- char *op;
+ char *minus_op, *plus_op, *div_op, *mult_op;
+
+
+ /*
+ * Report the last occurrence of the operators first, so that the
+ * expression is evaluated left to right. This is important since
+ * subtraction and division are not associative.
+ *
+ * e.g
+ * 64/8/4/2 is 1, i.e 64/8/4/2 = ((64/8)/4)/2
+ * 14-7-5-2 is 0, i.e 14-7-5-2 = ((14-7)-5)-2
+ */

- op = strpbrk(str, "+-/*");
- if (!op)
- return FIELD_OP_NONE;
+ /*
+ * First, find lower precedence addition and subtraction
+ * since the expression will be evaluated recursively.
+ */
+ minus_op = strrchr(str, '-');
+ if (minus_op) {
+ /* Unfortunately, the modifier ".sym-offset" can confuse things. */
+ if (minus_op - str >= 4 && !strncmp(minus_op - 4, ".sym-offset", 11))
+ goto out;

- switch (*op) {
- case '-':
/*
- * Unfortunately, the modifier ".sym-offset"
- * can confuse things.
+ * Unary minus is not supported in sub-expressions. If
+ * present, it is always the next root operator.
*/
- if (op - str >= 4 && !strncmp(op - 4, ".sym-offset", 11))
- return FIELD_OP_NONE;
-
- if (*str == '-')
+ if (minus_op == str) {
field_op = FIELD_OP_UNARY_MINUS;
- else
- field_op = FIELD_OP_MINUS;
- break;
- case '+':
- field_op = FIELD_OP_PLUS;
- break;
- case '/':
+ goto out;
+ }
+
+ field_op = FIELD_OP_MINUS;
+ }
+
+ plus_op = strrchr(str, '+');
+ if (plus_op || minus_op) {
+ /*
+ * For operators of the same precedence use to rightmost as the
+ * root, so that the expression is evaluated left to right.
+ */
+ if (plus_op > minus_op)
+ field_op = FIELD_OP_PLUS;
+ goto out;
+ }
+
+ /*
+ * Multiplication and division have higher precedence than addition and
+ * subtraction.
+ */
+ div_op = strrchr(str, '/');
+ if (div_op)
field_op = FIELD_OP_DIV;
- break;
- case '*':
+
+ mult_op = strrchr(str, '*');
+ /*
+ * For operators of the same precedence use to rightmost as the
+ * root, so that the expression is evaluated left to right.
+ */
+ if (mult_op > div_op)
field_op = FIELD_OP_MULT;
- break;
- default:
- break;
+
+out:
+ if (sep) {
+ switch (field_op) {
+ case FIELD_OP_UNARY_MINUS:
+ case FIELD_OP_MINUS:
+ *sep = minus_op;
+ break;
+ case FIELD_OP_PLUS:
+ *sep = plus_op;
+ break;
+ case FIELD_OP_DIV:
+ *sep = div_op;
+ break;
+ case FIELD_OP_MULT:
+ *sep = mult_op;
+ break;
+ case FIELD_OP_NONE:
+ default:
+ *sep = NULL;
+ break;
+ }
}

return field_op;
@@ -2006,7 +2064,7 @@ static char *field_name_from_var(struct hist_trigger_data *hist_data,

if (strcmp(var_name, name) == 0) {
field = hist_data->attrs->var_defs.expr[i];
- if (contains_operator(field) || is_var_ref(field))
+ if (contains_operator(field, NULL) || is_var_ref(field))
continue;
return field;
}
@@ -2275,21 +2333,24 @@ static struct hist_field *parse_atom(struct hist_trigger_data *hist_data,
static struct hist_field *parse_expr(struct hist_trigger_data *hist_data,
struct trace_event_file *file,
char *str, unsigned long flags,
- char *var_name, unsigned int level);
+ char *var_name, unsigned int *n_subexprs);

static struct hist_field *parse_unary(struct hist_trigger_data *hist_data,
struct trace_event_file *file,
char *str, unsigned long flags,
- char *var_name, unsigned int level)
+ char *var_name, unsigned int *n_subexprs)
{
struct hist_field *operand1, *expr = NULL;
unsigned long operand_flags;
int ret = 0;
char *s;

+ /* Unary minus operator, increment n_subexprs */
+ ++*n_subexprs;
+
/* we support only -(xxx) i.e. explicit parens required */

- if (level > 3) {
+ if (*n_subexprs > 3) {
hist_err(file->tr, HIST_ERR_TOO_MANY_SUBEXPR, errpos(str));
ret = -EINVAL;
goto free;
@@ -2306,8 +2367,16 @@ static struct hist_field *parse_unary(struct hist_trigger_data *hist_data,
}

s = strrchr(str, ')');
- if (s)
+ if (s) {
+ /* unary minus not supported in sub-expressions */
+ if (*(s+1) != '\0') {
+ hist_err(file->tr, HIST_ERR_UNARY_MINUS_SUBEXPR,
+ errpos(str));
+ ret = -EINVAL;
+ goto free;
+ }
*s = '\0';
+ }
else {
ret = -EINVAL; /* no closing ')' */
goto free;
@@ -2321,7 +2390,7 @@ static struct hist_field *parse_unary(struct hist_trigger_data *hist_data,
}

operand_flags = 0;
- operand1 = parse_expr(hist_data, file, str, operand_flags, NULL, ++level);
+ operand1 = parse_expr(hist_data, file, str, operand_flags, NULL, n_subexprs);
if (IS_ERR(operand1)) {
ret = PTR_ERR(operand1);
goto free;
@@ -2391,60 +2460,61 @@ static int check_expr_operands(struct trace_array *tr,
static struct hist_field *parse_expr(struct hist_trigger_data *hist_data,
struct trace_event_file *file,
char *str, unsigned long flags,
- char *var_name, unsigned int level)
+ char *var_name, unsigned int *n_subexprs)
{
struct hist_field *operand1 = NULL, *operand2 = NULL, *expr = NULL;
unsigned long operand_flags;
int field_op, ret = -EINVAL;
char *sep, *operand1_str;

- if (level > 3) {
+ if (*n_subexprs > 3) {
hist_err(file->tr, HIST_ERR_TOO_MANY_SUBEXPR, errpos(str));
return ERR_PTR(-EINVAL);
}

- field_op = contains_operator(str);
+ /*
+ * ".sym-offset" in expressions has no effect on their evaluation,
+ * but can confuse operator parsing.
+ */
+ if (*n_subexprs == 0) {
+ sep = strstr(str, ".sym-offset");
+ if (sep) {
+ *sep = '\0';
+ if (strpbrk(str, "+-/*") || strpbrk(sep + 11, "+-/*")) {
+ *sep = '.';
+ hist_err(file->tr, HIST_ERR_SYM_OFFSET_SUBEXPR,
+ errpos(sep));
+ return ERR_PTR(-EINVAL);
+ }
+ *sep = '.';
+ }
+ }
+
+ field_op = contains_operator(str, &sep);

if (field_op == FIELD_OP_NONE)
return parse_atom(hist_data, file, str, &flags, var_name);

if (field_op == FIELD_OP_UNARY_MINUS)
- return parse_unary(hist_data, file, str, flags, var_name, ++level);
+ return parse_unary(hist_data, file, str, flags, var_name, n_subexprs);

- switch (field_op) {
- case FIELD_OP_MINUS:
- sep = "-";
- break;
- case FIELD_OP_PLUS:
- sep = "+";
- break;
- case FIELD_OP_DIV:
- sep = "/";
- break;
- case FIELD_OP_MULT:
- sep = "*";
- break;
- default:
- goto free;
- }
+ /* Binary operator found, increment n_subexprs */
+ ++*n_subexprs;

- /*
- * Multiplication and division are only supported in single operator
- * expressions, since the expression is always evaluated from right
- * to left.
- */
- if ((field_op == FIELD_OP_DIV || field_op == FIELD_OP_MULT) && level > 0) {
- hist_err(file->tr, HIST_ERR_TOO_MANY_SUBEXPR, errpos(str));
- return ERR_PTR(-EINVAL);
- }
+ /* Split the expression string at the root operator */
+ if (!sep)
+ goto free;
+ *sep = '\0';
+ operand1_str = str;
+ str = sep+1;

- operand1_str = strsep(&str, sep);
if (!operand1_str || !str)
goto free;

operand_flags = 0;
- operand1 = parse_atom(hist_data, file, operand1_str,
- &operand_flags, NULL);
+
+ /* LHS of string is an expression e.g. a+b in a+b+c */
+ operand1 = parse_expr(hist_data, file, operand1_str, operand_flags, NULL, n_subexprs);
if (IS_ERR(operand1)) {
ret = PTR_ERR(operand1);
operand1 = NULL;
@@ -2456,9 +2526,9 @@ static struct hist_field *parse_expr(struct hist_trigger_data *hist_data,
goto free;
}

- /* rest of string could be another expression e.g. b+c in a+b+c */
+ /* RHS of string is another expression e.g. c in a+b+c */
operand_flags = 0;
- operand2 = parse_expr(hist_data, file, str, operand_flags, NULL, ++level);
+ operand2 = parse_expr(hist_data, file, str, operand_flags, NULL, n_subexprs);
if (IS_ERR(operand2)) {
ret = PTR_ERR(operand2);
operand2 = NULL;
@@ -3892,9 +3962,9 @@ static int __create_val_field(struct hist_trigger_data *hist_data,
unsigned long flags)
{
struct hist_field *hist_field;
- int ret = 0;
+ int ret = 0, n_subexprs = 0;

- hist_field = parse_expr(hist_data, file, field_str, flags, var_name, 0);
+ hist_field = parse_expr(hist_data, file, field_str, flags, var_name, &n_subexprs);
if (IS_ERR(hist_field)) {
ret = PTR_ERR(hist_field);
goto out;
@@ -4035,7 +4105,7 @@ static int create_key_field(struct hist_trigger_data *hist_data,
struct hist_field *hist_field = NULL;
unsigned long flags = 0;
unsigned int key_size;
- int ret = 0;
+ int ret = 0, n_subexprs = 0;

if (WARN_ON(key_idx >= HIST_FIELDS_MAX))
return -EINVAL;
@@ -4048,7 +4118,7 @@ static int create_key_field(struct hist_trigger_data *hist_data,
hist_field = create_hist_field(hist_data, NULL, flags, NULL);
} else {
hist_field = parse_expr(hist_data, file, field_str, flags,
- NULL, 0);
+ NULL, &n_subexprs);
if (IS_ERR(hist_field)) {
ret = PTR_ERR(hist_field);
goto out;
--
2.33.0.309.g3052b89438-goog

2021-09-15 19:56:45

by Kalesh Singh

[permalink] [raw]
Subject: [PATCH 2/5] tracing: Add division and multiplication support for hist triggers

Adds basic support for division and multiplication operations for
hist trigger variable expressions.

For simplicity this patch only supports, division and multiplication
for a single operation expression (e.g. x=$a/$b), as currently
expressions are always evaluated right to left. This can lead to some
incorrect results:

e.g. echo 'hist:keys=common_pid:x=8-4-2' >> event/trigger

8-4-2 should evaluate to 2 i.e. (8-4)-2
but currently x evaluate to 6 i.e. 8-(4-2)

Multiplication and division in sub-expressions will work correctly, once
correct operator precedence support is added (See next patch in this
series).

For the undefined case of division by 0, the histogram expression
evaluates to (u64)(-1). Since this cannot be detected when the
expression is created, it is the responsibility of the user to be
aware and account for this possibility.

Examples:
echo 'hist:keys=common_pid:a=8,b=4,x=$a/$b' \
>> event/trigger

echo 'hist:keys=common_pid:x=5*$b' \
>> event/trigger

Signed-off-by: Kalesh Singh <[email protected]>
---
kernel/trace/trace_events_hist.c | 72 +++++++++++++++++++++++++++++++-
1 file changed, 71 insertions(+), 1 deletion(-)

diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c
index 2802b211ccf3..dc28eb6fc73a 100644
--- a/kernel/trace/trace_events_hist.c
+++ b/kernel/trace/trace_events_hist.c
@@ -99,6 +99,8 @@ enum field_op_id {
FIELD_OP_PLUS,
FIELD_OP_MINUS,
FIELD_OP_UNARY_MINUS,
+ FIELD_OP_DIV,
+ FIELD_OP_MULT,
};

/*
@@ -287,6 +289,40 @@ static u64 hist_field_minus(struct hist_field *hist_field,
return val1 - val2;
}

+static u64 hist_field_div(struct hist_field *hist_field,
+ struct tracing_map_elt *elt,
+ struct trace_buffer *buffer,
+ struct ring_buffer_event *rbe,
+ void *event)
+{
+ struct hist_field *operand1 = hist_field->operands[0];
+ struct hist_field *operand2 = hist_field->operands[1];
+
+ u64 val1 = operand1->fn(operand1, elt, buffer, rbe, event);
+ u64 val2 = operand2->fn(operand2, elt, buffer, rbe, event);
+
+ /* Return -1 for the undefined case */
+ if (!val2)
+ return -1;
+
+ return val1 / val2;
+}
+
+static u64 hist_field_mult(struct hist_field *hist_field,
+ struct tracing_map_elt *elt,
+ struct trace_buffer *buffer,
+ struct ring_buffer_event *rbe,
+ void *event)
+{
+ struct hist_field *operand1 = hist_field->operands[0];
+ struct hist_field *operand2 = hist_field->operands[1];
+
+ u64 val1 = operand1->fn(operand1, elt, buffer, rbe, event);
+ u64 val2 = operand2->fn(operand2, elt, buffer, rbe, event);
+
+ return val1 * val2;
+}
+
static u64 hist_field_unary_minus(struct hist_field *hist_field,
struct tracing_map_elt *elt,
struct trace_buffer *buffer,
@@ -1595,6 +1631,12 @@ static char *expr_str(struct hist_field *field, unsigned int level)
case FIELD_OP_PLUS:
strcat(expr, "+");
break;
+ case FIELD_OP_DIV:
+ strcat(expr, "/");
+ break;
+ case FIELD_OP_MULT:
+ strcat(expr, "*");
+ break;
default:
kfree(expr);
return NULL;
@@ -1610,7 +1652,7 @@ static int contains_operator(char *str)
enum field_op_id field_op = FIELD_OP_NONE;
char *op;

- op = strpbrk(str, "+-");
+ op = strpbrk(str, "+-/*");
if (!op)
return FIELD_OP_NONE;

@@ -1631,6 +1673,12 @@ static int contains_operator(char *str)
case '+':
field_op = FIELD_OP_PLUS;
break;
+ case '/':
+ field_op = FIELD_OP_DIV;
+ break;
+ case '*':
+ field_op = FIELD_OP_MULT;
+ break;
default:
break;
}
@@ -2370,10 +2418,26 @@ static struct hist_field *parse_expr(struct hist_trigger_data *hist_data,
case FIELD_OP_PLUS:
sep = "+";
break;
+ case FIELD_OP_DIV:
+ sep = "/";
+ break;
+ case FIELD_OP_MULT:
+ sep = "*";
+ break;
default:
goto free;
}

+ /*
+ * Multiplication and division are only supported in single operator
+ * expressions, since the expression is always evaluated from right
+ * to left.
+ */
+ if ((field_op == FIELD_OP_DIV || field_op == FIELD_OP_MULT) && level > 0) {
+ hist_err(file->tr, HIST_ERR_TOO_MANY_SUBEXPR, errpos(str));
+ return ERR_PTR(-EINVAL);
+ }
+
operand1_str = strsep(&str, sep);
if (!operand1_str || !str)
goto free;
@@ -2445,6 +2509,12 @@ static struct hist_field *parse_expr(struct hist_trigger_data *hist_data,
case FIELD_OP_PLUS:
expr->fn = hist_field_plus;
break;
+ case FIELD_OP_DIV:
+ expr->fn = hist_field_div;
+ break;
+ case FIELD_OP_MULT:
+ expr->fn = hist_field_mult;
+ break;
default:
ret = -EINVAL;
goto free;
--
2.33.0.309.g3052b89438-goog

2021-09-15 19:58:11

by Kalesh Singh

[permalink] [raw]
Subject: [PATCH 5/5] tracing/histogram: Document expression arithmetic and constants

Histogram expressions now support division, and multiplication in
addition to the already supported subtraction and addition operators.

Numeric constants can also be used in a hist trigger expressions
or assigned to a variable and used by refernce in an expression.

Signed-off-by: Kalesh Singh <[email protected]>
---
Documentation/trace/histogram.rst | 14 ++++++++++++++
1 file changed, 14 insertions(+)

diff --git a/Documentation/trace/histogram.rst b/Documentation/trace/histogram.rst
index 533415644c54..e12699abaee8 100644
--- a/Documentation/trace/histogram.rst
+++ b/Documentation/trace/histogram.rst
@@ -1763,6 +1763,20 @@ using the same key and variable from yet another event::

# echo 'hist:key=pid:wakeupswitch_lat=$wakeup_lat+$switchtime_lat ...' >> event3/trigger

+Expressions support the use of addition, subtraction, multiplication and
+division operators (+-*/).
+
+Note that division by zero always returns -1.
+
+Numeric constants can also be used directly in an expression::
+
+ # echo 'hist:keys=next_pid:timestamp_secs=common_timestamp/1000000 ...' >> event/trigger
+
+or assigned to a variable and referenced in a subsequent expression::
+
+ # echo 'hist:keys=next_pid:us_per_sec=1000000 ...' >> event/trigger
+ # echo 'hist:keys=next_pid:timestamp_secs=common_timestamp/$us_per_sec ...' >> event/trigger
+
2.2.2 Synthetic Events
----------------------

--
2.33.0.309.g3052b89438-goog

2021-09-15 19:59:30

by Kalesh Singh

[permalink] [raw]
Subject: [PATCH 4/5] tracing/selftests: Add tests for hist trigger expression parsing

Add tests for the parsing of hist trigger expressions; and to
validate expression evaluation.

Signed-off-by: Kalesh Singh <[email protected]>
---
.../testing/selftests/ftrace/test.d/functions | 4 +-
.../trigger/trigger-hist-expressions.tc | 73 +++++++++++++++++++
2 files changed, 75 insertions(+), 2 deletions(-)
create mode 100644 tools/testing/selftests/ftrace/test.d/trigger/trigger-hist-expressions.tc

diff --git a/tools/testing/selftests/ftrace/test.d/functions b/tools/testing/selftests/ftrace/test.d/functions
index 000fd05e84b1..1855a63559ad 100644
--- a/tools/testing/selftests/ftrace/test.d/functions
+++ b/tools/testing/selftests/ftrace/test.d/functions
@@ -16,13 +16,13 @@ reset_tracer() { # reset the current tracer

reset_trigger_file() {
# remove action triggers first
- grep -H ':on[^:]*(' $@ |
+ grep -H ':on[^:]*(' $@ | tac |
while read line; do
cmd=`echo $line | cut -f2- -d: | cut -f1 -d"["`
file=`echo $line | cut -f1 -d:`
echo "!$cmd" >> $file
done
- grep -Hv ^# $@ |
+ grep -Hv ^# $@ | tac |
while read line; do
cmd=`echo $line | cut -f2- -d: | cut -f1 -d"["`
file=`echo $line | cut -f1 -d:`
diff --git a/tools/testing/selftests/ftrace/test.d/trigger/trigger-hist-expressions.tc b/tools/testing/selftests/ftrace/test.d/trigger/trigger-hist-expressions.tc
new file mode 100644
index 000000000000..0e2361cd334f
--- /dev/null
+++ b/tools/testing/selftests/ftrace/test.d/trigger/trigger-hist-expressions.tc
@@ -0,0 +1,73 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# description: event trigger - test histogram expression parsing
+# requires: set_event events/sched/sched_process_fork/trigger events/sched/sched_process_fork/hist error_log
+
+
+fail() { #msg
+ echo $1
+ exit_fail
+}
+
+get_hist_var() { #var_name hist_path
+ hist_output=`grep -m1 "$1: " $2`
+ hitcount=`echo $hist_output | awk '{ for (i=1; i<=NF; ++i) { if ($i ~ "hitcount:") print $(i+1)} }'`
+ var_sum=`echo $hist_output | awk '{ for (i=1; i<=NF; ++i) { if ($i ~ "'$1':") print $(i+1)} }'`
+ var_val=$(( var_sum / hitcount ))
+ echo $var_val
+}
+
+test_hist_expr() { # test_name expression expected_val
+ echo "Test hist trigger expressions - $1"
+
+ echo "hist:keys=common_pid:x=$2" > events/sched/sched_process_fork/trigger
+ echo 'hist:keys=common_pid:vals=$x' >> events/sched/sched_process_fork/trigger
+ for i in `seq 1 10` ; do ( echo "forked" > /dev/null); done
+
+ actual=`get_hist_var x events/sched/sched_process_fork/hist`
+
+ if [ $actual != $3 ]; then
+ fail "Failed hist trigger expression evaluation: Expression: $2 Expected: $3, Actual: $actual"
+ fi
+
+ reset_trigger
+}
+
+check_error() { # test_name command-with-error-pos-by-^
+ echo "Test hist trigger expressions - $1"
+ ftrace_errlog_check 'hist:sched:sched_process_fork' "$2" 'events/sched/sched_process_fork/trigger'
+ reset_trigger
+}
+
+test_hist_expr "Variable assignment" "123" "123"
+
+test_hist_expr "Subtraction not associative" "16-8-4-2" "2"
+
+test_hist_expr "Division not associative" "64/8/4/2" "1"
+
+test_hist_expr "Same precedence operators (+,-) evaluated left to right" "16-8+4+2" "14"
+
+test_hist_expr "Same precedence operators (*,/) evaluated left to right" "4*3/2*2" "12"
+
+test_hist_expr "Multiplication evaluated before addition/subtraction" "4+3*2-2" "8"
+
+test_hist_expr "Division evaluated before addition/subtraction" "4+6/2-2" "5"
+
+# Division by zero returns -1
+test_hist_expr "Handles division by zero" "3/0" "-1"
+
+# err pos is where the last detected sub-express was detected
+# this can vary depending on how the expression tree generated
+check_error "Too many subexpressions" 'hist:keys=common_pid:x=32+^10*3/20-4'
+check_error "Too many subexpressions" 'hist:keys=common_pid:x=^1+2+3+4+5'
+
+check_error "Unary minus not supported in subexpression" 'hist:keys=common_pid:x=-(^1)+2'
+
+check_error "'.sym-offset' modifier not supported in subexpression" \
+ 'hist:keys=common_pid:x=1+common_pid^.sym-offset'
+check_error "'.sym-offset' modifier not supported in subexpression" \
+ 'hist:keys=common_pid:x=common_pid^.sym-offset-2'
+check_error "'.sym-offset' modifier not supported in subexpression" \
+ 'hist:keys=common_pid:x=4/common_pid^.sym-offset*3'
+
+exit 0
--
2.33.0.309.g3052b89438-goog

2021-09-15 20:13:54

by Kalesh Singh

[permalink] [raw]
Subject: Re: [PATCH 0/5] tracing: Extend histogram triggers expression parsing

On Wed, Sep 15, 2021 at 12:53 PM Kalesh Singh <[email protected]> wrote:
>
> The frequency of the rss_stat trace event is known to be of the same
> magnitude as that of the sched_switch event on Android devices. This can
> cause flooding of the trace buffer with rss_stat traces leading to a
> decreased trace buffer capacity and loss of data.
>
> If it is not necessary to monitor very small changes in rss (as is the
> case in Android) then the rss_stat tracepoint can be throttled to only
> emit the event once there is a large enough change in the rss size.
> The original patch that introduced the rss_stat tracepoint also proposed
> a fixed throttling mechanism that only emits the rss_stat event
> when the rss size crosses a 512KB boundary. It was concluded that more
> generic support for this type of filtering/throttling was need, so that
> it can be applied to any trace event. [1]
>
> From the discussion in [1], histogram triggers seemed the most likely
> candidate to support this type of throttling. For instance to achieve the
> same throttling as was proposed in [1]:
>
> (1) Create a histogram variable to save the 512KB bucket of the rss size
> (2) Use the onchange handler to generate a synthetic event when the
> rss size bucket changes.
>
> The only missing pieces to support such a hist trigger are:
> (1) Support for setting a hist variable to a specific value -- to set
> the bucket size / granularity.
> (2) Support for division arithmetic operation -- to determine the
> corresponding bucket for an rss size.
>
> This series extends histogram trigger expressions to:
> (1) Allow assigning numeric literals to hist variable (eg. x=1234)
> and using literals directly in expressions (eg. x=size/1234)
> (2) Support division and multiplication in hist expressions.
> (eg. a=$x/$y*z); and
> (3) Fixes expression parsing for non-associative operators: subtraction
> and division. (eg. 8-4-2 should be 2 not 6)
>
> The rss_stat event can then be throttled using histogram triggers as
> below:
>
> # Create a synthetic event to monitor instead of the high frequency
> # rss_stat event
> echo 'rss_stat_throttled unsigned int mm_id; unsigned int curr;
> int member; long size' >> tracing/synthetic_events
>
> # Create a hist trigger that emits the synthetic rss_stat_throttled
> # event only when the rss size crosses a 512KB boundary.
> echo 'hist:keys=common_pid:bucket=size/0x80000:onchange($bucket)
> .rss_stat_throttled(mm_id,curr,member,size)'
> >> events/kmem/rss_stat/trigger
>

Sorry, I have a clerical mistake here. The above key should be:
s/keys=common_pid/keys=keys=mm_id,member

The rss size is specific to the mm struct's member not the pid.
The results below were captured with the correct key so no changes there.

> ------ Test Results ------
> Histograms can also be used to evaluate the effectiveness of this
> throttling by noting the Total Hits on each trigger:
>
> echo 'hist:keys=common_pid' >> events/sched/sched_switch/trigger
> echo 'hist:keys=common_pid' >> events/kmem/rss_stat/trigger
> echo 'hist:keys=common_pid'
> >> events/synthetic/rss_stat_throttled/trigger
>
> Allowing the above example (512KB granularity) run for 5 minutes on
> an arm64 device with 5.10 kernel:
>
> sched_switch : total hits = 147153
> rss_stat : total hits = 38863
> rss_stat_throttled: total hits = 2409
>
> The synthetic rss_stat_throttled event is ~16x less frequent than the
> rss_stat event when using a 512KB granularity.
>
>
> The results are more pronounced when rss size is changing at a higher
> rate in small increments. For instance the following results were obtained
> by recording the hits on the above events for a run of Android's
> lmkd_unit_test [2], which continually forks processes that map anonymous
> memory until there is an oom kill:
>
> sched_switch : total hits = 148832
> rss_stat : total hits = 4754802
> rss_stat_throttled: total hits = 96214
>
> In this stress this, the synthetic rss_stat_throttled event is ~50x less
> frequent than the rss_stat event when using a 512KB granularity.
>
>
> [1] https://lore.kernel.org/lkml/[email protected]/
> [2] https://cs.android.com/android/platform/superproject/+/master:system/memory/lmkd/tests/lmkd_test.cpp
>
> Signed-off-by: Kalesh Singh <[email protected]>
>
> Kalesh Singh (5):
> tracing: Add support for creating hist trigger variables from literal
> tracing: Add division and multiplication support for hist triggers
> tracing: Fix operator precedence for hist triggers expression
> tracing/selftests: Add tests for hist trigger expression parsing
> tracing/histogram: Document expression arithmetic and constants
>
> Documentation/trace/histogram.rst | 14 +
> kernel/trace/trace_events_hist.c | 318 +++++++++++++++---
> .../testing/selftests/ftrace/test.d/functions | 4 +-
> .../trigger/trigger-hist-expressions.tc | 73 ++++
> 4 files changed, 357 insertions(+), 52 deletions(-)
> create mode 100644 tools/testing/selftests/ftrace/test.d/trigger/trigger-hist-expressions.tc
>
>
> base-commit: 3ca706c189db861b2ca2019a0901b94050ca49d8
> --
> 2.33.0.309.g3052b89438-goog
>

2021-09-30 23:00:42

by Namhyung Kim

[permalink] [raw]
Subject: Re: [PATCH 0/5] tracing: Extend histogram triggers expression parsing

Hi Kalesh,

On Wed, Sep 15, 2021 at 1:09 PM Kalesh Singh <[email protected]> wrote:
>
> On Wed, Sep 15, 2021 at 12:53 PM Kalesh Singh <[email protected]> wrote:
> >
> > The frequency of the rss_stat trace event is known to be of the same
> > magnitude as that of the sched_switch event on Android devices. This can
> > cause flooding of the trace buffer with rss_stat traces leading to a
> > decreased trace buffer capacity and loss of data.
> >
> > If it is not necessary to monitor very small changes in rss (as is the
> > case in Android) then the rss_stat tracepoint can be throttled to only
> > emit the event once there is a large enough change in the rss size.
> > The original patch that introduced the rss_stat tracepoint also proposed
> > a fixed throttling mechanism that only emits the rss_stat event
> > when the rss size crosses a 512KB boundary. It was concluded that more
> > generic support for this type of filtering/throttling was need, so that
> > it can be applied to any trace event. [1]
> >
> > From the discussion in [1], histogram triggers seemed the most likely
> > candidate to support this type of throttling. For instance to achieve the
> > same throttling as was proposed in [1]:
> >
> > (1) Create a histogram variable to save the 512KB bucket of the rss size
> > (2) Use the onchange handler to generate a synthetic event when the
> > rss size bucket changes.
> >
> > The only missing pieces to support such a hist trigger are:
> > (1) Support for setting a hist variable to a specific value -- to set
> > the bucket size / granularity.
> > (2) Support for division arithmetic operation -- to determine the
> > corresponding bucket for an rss size.
> >
> > This series extends histogram trigger expressions to:
> > (1) Allow assigning numeric literals to hist variable (eg. x=1234)
> > and using literals directly in expressions (eg. x=size/1234)
> > (2) Support division and multiplication in hist expressions.
> > (eg. a=$x/$y*z); and
> > (3) Fixes expression parsing for non-associative operators: subtraction
> > and division. (eg. 8-4-2 should be 2 not 6)
> >
> > The rss_stat event can then be throttled using histogram triggers as
> > below:
> >
> > # Create a synthetic event to monitor instead of the high frequency
> > # rss_stat event
> > echo 'rss_stat_throttled unsigned int mm_id; unsigned int curr;
> > int member; long size' >> tracing/synthetic_events
> >
> > # Create a hist trigger that emits the synthetic rss_stat_throttled
> > # event only when the rss size crosses a 512KB boundary.
> > echo 'hist:keys=common_pid:bucket=size/0x80000:onchange($bucket)
> > .rss_stat_throttled(mm_id,curr,member,size)'
> > >> events/kmem/rss_stat/trigger
> >
>
> Sorry, I have a clerical mistake here. The above key should be:
> s/keys=common_pid/keys=keys=mm_id,member
>
> The rss size is specific to the mm struct's member not the pid.
> The results below were captured with the correct key so no changes there.
>
> > ------ Test Results ------
> > Histograms can also be used to evaluate the effectiveness of this
> > throttling by noting the Total Hits on each trigger:
> >
> > echo 'hist:keys=common_pid' >> events/sched/sched_switch/trigger
> > echo 'hist:keys=common_pid' >> events/kmem/rss_stat/trigger
> > echo 'hist:keys=common_pid'
> > >> events/synthetic/rss_stat_throttled/trigger
> >
> > Allowing the above example (512KB granularity) run for 5 minutes on
> > an arm64 device with 5.10 kernel:
> >
> > sched_switch : total hits = 147153
> > rss_stat : total hits = 38863
> > rss_stat_throttled: total hits = 2409
> >
> > The synthetic rss_stat_throttled event is ~16x less frequent than the
> > rss_stat event when using a 512KB granularity.
> >
> >
> > The results are more pronounced when rss size is changing at a higher
> > rate in small increments. For instance the following results were obtained
> > by recording the hits on the above events for a run of Android's
> > lmkd_unit_test [2], which continually forks processes that map anonymous
> > memory until there is an oom kill:
> >
> > sched_switch : total hits = 148832
> > rss_stat : total hits = 4754802
> > rss_stat_throttled: total hits = 96214
> >
> > In this stress this, the synthetic rss_stat_throttled event is ~50x less
> > frequent than the rss_stat event when using a 512KB granularity.
> >
> >
> > [1] https://lore.kernel.org/lkml/[email protected]/
> > [2] https://cs.android.com/android/platform/superproject/+/master:system/memory/lmkd/tests/lmkd_test.cpp
> >
> > Signed-off-by: Kalesh Singh <[email protected]>

Reviewed-by: Namhyung Kim <[email protected]>

Thanks,
Namhyung


> >
> > Kalesh Singh (5):
> > tracing: Add support for creating hist trigger variables from literal
> > tracing: Add division and multiplication support for hist triggers
> > tracing: Fix operator precedence for hist triggers expression
> > tracing/selftests: Add tests for hist trigger expression parsing
> > tracing/histogram: Document expression arithmetic and constants
> >
> > Documentation/trace/histogram.rst | 14 +
> > kernel/trace/trace_events_hist.c | 318 +++++++++++++++---
> > .../testing/selftests/ftrace/test.d/functions | 4 +-
> > .../trigger/trigger-hist-expressions.tc | 73 ++++
> > 4 files changed, 357 insertions(+), 52 deletions(-)
> > create mode 100644 tools/testing/selftests/ftrace/test.d/trigger/trigger-hist-expressions.tc
> >
> >
> > base-commit: 3ca706c189db861b2ca2019a0901b94050ca49d8
> > --
> > 2.33.0.309.g3052b89438-goog
> >

2021-10-02 00:57:50

by Kalesh Singh

[permalink] [raw]
Subject: Re: [PATCH 0/5] tracing: Extend histogram triggers expression parsing

On Thu, Sep 30, 2021 at 3:58 PM Namhyung Kim <[email protected]> wrote:
>
> Hi Kalesh,
>
> On Wed, Sep 15, 2021 at 1:09 PM Kalesh Singh <[email protected]> wrote:
> >
> > On Wed, Sep 15, 2021 at 12:53 PM Kalesh Singh <[email protected]> wrote:
> > >
> > > The frequency of the rss_stat trace event is known to be of the same
> > > magnitude as that of the sched_switch event on Android devices. This can
> > > cause flooding of the trace buffer with rss_stat traces leading to a
> > > decreased trace buffer capacity and loss of data.
> > >
> > > If it is not necessary to monitor very small changes in rss (as is the
> > > case in Android) then the rss_stat tracepoint can be throttled to only
> > > emit the event once there is a large enough change in the rss size.
> > > The original patch that introduced the rss_stat tracepoint also proposed
> > > a fixed throttling mechanism that only emits the rss_stat event
> > > when the rss size crosses a 512KB boundary. It was concluded that more
> > > generic support for this type of filtering/throttling was need, so that
> > > it can be applied to any trace event. [1]
> > >
> > > From the discussion in [1], histogram triggers seemed the most likely
> > > candidate to support this type of throttling. For instance to achieve the
> > > same throttling as was proposed in [1]:
> > >
> > > (1) Create a histogram variable to save the 512KB bucket of the rss size
> > > (2) Use the onchange handler to generate a synthetic event when the
> > > rss size bucket changes.
> > >
> > > The only missing pieces to support such a hist trigger are:
> > > (1) Support for setting a hist variable to a specific value -- to set
> > > the bucket size / granularity.
> > > (2) Support for division arithmetic operation -- to determine the
> > > corresponding bucket for an rss size.
> > >
> > > This series extends histogram trigger expressions to:
> > > (1) Allow assigning numeric literals to hist variable (eg. x=1234)
> > > and using literals directly in expressions (eg. x=size/1234)
> > > (2) Support division and multiplication in hist expressions.
> > > (eg. a=$x/$y*z); and
> > > (3) Fixes expression parsing for non-associative operators: subtraction
> > > and division. (eg. 8-4-2 should be 2 not 6)
> > >
> > > The rss_stat event can then be throttled using histogram triggers as
> > > below:
> > >
> > > # Create a synthetic event to monitor instead of the high frequency
> > > # rss_stat event
> > > echo 'rss_stat_throttled unsigned int mm_id; unsigned int curr;
> > > int member; long size' >> tracing/synthetic_events
> > >
> > > # Create a hist trigger that emits the synthetic rss_stat_throttled
> > > # event only when the rss size crosses a 512KB boundary.
> > > echo 'hist:keys=common_pid:bucket=size/0x80000:onchange($bucket)
> > > .rss_stat_throttled(mm_id,curr,member,size)'
> > > >> events/kmem/rss_stat/trigger
> > >
> >
> > Sorry, I have a clerical mistake here. The above key should be:
> > s/keys=common_pid/keys=keys=mm_id,member
> >
> > The rss size is specific to the mm struct's member not the pid.
> > The results below were captured with the correct key so no changes there.
> >
> > > ------ Test Results ------
> > > Histograms can also be used to evaluate the effectiveness of this
> > > throttling by noting the Total Hits on each trigger:
> > >
> > > echo 'hist:keys=common_pid' >> events/sched/sched_switch/trigger
> > > echo 'hist:keys=common_pid' >> events/kmem/rss_stat/trigger
> > > echo 'hist:keys=common_pid'
> > > >> events/synthetic/rss_stat_throttled/trigger
> > >
> > > Allowing the above example (512KB granularity) run for 5 minutes on
> > > an arm64 device with 5.10 kernel:
> > >
> > > sched_switch : total hits = 147153
> > > rss_stat : total hits = 38863
> > > rss_stat_throttled: total hits = 2409
> > >
> > > The synthetic rss_stat_throttled event is ~16x less frequent than the
> > > rss_stat event when using a 512KB granularity.
> > >
> > >
> > > The results are more pronounced when rss size is changing at a higher
> > > rate in small increments. For instance the following results were obtained
> > > by recording the hits on the above events for a run of Android's
> > > lmkd_unit_test [2], which continually forks processes that map anonymous
> > > memory until there is an oom kill:
> > >
> > > sched_switch : total hits = 148832
> > > rss_stat : total hits = 4754802
> > > rss_stat_throttled: total hits = 96214
> > >
> > > In this stress this, the synthetic rss_stat_throttled event is ~50x less
> > > frequent than the rss_stat event when using a 512KB granularity.
> > >
> > >
> > > [1] https://lore.kernel.org/lkml/[email protected]/
> > > [2] https://cs.android.com/android/platform/superproject/+/master:system/memory/lmkd/tests/lmkd_test.cpp
> > >
> > > Signed-off-by: Kalesh Singh <[email protected]>
>
> Reviewed-by: Namhyung Kim <[email protected]>

Thanks for the review Namhyung!

>
> Thanks,
> Namhyung
>
>
> > >
> > > Kalesh Singh (5):
> > > tracing: Add support for creating hist trigger variables from literal
> > > tracing: Add division and multiplication support for hist triggers
> > > tracing: Fix operator precedence for hist triggers expression
> > > tracing/selftests: Add tests for hist trigger expression parsing
> > > tracing/histogram: Document expression arithmetic and constants
> > >
> > > Documentation/trace/histogram.rst | 14 +
> > > kernel/trace/trace_events_hist.c | 318 +++++++++++++++---
> > > .../testing/selftests/ftrace/test.d/functions | 4 +-
> > > .../trigger/trigger-hist-expressions.tc | 73 ++++
> > > 4 files changed, 357 insertions(+), 52 deletions(-)
> > > create mode 100644 tools/testing/selftests/ftrace/test.d/trigger/trigger-hist-expressions.tc
> > >
> > >
> > > base-commit: 3ca706c189db861b2ca2019a0901b94050ca49d8
> > > --
> > > 2.33.0.309.g3052b89438-goog
> > >

2021-10-19 20:48:37

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 2/5] tracing: Add division and multiplication support for hist triggers

On Wed, 15 Sep 2021 19:52:46 +0000
Kalesh Singh <[email protected]> wrote:

> +static u64 hist_field_div(struct hist_field *hist_field,
> + struct tracing_map_elt *elt,
> + struct trace_buffer *buffer,
> + struct ring_buffer_event *rbe,
> + void *event)
> +{
> + struct hist_field *operand1 = hist_field->operands[0];
> + struct hist_field *operand2 = hist_field->operands[1];
> +
> + u64 val1 = operand1->fn(operand1, elt, buffer, rbe, event);
> + u64 val2 = operand2->fn(operand2, elt, buffer, rbe, event);
> +
> + /* Return -1 for the undefined case */
> + if (!val2)
> + return -1;
> +
> + return val1 / val2;

This wont work on x86 32 bit machines, as u64 division will trigger
floating point arithmetic by the compiler, and cause a fault.

You'll need to use one of the div64() helpers.

-- Steve


> +}
> +

2021-10-19 20:51:05

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 1/5] tracing: Add support for creating hist trigger variables from literal

On Wed, 15 Sep 2021 19:52:45 +0000
Kalesh Singh <[email protected]> wrote:

> Currently hist trigger expressions don't support the use of numeric
> literals:
> e.g. echo 'hist:keys=common_pid:x=$y-1234'
> --> is not valid expression syntax
>
> Having the ability to use numeric constants in hist triggers supports
> a wider range of expressions for creating variables.

I'm not against the patch, but I'm curious to what use case this would be
useful for. In the cover letter it mentions the division and multiplication
for finding associated buckets, but what is the addition / subtraction used
for?

-- Steve


>
> Add support for creating trace event histogram variables from numeric
> literals.
>
> e.g. echo 'hist:keys=common_pid:x=1234,y=size-1024' >> event/trigger
>
> A negative numeric constant is created, using unary minus operator
> (parentheses are required).
>
> e.g. echo 'hist:keys=common_pid:z=-(2)' >> event/trigger
>
> Signed-off-by: Kalesh Singh <[email protected]>
>

2021-10-19 21:26:03

by Kalesh Singh

[permalink] [raw]
Subject: Re: [PATCH 1/5] tracing: Add support for creating hist trigger variables from literal

On Tue, Oct 19, 2021 at 1:48 PM Steven Rostedt <[email protected]> wrote:
>
> On Wed, 15 Sep 2021 19:52:45 +0000
> Kalesh Singh <[email protected]> wrote:
>
> > Currently hist trigger expressions don't support the use of numeric
> > literals:
> > e.g. echo 'hist:keys=common_pid:x=$y-1234'
> > --> is not valid expression syntax
> >
> > Having the ability to use numeric constants in hist triggers supports
> > a wider range of expressions for creating variables.
>
> I'm not against the patch, but I'm curious to what use case this would be
> useful for. In the cover letter it mentions the division and multiplication
> for finding associated buckets, but what is the addition / subtraction used
> for?

Addition and subtraction are already supported operations. The use
case given in Documentation/trace/histogram.rst is for calculating
latencies. I don't have a use case where the constants may be needed
in the addition/subtraction, but for simplicity and completeness we
support them for all operations.

- Kalesh

>
> -- Steve
>
>
> >
> > Add support for creating trace event histogram variables from numeric
> > literals.
> >
> > e.g. echo 'hist:keys=common_pid:x=1234,y=size-1024' >> event/trigger
> >
> > A negative numeric constant is created, using unary minus operator
> > (parentheses are required).
> >
> > e.g. echo 'hist:keys=common_pid:z=-(2)' >> event/trigger
> >
> > Signed-off-by: Kalesh Singh <[email protected]>
> >

2021-10-19 21:26:51

by Kalesh Singh

[permalink] [raw]
Subject: Re: [PATCH 2/5] tracing: Add division and multiplication support for hist triggers

On Tue, Oct 19, 2021 at 1:45 PM Steven Rostedt <[email protected]> wrote:
>
> On Wed, 15 Sep 2021 19:52:46 +0000
> Kalesh Singh <[email protected]> wrote:
>
> > +static u64 hist_field_div(struct hist_field *hist_field,
> > + struct tracing_map_elt *elt,
> > + struct trace_buffer *buffer,
> > + struct ring_buffer_event *rbe,
> > + void *event)
> > +{
> > + struct hist_field *operand1 = hist_field->operands[0];
> > + struct hist_field *operand2 = hist_field->operands[1];
> > +
> > + u64 val1 = operand1->fn(operand1, elt, buffer, rbe, event);
> > + u64 val2 = operand2->fn(operand2, elt, buffer, rbe, event);
> > +
> > + /* Return -1 for the undefined case */
> > + if (!val2)
> > + return -1;
> > +
> > + return val1 / val2;
>
> This wont work on x86 32 bit machines, as u64 division will trigger
> floating point arithmetic by the compiler, and cause a fault.

Thanks for catching this. I'll address it in the next version.

- Kalesh

>
> You'll need to use one of the div64() helpers.
>
> -- Steve
>
>
> > +}
> > +

2021-10-19 21:30:46

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 1/5] tracing: Add support for creating hist trigger variables from literal

On Tue, 19 Oct 2021 14:21:04 -0700
Kalesh Singh <[email protected]> wrote:

> Addition and subtraction are already supported operations. The use
> case given in Documentation/trace/histogram.rst is for calculating
> latencies. I don't have a use case where the constants may be needed
> in the addition/subtraction, but for simplicity and completeness we
> support them for all operations.

Perhaps we should state that it's mostly added for completeness then.

Thanks,

-- Steve

2021-10-19 21:51:51

by Kalesh Singh

[permalink] [raw]
Subject: Re: [PATCH 1/5] tracing: Add support for creating hist trigger variables from literal

On Tue, Oct 19, 2021 at 2:26 PM Steven Rostedt <[email protected]> wrote:
>
> On Tue, 19 Oct 2021 14:21:04 -0700
> Kalesh Singh <[email protected]> wrote:
>
> > Addition and subtraction are already supported operations. The use
> > case given in Documentation/trace/histogram.rst is for calculating
> > latencies. I don't have a use case where the constants may be needed
> > in the addition/subtraction, but for simplicity and completeness we
> > support them for all operations.
>
> Perhaps we should state that it's mostly added for completeness then.

Good point. I'll update the commit message to clarify it.

Thanks,
Kalesh
>
> Thanks,
>
> -- Steve