2024-03-27 07:02:57

by Kemeng Shi

Subject: [PATCH v2 0/6] Improve visibility of writeback

v1->v2:
-Send the cleanup to wq_monitor.py separately.
-Add a patch to avoid use-after-free of bdi.
-Rename wb_calc_cg_thresh to cgwb_calc_thresh as Tejun suggested.
-Use RCU walk to avoid use-after-free.
-Add debug output to each related patch.

This series tries to improve the visibility of writeback. Patch 1 makes
/sys/kernel/debug/bdi/xxx/stats show writeback info of the whole bdi
instead of only writeback info of the root cgroup. Patch 2 adds a new
debug file /sys/kernel/debug/bdi/xxx/wb_stats to show per-wb writeback
info. Patch 4 adds wb_monitor.py to monitor basic writeback info of a
running system; more info could be added on demand. The remaining
patches are some random cleanups. More details can be found in the
respective patches. Thanks!
This series is on top of patchset [1].

[1] https://lore.kernel.org/lkml/[email protected]/T/#mc6455784a63d0f8aa1a2f5aff325abcdf9336b76

The following domain hierarchy is tested:
                  global domain (320G)
                  /                  \
        cgroup domain1(10G)    cgroup domain2(10G)
                  |                   |
bdi              wb1                 wb2

/* all writeback info of bdi is successfully collected */
# cat /sys/kernel/debug/bdi/252:16/stats:
BdiWriteback: 448 kB
BdiReclaimable: 1303904 kB
BdiDirtyThresh: 189914124 kB
DirtyThresh: 195337564 kB
BackgroundThresh: 32516508 kB
BdiDirtied: 3591392 kB
BdiWritten: 2287488 kB
BdiWriteBandwidth: 322248 kBps
b_dirty: 0
b_io: 0
b_more_io: 2
b_dirty_time: 0
bdi_list: 1
state: 1
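
(These are plain debugfs files, so something like "watch -d cat
/sys/kernel/debug/bdi/252:16/stats" also works for eyeballing changes
between samples.)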

/* per wb writeback info is collected */
# cat /sys/kernel/debug/bdi/252:16/wb_stats:
WbCgIno: 1
WbWriteback: 0 kB
WbReclaimable: 0 kB
WbDirtyThresh: 0 kB
WbDirtied: 0 kB
WbWritten: 0 kB
WbWriteBandwidth: 102400 kBps
b_dirty: 0
b_io: 0
b_more_io: 0
b_dirty_time: 0
state: 1
WbCgIno: 4284
WbWriteback: 448 kB
WbReclaimable: 818944 kB
WbDirtyThresh: 3096524 kB
WbDirtied: 2266880 kB
WbWritten: 1447936 kB
WbWriteBandwidth: 214036 kBps
b_dirty: 0
b_io: 0
b_more_io: 1
b_dirty_time: 0
state: 5
WbCgIno: 4325
WbWriteback: 224 kB
WbReclaimable: 819392 kB
WbDirtyThresh: 2920088 kB
WbDirtied: 2551808 kB
WbWritten: 1732416 kB
WbWriteBandwidth: 201832 kBps
b_dirty: 0
b_io: 0
b_more_io: 1
b_dirty_time: 0
state: 5

/* monitor writeback info */
# ./wb_monitor.py 252:16 -c
                  writeback  reclaimable   dirtied   written    avg_bw
252:16_1                  0            0         0         0    102400
252:16_4284             672       820064   9230368   8410304    685612
252:16_4325             896       819840  10491264   9671648    652348
252:16                 1568      1639904  19721632  18081952   1440360


                  writeback  reclaimable   dirtied   written    avg_bw
252:16_1                  0            0         0         0    102400
252:16_4284             672       820064   9230368   8410304    685612
252:16_4325             896       819840  10491264   9671648    652348
252:16                 1568      1639904  19721632  18081952   1440360
...
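
For reference (per the script's argparse options in patch 4): the positional
arguments are bdi name patterns (all bdis if empty), -i/--interval sets the
refresh period in seconds (0 prints once and exits), and -j/--json prints one
dict per line instead of the table, e.g.:

# ./wb_monitor.py -i 5 252:16
# ./wb_monitor.py -j -c 252:16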

Kemeng Shi (6):
writeback: protect race between bdi release and bdi_debug_stats_show
writeback: collect stats of all wb of bdi in bdi_debug_stats_show
writeback: support retrieving per group debug writeback stats of bdi
writeback: add wb_monitor.py script to monitor writeback info on bdi
writeback: rename nr_reclaimable to nr_dirty in balance_dirty_pages
writeback: define GDTC_INIT_NO_WB to null

include/linux/writeback.h | 1 +
mm/backing-dev.c | 203 ++++++++++++++++++++++++++++++----
mm/page-writeback.c | 31 ++++--
tools/writeback/wb_monitor.py | 172 ++++++++++++++++++++++++++++
4 files changed, 378 insertions(+), 29 deletions(-)
create mode 100644 tools/writeback/wb_monitor.py

--
2.30.0



2024-03-27 07:03:27

by Kemeng Shi

Subject: [PATCH v2 4/6] writeback: add wb_monitor.py script to monitor writeback info on bdi

Add the wb_monitor.py script to monitor writeback information on a backing
device, which makes it easier and more convenient to observe the writeback
behavior of a running system.

The wb_monitor.py script is based on wq_monitor.py.

The following domain hierarchy is tested:
                  global domain (320G)
                  /                  \
        cgroup domain1(10G)    cgroup domain2(10G)
                  |                   |
bdi              wb1                 wb2

The wb_monitor.py script output is as follows:
./wb_monitor.py 252:16 -c
                  writeback  reclaimable   dirtied   written    avg_bw
252:16_1                  0            0         0         0    102400
252:16_4284             672       820064   9230368   8410304    685612
252:16_4325             896       819840  10491264   9671648    652348
252:16                 1568      1639904  19721632  18081952   1440360

                  writeback  reclaimable   dirtied   written    avg_bw
252:16_1                  0            0         0         0    102400
252:16_4284             672       820064   9230368   8410304    685612
252:16_4325             896       819840  10491264   9671648    652348
252:16                 1568      1639904  19721632  18081952   1440360
...

Signed-off-by: Kemeng Shi <[email protected]>
Suggested-by: Tejun Heo <[email protected]>
---
tools/writeback/wb_monitor.py | 172 ++++++++++++++++++++++++++++++++++
1 file changed, 172 insertions(+)
create mode 100644 tools/writeback/wb_monitor.py

diff --git a/tools/writeback/wb_monitor.py b/tools/writeback/wb_monitor.py
new file mode 100644
index 000000000000..5e3591f1f9a9
--- /dev/null
+++ b/tools/writeback/wb_monitor.py
@@ -0,0 +1,172 @@
+#!/usr/bin/env drgn
+#
+# Copyright (C) 2024 Kemeng Shi <[email protected]>
+# Copyright (C) 2024 Huawei Inc
+
+desc = """
+This is a drgn script based on wq_monitor.py to monitor writeback info on
+backing dev. For more info on drgn, visit https://github.com/osandov/drgn.
+
+  writeback(kB)     Amount of dirty pages currently being written back to
+                    disk.
+
+  reclaimable(kB)   Amount of pages currently reclaimable.
+
+  dirtied(kB)       Amount of pages that have been dirtied.
+
+  written(kB)       Amount of dirty pages that have been written back to disk.
+
+  avg_wb(kBps)      Smoothed estimate of the bandwidth for writing dirty pages
+                    back to disk.
+"""
+
+import signal
+import re
+import time
+import json
+
+import drgn
+from drgn.helpers.linux.list import list_for_each_entry
+
+import argparse
+parser = argparse.ArgumentParser(description=desc,
+                                 formatter_class=argparse.RawTextHelpFormatter)
+parser.add_argument('bdi', metavar='REGEX', nargs='*',
+                    help='Target backing device name patterns (all if empty)')
+parser.add_argument('-i', '--interval', metavar='SECS', type=float, default=1,
+                    help='Monitoring interval (0 to print once and exit)')
+parser.add_argument('-j', '--json', action='store_true',
+                    help='Output in json')
+parser.add_argument('-c', '--cgroup', action='store_true',
+                    help='Show per-cgroup writeback of the bdi')
+args = parser.parse_args()
+
+bdi_list = prog['bdi_list']
+
+WB_RECLAIMABLE = prog['WB_RECLAIMABLE']
+WB_WRITEBACK = prog['WB_WRITEBACK']
+WB_DIRTIED = prog['WB_DIRTIED']
+WB_WRITTEN = prog['WB_WRITTEN']
+NR_WB_STAT_ITEMS = prog['NR_WB_STAT_ITEMS']
+
+PAGE_SHIFT = prog['PAGE_SHIFT']
+
+def K(x):
+    return x << (PAGE_SHIFT - 10)
+
+class Stats:
+    def dict(self, now):
+        return { 'timestamp'    : now,
+                 'name'         : self.name,
+                 'writeback'    : self.stats[WB_WRITEBACK],
+                 'reclaimable'  : self.stats[WB_RECLAIMABLE],
+                 'dirtied'      : self.stats[WB_DIRTIED],
+                 'written'      : self.stats[WB_WRITTEN],
+                 'avg_wb'       : self.avg_bw, }
+
+    def table_header_str():
+        return f'{"":>16} {"writeback":>10} {"reclaimable":>12} ' \
+               f'{"dirtied":>9} {"written":>9} {"avg_bw":>9}'
+
+    def table_row_str(self):
+        out = f'{self.name[-16:]:16} ' \
+              f'{self.stats[WB_WRITEBACK]:10} ' \
+              f'{self.stats[WB_RECLAIMABLE]:12} ' \
+              f'{self.stats[WB_DIRTIED]:9} ' \
+              f'{self.stats[WB_WRITTEN]:9} ' \
+              f'{self.avg_bw:9} '
+        return out
+
+    def show_header():
+        if Stats.table_fmt:
+            print()
+            print(Stats.table_header_str())
+
+    def show_stats(self):
+        if Stats.table_fmt:
+            print(self.table_row_str())
+        else:
+            print(self.dict(Stats.now))
+
+class WbStats(Stats):
+    def __init__(self, wb):
+        bdi_name = wb.bdi.dev_name.string_().decode()
+        # Avoid using bdi.wb.memcg_css, which is only defined when
+        # CONFIG_CGROUP_WRITEBACK is enabled.
+        if wb == wb.bdi.wb.address_of_():
+            ino = "1"
+        else:
+            ino = str(wb.memcg_css.cgroup.kn.id.value_())
+        self.name = bdi_name + '_' + ino
+
+        self.stats = [0] * NR_WB_STAT_ITEMS
+        for i in range(NR_WB_STAT_ITEMS):
+            if wb.stat[i].count >= 0:
+                self.stats[i] = int(K(wb.stat[i].count))
+            else:
+                self.stats[i] = 0
+
+        self.avg_bw = int(K(wb.avg_write_bandwidth))
+
+class BdiStats(Stats):
+    def __init__(self, bdi):
+        self.name = bdi.dev_name.string_().decode()
+        self.stats = [0] * NR_WB_STAT_ITEMS
+        self.avg_bw = 0
+
+    def collectStats(self, wb_stats):
+        for i in range(NR_WB_STAT_ITEMS):
+            self.stats[i] += wb_stats.stats[i]
+
+        self.avg_bw += wb_stats.avg_bw
+
+exit_req = False
+
+def sigint_handler(signr, frame):
+    global exit_req
+    exit_req = True
+
+def main():
+    # handle args
+    Stats.table_fmt = not args.json
+    interval = args.interval
+    cgroup = args.cgroup
+
+    re_str = None
+    if args.bdi:
+        for r in args.bdi:
+            if re_str is None:
+                re_str = r
+            else:
+                re_str += '|' + r
+
+    filter_re = re.compile(re_str) if re_str else None
+
+    # monitoring loop
+    signal.signal(signal.SIGINT, sigint_handler)
+
+    while not exit_req:
+        Stats.now = time.time()
+
+        Stats.show_header()
+        for bdi in list_for_each_entry('struct backing_dev_info', bdi_list.address_of_(), 'bdi_list'):
+            bdi_stats = BdiStats(bdi)
+            if filter_re and not filter_re.search(bdi_stats.name):
+                continue
+
+            for wb in list_for_each_entry('struct bdi_writeback', bdi.wb_list.address_of_(), 'bdi_node'):
+                wb_stats = WbStats(wb)
+                bdi_stats.collectStats(wb_stats)
+                if cgroup:
+                    wb_stats.show_stats()
+
+            bdi_stats.show_stats()
+            if cgroup and Stats.table_fmt:
+                print()
+
+        if interval == 0:
+            break
+        time.sleep(interval)
+
+if __name__ == "__main__":
+    main()
--
2.30.0


2024-03-27 17:40:20

by Andrew Morton

Subject: Re: [PATCH v2 0/6] Improve visibility of writeback

On Wed, 27 Mar 2024 23:57:45 +0800 Kemeng Shi <[email protected]> wrote:

> This series tries to improve visilibity of writeback.

Well... why? Is anyone usefully using the existing instrumentation?
What is to be gained by expanding it further? What is the case for
adding this code?

I don't recall hearing of anyone using the existing debug
instrumentation so perhaps we should remove it!

Also, I hit a build error and a pile of warnings with an arm
allnoconfig build.

2024-03-28 02:00:01

by Kemeng Shi

Subject: Re: [PATCH v2 0/6] Improve visibility of writeback


on 3/28/2024 1:40 AM, Andrew Morton wrote:
> On Wed, 27 Mar 2024 23:57:45 +0800 Kemeng Shi <[email protected]> wrote:
>
>> This series tries to improve visilibity of writeback.
>
> Well... why? Is anyone usefully using the existing instrumentation?
> What is to be gained by expanding it further? What is the case for
> adding this code?
>
> I don't recall hearing of anyone using the existing debug
> instrumentation so perhaps we should remove it!
Hi Andrew, this was discussed in [1]. In short, I use the debug files
to test the changes in the submitted patchset [1]. The wb_monitor.py
script was suggested by Tejun in [2] to improve visibility of
writeback.
>
> Also, I hit a build error and a pile of warnings with an arm
> allnoconfig build.
>
Sorry for this, I only tested on x86. I will look into this and
fix the build problem in the next version.

[1] https://lore.kernel.org/lkml/[email protected]/
[2] https://lore.kernel.org/lkml/[email protected]/
[3] https://lore.kernel.org/lkml/[email protected]/


2024-03-28 08:24:16

by Kemeng Shi

Subject: Re: [PATCH v2 0/6] Improve visibility of writeback



on 3/28/2024 9:59 AM, Kemeng Shi wrote:
>
> on 3/28/2024 1:40 AM, Andrew Morton wrote:
>> On Wed, 27 Mar 2024 23:57:45 +0800 Kemeng Shi <[email protected]> wrote:
>>
>>> This series tries to improve visilibity of writeback.
>>
>> Well... why? Is anyone usefully using the existing instrumentation?
>> What is to be gained by expanding it further? What is the case for
>> adding this code?
>>
>> I don't recall hearing of anyone using the existing debug
>> instrumentation so perhaps we should remove it!
> Hi Andrew, this was discussed in [1]. In short, I use the
> debug files to test change in submit patchset [1]. The
> wb_monitor.py is suggested by Tejun in [2] to improve
> visibility of writeback.
> I use the debug files to test change in [1]. The wb_monitor.py is suggested by Tejun
> in [2] to improve visibility of writeback.
>>
>> Also, I hit a build error and a pile of warnings with an arm
>> allnoconfig build.
With an arm allnoconfig build on the up-to-date mm-unstable branch, I don't
hit any build error, only some warnings like the following:
..
mm/page-writeback.c: In function 'cgwb_calc_thresh':
mm/page-writeback.c:906:13: warning: 'writeback' is used uninitialized in this function [-Wuninitialized]
  906 |         mdtc.dirty += writeback;
      |         ~~~~~~~~~~~^~~~~~~~~~~~
In file included from ./include/linux/kernel.h:28,
                 from mm/page-writeback.c:15:
./include/linux/minmax.h:46:54: warning: 'filepages' is used uninitialized in this function [-Wuninitialized]
   46 | #define __cmp(op, x, y) ((x) __cmp_op_##op (y) ? (x) : (y))
      |                                                      ^
mm/page-writeback.c:898:16: note: 'filepages' was declared here
  898 |         unsigned long filepages, headroom, writeback;
      |                       ^~~~~~~~~
In file included from ./include/linux/kernel.h:28,
                 from mm/page-writeback.c:15:
./include/linux/minmax.h:46:54: warning: 'headroom' is used uninitialized in this function [-Wuninitialized]
   46 | #define __cmp(op, x, y) ((x) __cmp_op_##op (y) ? (x) : (y))
      |                                                      ^
mm/page-writeback.c:898:27: note: 'headroom' was declared here
  898 |         unsigned long filepages, headroom, writeback;
      |                                  ^~~~~~~~
..

The only reason I can think of is that I also applied patchset [1]
for the build. I mentioned patchset [1] in the cover letter but I forgot
to note the dependency on that patchset.
If this is the reason for the build error, I will send a new
set based on mm-unstable in the next version.

Thanks,
Kemeng

[1] https://lore.kernel.org/lkml/[email protected]/T/#mc6455784a63d0f8aa1a2f5aff325abcdf9336b76
>>
> Sorry for this, I only tested on x86. I will look into this and
> fix the build problem in next version.
>
> [1] https://lore.kernel.org/lkml/[email protected]/
> [2] https://lore.kernel.org/lkml/[email protected]/
> [3] https://lore.kernel.org/lkml/[email protected]/
>


2024-03-28 19:15:22

by Kent Overstreet

Subject: Re: [PATCH v2 0/6] Improve visibility of writeback

On Wed, Mar 27, 2024 at 10:40:10AM -0700, Andrew Morton wrote:
> On Wed, 27 Mar 2024 23:57:45 +0800 Kemeng Shi <[email protected]> wrote:
>
> > This series tries to improve visilibity of writeback.
>
> Well... why? Is anyone usefully using the existing instrumentation?
> What is to be gained by expanding it further? What is the case for
> adding this code?
>
> I don't recall hearing of anyone using the existing debug
> instrumentation so perhaps we should remove it!

Remove debug instrumentation!? Surely you jest?

I generally don't hear from users of my code when things are working, I
only expect to hear from people when it's not.

2024-03-28 19:24:30

by Andrew Morton

Subject: Re: [PATCH v2 0/6] Improve visibility of writeback

On Thu, 28 Mar 2024 15:15:03 -0400 Kent Overstreet <[email protected]> wrote:

> On Wed, Mar 27, 2024 at 10:40:10AM -0700, Andrew Morton wrote:
> > On Wed, 27 Mar 2024 23:57:45 +0800 Kemeng Shi <[email protected]> wrote:
> >
> > > This series tries to improve visilibity of writeback.
> >
> > Well... why? Is anyone usefully using the existing instrumentation?
> > What is to be gained by expanding it further? What is the case for
> > adding this code?
> >
> > I don't recall hearing of anyone using the existing debug
> > instrumentation so perhaps we should remove it!
>
> Remove debug instrumentation!? Surely you just?

Absolutely not. Any code in the kernel should have ongoing
justification for remaining there. If no such justification exists,
out it goes.


2024-03-28 19:32:01

by Kent Overstreet

Subject: Re: [PATCH v2 0/6] Improve visibility of writeback

On Wed, Mar 27, 2024 at 11:57:45PM +0800, Kemeng Shi wrote:
> v1->v2:
> -Send cleanup to wq_monitor.py separately.
> -Add patch to avoid use after free of bdi.
> -Rename wb_calc_cg_thresh to cgwb_calc_thresh as Tejun suggested.
> -Use rcu walk to avoid use after free.
> -Add debug output to each related patches.
>
> This series tries to improve visilibity of writeback. Patch 1 make
> /sys/kernel/debug/bdi/xxx/stats show writeback info of whole bdi
> instead of only writeback info in root cgroup. Patch 2 add a new
> debug file /sys/kernel/debug/bdi/xxx/wb_stats to show per wb writeback
> info. Patch 4 add wb_monitor.py to monitor basic writeback info
> of running system, more info could be added on demand. Rest patches
> are some random cleanups. More details can be found in respective
> patches. Thanks!
> This series is on top of patchset [1].
>
> [1] https://lore.kernel.org/lkml/[email protected]/T/#mc6455784a63d0f8aa1a2f5aff325abcdf9336b76

Not bad

I've been trying to improve our ability to debug latency issues - stalls
of all sorts. While you're looking at all this code, do you think you
could find some places to collect useful latency numbers?

fs/bcachefs/time_stats.c has some code that's going to be moving out to
lib/ at some point, after I switch it to MAD; if you could hook that up
as well to a few points we could see at a glance if there are stalls
happening in the writeback path.

>
> Following domain hierarchy is tested:
> global domain (320G)
> / \
> cgroup domain1(10G) cgroup domain2(10G)
> | |
> bdi wb1 wb2
>
> /* all writeback info of bdi is successfully collected */
> # cat /sys/kernel/debug/bdi/252:16/stats:
> BdiWriteback: 448 kB
> BdiReclaimable: 1303904 kB
> BdiDirtyThresh: 189914124 kB
> DirtyThresh: 195337564 kB
> BackgroundThresh: 32516508 kB
> BdiDirtied: 3591392 kB
> BdiWritten: 2287488 kB
> BdiWriteBandwidth: 322248 kBps
> b_dirty: 0
> b_io: 0
> b_more_io: 2
> b_dirty_time: 0
> bdi_list: 1
> state: 1
>
> /* per wb writeback info is collected */
> # cat /sys/kernel/debug/bdi/252:16/wb_stats:
> cat wb_stats
> WbCgIno: 1
> WbWriteback: 0 kB
> WbReclaimable: 0 kB
> WbDirtyThresh: 0 kB
> WbDirtied: 0 kB
> WbWritten: 0 kB
> WbWriteBandwidth: 102400 kBps
> b_dirty: 0
> b_io: 0
> b_more_io: 0
> b_dirty_time: 0
> state: 1
> WbCgIno: 4284
> WbWriteback: 448 kB
> WbReclaimable: 818944 kB
> WbDirtyThresh: 3096524 kB
> WbDirtied: 2266880 kB
> WbWritten: 1447936 kB
> WbWriteBandwidth: 214036 kBps
> b_dirty: 0
> b_io: 0
> b_more_io: 1
> b_dirty_time: 0
> state: 5
> WbCgIno: 4325
> WbWriteback: 224 kB
> WbReclaimable: 819392 kB
> WbDirtyThresh: 2920088 kB
> WbDirtied: 2551808 kB
> WbWritten: 1732416 kB
> WbWriteBandwidth: 201832 kBps
> b_dirty: 0
> b_io: 0
> b_more_io: 1
> b_dirty_time: 0
> state: 5
>
> /* monitor writeback info */
> # ./wb_monitor.py 252:16 -c
> writeback reclaimable dirtied written avg_bw
> 252:16_1 0 0 0 0 102400
> 252:16_4284 672 820064 9230368 8410304 685612
> 252:16_4325 896 819840 10491264 9671648 652348
> 252:16 1568 1639904 19721632 18081952 1440360
>
>
> writeback reclaimable dirtied written avg_bw
> 252:16_1 0 0 0 0 102400
> 252:16_4284 672 820064 9230368 8410304 685612
> 252:16_4325 896 819840 10491264 9671648 652348
> 252:16 1568 1639904 19721632 18081952 1440360
> ...
>
> Kemeng Shi (6):
> writeback: protect race between bdi release and bdi_debug_stats_show
> writeback: collect stats of all wb of bdi in bdi_debug_stats_show
> writeback: support retrieving per group debug writeback stats of bdi
> writeback: add wb_monitor.py script to monitor writeback info on bdi
> writeback: rename nr_reclaimable to nr_dirty in balance_dirty_pages
> writeback: define GDTC_INIT_NO_WB to null
>
> include/linux/writeback.h | 1 +
> mm/backing-dev.c | 203 ++++++++++++++++++++++++++++++----
> mm/page-writeback.c | 31 ++++--
> tools/writeback/wb_monitor.py | 172 ++++++++++++++++++++++++++++
> 4 files changed, 378 insertions(+), 29 deletions(-)
> create mode 100644 tools/writeback/wb_monitor.py
>
> --
> 2.30.0
>

2024-03-28 19:32:17

by Tejun Heo

Subject: Re: [PATCH v2 0/6] Improve visibility of writeback

Hello, Kent.

On Thu, Mar 28, 2024 at 03:24:35PM -0400, Kent Overstreet wrote:
> fs/bcachefs/time_stats.c has some code that's going to be moving out to
> lib/ at some point, after I switch it to MAD; if you could hook that up
> as well to a few points we could see at a glance if there are stalls
> happening in the writeback path.

Using BPF (whether through bcc or bpftrace) is likely a better approach for
this sort of detailed instrumentation. Fixed debug information is useful, but
it's also a common occurrence that it doesn't quite reveal the full picture
of what one's trying to understand and one needs to dig a bit deeper, wider,
aggregate data in a different way, or whatever.

So, rather than adding more fixed infrastructure, I'd suggest adding places
which can easily be instrumented using the existing tools (they are really
great once you get used to them) whether that's tracepoints or just
strategically placed noinline functions.
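
As a rough illustration of the kind of thing bcc makes easy (a minimal
sketch, not one of the polished tools like biolatpcts; it assumes the
block:block_rq_issue / block_rq_complete tracepoints and their dev/sector
fields, which can vary across kernel versions):

#!/usr/bin/env python3
# Minimal bcc sketch: log2 histogram of block request issue->complete latency.
from time import sleep
from bcc import BPF

bpf_text = """
BPF_HASH(start, u64, u64);
BPF_HISTOGRAM(lat_usecs);

TRACEPOINT_PROBE(block, block_rq_issue)
{
    /* dev+sector is good enough to match issue to completion for a sketch */
    u64 key = ((u64)args->dev << 48) ^ args->sector;
    u64 ts = bpf_ktime_get_ns();
    start.update(&key, &ts);
    return 0;
}

TRACEPOINT_PROBE(block, block_rq_complete)
{
    u64 key = ((u64)args->dev << 48) ^ args->sector;
    u64 *tsp = start.lookup(&key);
    if (tsp) {
        lat_usecs.increment(bpf_log2l((bpf_ktime_get_ns() - *tsp) / 1000));
        start.delete(&key);
    }
    return 0;
}
"""

b = BPF(text=bpf_text)
print("Tracing block I/O latency... hit Ctrl-C to print the histogram.")
try:
    sleep(99999999)
except KeyboardInterrupt:
    pass
b["lat_usecs"].print_log2_hist("usecs")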

Thanks.

--
tejun

2024-03-28 19:37:22

by Kent Overstreet

Subject: Re: [PATCH v2 0/6] Improve visibility of writeback

On Thu, Mar 28, 2024 at 12:23:52PM -0700, Andrew Morton wrote:
> On Thu, 28 Mar 2024 15:15:03 -0400 Kent Overstreet <[email protected]> wrote:
>
> > On Wed, Mar 27, 2024 at 10:40:10AM -0700, Andrew Morton wrote:
> > > On Wed, 27 Mar 2024 23:57:45 +0800 Kemeng Shi <[email protected]> wrote:
> > >
> > > > This series tries to improve visilibity of writeback.
> > >
> > > Well... why? Is anyone usefully using the existing instrumentation?
> > > What is to be gained by expanding it further? What is the case for
> > > adding this code?
> > >
> > > I don't recall hearing of anyone using the existing debug
> > > instrumentation so perhaps we should remove it!
> >
> > Remove debug instrumentation!? Surely you just?
>
> Absolutely not. Any code in the kernel should have ongoing
> justification for remaining there. If no such justification exists,
> out it goes.

Certainly, but this isn't remotely a case where I'd expect to be getting
that kind of feedback. Debugging instrumentation is very much a case
where no one notices it 99% of the time, but when you need it you
_really_ need it.

Not having it can turn a 10 minute "oh, that thing is acting wonky -
it's because your system is overloaded/your drive is wonky/x subsystem
sucks, we know about it and we're working on it" into a weeklong
bughunt, burning up expensive engineer time pointlessly.

To debug complex systems efficiently, in production, in the wild, we
need to be able to see what's going on - we need more of this stuff.

Not to say that this couldn't use more work - perhaps additional focus
on what kinds of issues we expect to need to debug with this, what the
numbers mean and are useful for, documentation on how this relates to
writeback internals, etc.

2024-03-28 19:40:24

by Kent Overstreet

Subject: Re: [PATCH v2 0/6] Improve visibility of writeback

On Thu, Mar 28, 2024 at 09:31:57AM -1000, Tejun Heo wrote:
> Hello, Kent.
>
> On Thu, Mar 28, 2024 at 03:24:35PM -0400, Kent Overstreet wrote:
> > fs/bcachefs/time_stats.c has some code that's going to be moving out to
> > lib/ at some point, after I switch it to MAD; if you could hook that up
> > as well to a few points we could see at a glance if there are stalls
> > happening in the writeback path.
>
> Using BPF (whether through bcc or bpftrace) is likely a better approach for
> this sort of detailed instrumentation. Fixed debug information is useful and
> it's also a common occurrence that they don't quite reveal the full picture
> of what one's trying to understand and one needs to dig a bit deeper, wider,
> aggregate data in a different way, or whatever.
>
> So, rather than adding more fixed infrastructure, I'd suggest adding places
> which can easily be instrumented using the existing tools (they are really
> great once you get used to them) whether that's tracepoints or just
> strategically placed noinline functions.

Collecting latency numbers at various key places is _enormously_ useful.
The hard part is deciding where it's useful to collect; that requires
intimate knowledge of the code. Once you're defining those collection
poitns statically, doing it with BPF is just another useless layer of
indirection.

The time stats stuff I wrote is _really_ cheap, and you really want this
stuff always on so that you've actually got the data you need when
you're bughunting.

2024-03-28 19:46:55

by Tejun Heo

Subject: Re: [PATCH v2 0/6] Improve visibility of writeback

Hello,

On Thu, Mar 28, 2024 at 03:40:02PM -0400, Kent Overstreet wrote:
> Collecting latency numbers at various key places is _enormously_ useful.
> The hard part is deciding where it's useful to collect; that requires
> intimate knowledge of the code. Once you're defining those collection
> poitns statically, doing it with BPF is just another useless layer of
> indirection.

Given how much flexibility helps with debugging, claiming it useless is a
stretch.

> The time stats stuff I wrote is _really_ cheap, and you really want this
> stuff always on so that you've actually got the data you need when
> you're bughunting.

For some stats and some use cases, always being available is useful and
building fixed infra for them makes sense. For other stats and other use
cases, flexibility is pretty useful too (e.g. what if you want percentile
distribution which is filtered by some criteria?). They aren't mutually
exclusive and I'm not sure bdi wb instrumentation is on top of enough
people's minds.

As for overhead, BPF instrumentation can be _really_ cheap too. We often run
these programs per packet.

Thanks.

--
tejun

2024-03-28 20:01:40

by Kent Overstreet

Subject: Re: [PATCH v2 0/6] Improve visibility of writeback

On Thu, Mar 28, 2024 at 09:46:39AM -1000, Tejun Heo wrote:
> Hello,
>
> On Thu, Mar 28, 2024 at 03:40:02PM -0400, Kent Overstreet wrote:
> > Collecting latency numbers at various key places is _enormously_ useful.
> > The hard part is deciding where it's useful to collect; that requires
> > intimate knowledge of the code. Once you're defining those collection
> > poitns statically, doing it with BPF is just another useless layer of
> > indirection.
>
> Given how much flexibility helps with debugging, claiming it useless is a
> stretch.

Well, what would it add?

> > The time stats stuff I wrote is _really_ cheap, and you really want this
> > stuff always on so that you've actually got the data you need when
> > you're bughunting.
>
> For some stats and some use cases, always being available is useful and
> building fixed infra for them makes sense. For other stats and other use
> cases, flexibility is pretty useful too (e.g. what if you want percentile
> distribution which is filtered by some criteria?). They aren't mutually
> exclusive and I'm not sure bdi wb instrumentation is on top of enough
> people's minds.
>
> As for overhead, BPF instrumentation can be _really_ cheap too. We often run
> these programs per packet.

The main things I want are just
- elapsed time since last writeback IO completed, so we can see at a
glance if it's stalled
- time stats on writeback io initiation to completion

The main value of this one will be tracking down tail latency issues and
finding out where in the stack they originate.

2024-03-28 20:13:50

by Tejun Heo

Subject: Re: [PATCH v2 0/6] Improve visibility of writeback

Hello,

On Thu, Mar 28, 2024 at 03:55:32PM -0400, Kent Overstreet wrote:
> > On Thu, Mar 28, 2024 at 03:40:02PM -0400, Kent Overstreet wrote:
> > > Collecting latency numbers at various key places is _enormously_ useful.
> > > The hard part is deciding where it's useful to collect; that requires
> > > intimate knowledge of the code. Once you're defining those collection
> > > poitns statically, doing it with BPF is just another useless layer of
> > > indirection.
> >
> > Given how much flexibility helps with debugging, claiming it useless is a
> > stretch.
>
> Well, what would it add?

It depends on the case but here's an example. If I'm seeing occasional tail
latency spikes, I'd want to know whether there's any correlation with
specific types or sizes of IOs and if so who's issuing them and why. With
BPF, you can detect those conditions to tag and capture where exactly those
IOs are coming from and aggregate the result however you like across
thousands of machines in production without anyone noticing. That's useful,
no?

Also, actual percentile distribution is almost always a lot more insightful
than more coarsely aggregated numbers. We can't add all that to fixed infra.
In most cases not because the runtime overhead would be too high but because
the added interface and code complexity and maintenance overhead isn't
justifiable given how niche, ad hoc and varied these use cases get.

> > > The time stats stuff I wrote is _really_ cheap, and you really want this
> > > stuff always on so that you've actually got the data you need when
> > > you're bughunting.
> >
> > For some stats and some use cases, always being available is useful and
> > building fixed infra for them makes sense. For other stats and other use
> > cases, flexibility is pretty useful too (e.g. what if you want percentile
> > distribution which is filtered by some criteria?). They aren't mutually
> > exclusive and I'm not sure bdi wb instrumentation is on top of enough
> > people's minds.
> >
> > As for overhead, BPF instrumentation can be _really_ cheap too. We often run
> > these programs per packet.
>
> The main things I want are just
> - elapsed time since last writeback IO completed, so we can see at a
> glance if it's stalled
> - time stats on writeback io initiation to completion
>
> The main value of this one will be tracking down tail latency issues and
> finding out where in the stack they originate.

Yeah, I mean, if always keeping those numbers around is useful for wide
enough number of users and cases, sure, go ahead and add fixed infra. I'm
not quite sure bdi wb stats fall in that bucket given how little attention
it usually gets.

Thanks.

--
tejun

2024-03-28 20:25:40

by Kent Overstreet

Subject: Re: [PATCH v2 0/6] Improve visibility of writeback

On Thu, Mar 28, 2024 at 10:13:27AM -1000, Tejun Heo wrote:
> Hello,
>
> On Thu, Mar 28, 2024 at 03:55:32PM -0400, Kent Overstreet wrote:
> > > On Thu, Mar 28, 2024 at 03:40:02PM -0400, Kent Overstreet wrote:
> > > > Collecting latency numbers at various key places is _enormously_ useful.
> > > > The hard part is deciding where it's useful to collect; that requires
> > > > intimate knowledge of the code. Once you're defining those collection
> > > > poitns statically, doing it with BPF is just another useless layer of
> > > > indirection.
> > >
> > > Given how much flexibility helps with debugging, claiming it useless is a
> > > stretch.
> >
> > Well, what would it add?
>
> It depends on the case but here's an example. If I'm seeing occasional tail
> latency spikes, I'd want to know whether there's any correation with
> specific types or sizes of IOs and if so who's issuing them and why. With
> BPF, you can detect those conditions to tag and capture where exactly those
> IOs are coming from and aggregate the result however you like across
> thousands of machines in production without anyone noticing. That's useful,
> no?

That's cool, but really esoteric. We need to be able to answer basic
questions and build an overall picture of what the system is doing
without having to reach for the big stuff.

Most users are never going to touch tracing, let alone BPF; that's too
much setup. But I can and do regularly tell users "check this, this and
this" and debug things on that basis without ever touching their
machine.

And basic latency numbers are really easy for users to understand, that
makes them doubly worthwhile to collect and make visible.

> Also, actual percentile disribution is almost always a lot more insightful
> than more coarsely aggregated numbers. We can't add all that to fixed infra.
> In most cases not because runtime overhead would be too hight but because
> the added interface and code complexity and maintenance overhead isn't
> justifiable given how niche, adhoc and varied these use cases get.

You can't calculate percentiles accurately and robustly in one pass -
that only works if your input data obeys a nice statistical
distribution, and the cases we care about are the ones where it doesn't.

>
> > > > The time stats stuff I wrote is _really_ cheap, and you really want this
> > > > stuff always on so that you've actually got the data you need when
> > > > you're bughunting.
> > >
> > > For some stats and some use cases, always being available is useful and
> > > building fixed infra for them makes sense. For other stats and other use
> > > cases, flexibility is pretty useful too (e.g. what if you want percentile
> > > distribution which is filtered by some criteria?). They aren't mutually
> > > exclusive and I'm not sure bdi wb instrumentation is on top of enough
> > > people's minds.
> > >
> > > As for overhead, BPF instrumentation can be _really_ cheap too. We often run
> > > these programs per packet.
> >
> > The main things I want are just
> > - elapsed time since last writeback IO completed, so we can see at a
> > glance if it's stalled
> > - time stats on writeback io initiation to completion
> >
> > The main value of this one will be tracking down tail latency issues and
> > finding out where in the stack they originate.
>
> Yeah, I mean, if always keeping those numbers around is useful for wide
> enough number of users and cases, sure, go ahead and add fixed infra. I'm
> not quite sure bdi wb stats fall in that bucket given how little attention
> it usually gets.

I think it should be getting a lot more attention given that memory
reclaim and writeback are generally implicated whenever a user complains
about their system going out to lunch.

2024-03-28 20:52:58

by Tejun Heo

Subject: Re: [PATCH v2 0/6] Improve visibility of writeback

Hello, Kent.

On Thu, Mar 28, 2024 at 04:22:13PM -0400, Kent Overstreet wrote:
> Most users are never going to touch tracing, let alone BPF; that's too
> much setup. But I can and do regularly tell users "check this, this and
> this" and debug things on that basis without ever touching their
> machine.

I think this is where the disconnect is. It's not difficult to set up at
all. Nowadays, in most distros, it comes down to something like run "pacman
-S bcc" and then run "/usr/share/bcc/tools/biolatpcts" with these params or
run this script I'm attaching. It is a significant boost when debugging many
different kernel issues. I strongly suggest giving it a try and getting used
to it rather than resisting it.

Thanks.

--
tejun

2024-03-28 20:54:05

by Kent Overstreet

Subject: Re: [PATCH v2 0/6] Improve visibility of writeback

On Thu, Mar 28, 2024 at 10:46:33AM -1000, Tejun Heo wrote:
> Hello, Kent.
>
> On Thu, Mar 28, 2024 at 04:22:13PM -0400, Kent Overstreet wrote:
> > Most users are never going to touch tracing, let alone BPF; that's too
> > much setup. But I can and do regularly tell users "check this, this and
> > this" and debug things on that basis without ever touching their
> > machine.
>
> I think this is where the disconnect is. It's not difficult to set up at
> all. Nowadays, in most distros, it comes down to something like run "pacman
> -S bcc" and then run "/usr/share/bcc/tools/biolatpcts" with these params or
> run this script I'm attaching. It is a signficant boost when debugging many
> different kernel issues. I strongly suggest giving it a try and getting used
> to it rather than resisting it.

The disconnect is with you, Tejun. I use tracing all the time, and it's
not going to be much of a discussion if you're not going to formulate a
response.

2024-04-03 06:59:00

by Kemeng Shi

Subject: Re: [PATCH v2 0/6] Improve visibility of writeback



on 3/29/2024 3:24 AM, Kent Overstreet wrote:
> On Wed, Mar 27, 2024 at 11:57:45PM +0800, Kemeng Shi wrote:
>> v1->v2:
>> -Send cleanup to wq_monitor.py separately.
>> -Add patch to avoid use after free of bdi.
>> -Rename wb_calc_cg_thresh to cgwb_calc_thresh as Tejun suggested.
>> -Use rcu walk to avoid use after free.
>> -Add debug output to each related patches.
>>
>> This series tries to improve visilibity of writeback. Patch 1 make
>> /sys/kernel/debug/bdi/xxx/stats show writeback info of whole bdi
>> instead of only writeback info in root cgroup. Patch 2 add a new
>> debug file /sys/kernel/debug/bdi/xxx/wb_stats to show per wb writeback
>> info. Patch 4 add wb_monitor.py to monitor basic writeback info
>> of running system, more info could be added on demand. Rest patches
>> are some random cleanups. More details can be found in respective
>> patches. Thanks!
>> This series is on top of patchset [1].
>>
>> [1] https://lore.kernel.org/lkml/[email protected]/T/#mc6455784a63d0f8aa1a2f5aff325abcdf9336b76
>
> Not bad
>
Hi Kent,
> I've been trying to improve our ability to debug latency issues - stalls
> of all sorts. While you're looking at all this code, do you think you
> could find some places to collect useful latency numbers?
I would like to do that to collect more useful info for writeback.
>
> fs/bcachefs/time_stats.c has some code that's going to be moving out to
> lib/ at some point, after I switch it to MAD; if you could hook that up
> as well to a few points we could see at a glance if there are stalls
> happening in the writeback path.
I see that Tejun recommends using BPF. I don't know much about BPF or the
new approach in time_stats.c. Personally, I think it's better to use the
new approach after the work of moving the code to lib/ is merged.
Then I would like to submit a patchset to discuss using it in writeback.
Would this make sense to you? Looking forward to your reply!

Thanks,
Kemeng
>
>>
>> Following domain hierarchy is tested:
>> global domain (320G)
>> / \
>> cgroup domain1(10G) cgroup domain2(10G)
>> | |
>> bdi wb1 wb2
>>
>> /* all writeback info of bdi is successfully collected */
>> # cat /sys/kernel/debug/bdi/252:16/stats:
>> BdiWriteback: 448 kB
>> BdiReclaimable: 1303904 kB
>> BdiDirtyThresh: 189914124 kB
>> DirtyThresh: 195337564 kB
>> BackgroundThresh: 32516508 kB
>> BdiDirtied: 3591392 kB
>> BdiWritten: 2287488 kB
>> BdiWriteBandwidth: 322248 kBps
>> b_dirty: 0
>> b_io: 0
>> b_more_io: 2
>> b_dirty_time: 0
>> bdi_list: 1
>> state: 1
>>
>> /* per wb writeback info is collected */
>> # cat /sys/kernel/debug/bdi/252:16/wb_stats:
>> cat wb_stats
>> WbCgIno: 1
>> WbWriteback: 0 kB
>> WbReclaimable: 0 kB
>> WbDirtyThresh: 0 kB
>> WbDirtied: 0 kB
>> WbWritten: 0 kB
>> WbWriteBandwidth: 102400 kBps
>> b_dirty: 0
>> b_io: 0
>> b_more_io: 0
>> b_dirty_time: 0
>> state: 1
>> WbCgIno: 4284
>> WbWriteback: 448 kB
>> WbReclaimable: 818944 kB
>> WbDirtyThresh: 3096524 kB
>> WbDirtied: 2266880 kB
>> WbWritten: 1447936 kB
>> WbWriteBandwidth: 214036 kBps
>> b_dirty: 0
>> b_io: 0
>> b_more_io: 1
>> b_dirty_time: 0
>> state: 5
>> WbCgIno: 4325
>> WbWriteback: 224 kB
>> WbReclaimable: 819392 kB
>> WbDirtyThresh: 2920088 kB
>> WbDirtied: 2551808 kB
>> WbWritten: 1732416 kB
>> WbWriteBandwidth: 201832 kBps
>> b_dirty: 0
>> b_io: 0
>> b_more_io: 1
>> b_dirty_time: 0
>> state: 5
>>
>> /* monitor writeback info */
>> # ./wb_monitor.py 252:16 -c
>> writeback reclaimable dirtied written avg_bw
>> 252:16_1 0 0 0 0 102400
>> 252:16_4284 672 820064 9230368 8410304 685612
>> 252:16_4325 896 819840 10491264 9671648 652348
>> 252:16 1568 1639904 19721632 18081952 1440360
>>
>>
>> writeback reclaimable dirtied written avg_bw
>> 252:16_1 0 0 0 0 102400
>> 252:16_4284 672 820064 9230368 8410304 685612
>> 252:16_4325 896 819840 10491264 9671648 652348
>> 252:16 1568 1639904 19721632 18081952 1440360
>> ...
>>
>> Kemeng Shi (6):
>> writeback: protect race between bdi release and bdi_debug_stats_show
>> writeback: collect stats of all wb of bdi in bdi_debug_stats_show
>> writeback: support retrieving per group debug writeback stats of bdi
>> writeback: add wb_monitor.py script to monitor writeback info on bdi
>> writeback: rename nr_reclaimable to nr_dirty in balance_dirty_pages
>> writeback: define GDTC_INIT_NO_WB to null
>>
>> include/linux/writeback.h | 1 +
>> mm/backing-dev.c | 203 ++++++++++++++++++++++++++++++----
>> mm/page-writeback.c | 31 ++++--
>> tools/writeback/wb_monitor.py | 172 ++++++++++++++++++++++++++++
>> 4 files changed, 378 insertions(+), 29 deletions(-)
>> create mode 100644 tools/writeback/wb_monitor.py
>>
>> --
>> 2.30.0
>>
>


2024-04-03 16:37:30

by Jan Kara

Subject: Re: [PATCH v2 0/6] Improve visibility of writeback

On Thu 28-03-24 10:46:33, Tejun Heo wrote:
> Hello, Kent.
>
> On Thu, Mar 28, 2024 at 04:22:13PM -0400, Kent Overstreet wrote:
> > Most users are never going to touch tracing, let alone BPF; that's too
> > much setup. But I can and do regularly tell users "check this, this and
> > this" and debug things on that basis without ever touching their
> > machine.
>
> I think this is where the disconnect is. It's not difficult to set up at
> all. Nowadays, in most distros, it comes down to something like run "pacman
> -S bcc" and then run "/usr/share/bcc/tools/biolatpcts" with these params or
> run this script I'm attaching. It is a signficant boost when debugging many
> different kernel issues. I strongly suggest giving it a try and getting used
> to it rather than resisting it.

Yeah, BPF is great and I use it but to fill in some cases from practice,
there are sysadmins refusing to install bcc or run your BPF scripts on
their systems due to company regulations, their personal fear, or whatever.
So debugging with what you can achieve from a shell is still the thing
quite often.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2024-04-03 18:45:11

by Tejun Heo

Subject: Re: [PATCH v2 0/6] Improve visibility of writeback

Hello,

On Wed, Apr 03, 2024 at 06:27:16PM +0200, Jan Kara wrote:
> Yeah, BPF is great and I use it but to fill in some cases from practice,
> there are sysadmins refusing to install bcc or run your BPF scripts on
> their systems due to company regulations, their personal fear, or whatever.
> So debugging with what you can achieve from a shell is still the thing
> quite often.

Yeah, I mean, this happens with anything new. Tracing itself took quite a
while to be adopted widely. BPF, bcc, bpftrace are all still pretty new and
it's likely that the adoption line will keep shifting for quite a while.
Besides, even with all the new gizmos there definitely are cases where good
ol' cat interface makes sense.

So, if the static interface makes sense, we add it but we should keep in
mind that the trade-offs for adding such static infrastructure, especially
for the ones which aren't *widely* useful, are rather quickly shifting in
the less favorable direction.

Thanks.

--
tejun

2024-04-03 19:21:53

by Tejun Heo

Subject: Re: [PATCH v2 0/6] Improve visibility of writeback

Hello,

On Wed, Apr 03, 2024 at 03:06:56PM -0400, Kent Overstreet wrote:
..
> That's how it should be if you just make a point of making your internal
> state easy to view and introspect, but when I'm debugging issues that
> run into the wider block layer, or memory reclaim, we often hit a wall.

Try drgn:

https://drgn.readthedocs.io/en/latest/

I've been adding drgn scripts under tools/ directory for introspection.
They're easy to write, deploy and ask users to run.
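
For instance, a minimal interactive sketch in the same vein as wb_monitor.py
(it only touches the symbols that script already uses) to dump each bdi's
average write bandwidth on a live kernel:

# drgn
>>> from drgn.helpers.linux.list import list_for_each_entry
>>> K = lambda x: int(x << (prog['PAGE_SHIFT'] - 10))   # same conversion as wb_monitor.py's K()
>>> for bdi in list_for_each_entry('struct backing_dev_info',
...                                prog['bdi_list'].address_of_(), 'bdi_list'):
...     print(bdi.dev_name.string_().decode(), K(bdi.wb.avg_write_bandwidth), 'kBps')
...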

> Writeback throttling was buggy for _months_, no visibility or
> introspection or concerns for debugging, and that's a small chunk of
> code. io_uring - had to disable it. I _still_ have people bringing
> issues to me that are clearly memory reclaim related but I don't have
> the tools.
>
> It's not like any of this code exports much in the way of useful
> tracepoints either, but tracepoints often just aren't what you want;
> what you want just to be able to see internal state (_without_ having to
> use a debugger, because that's completely impractical outside highly
> controlled environments) - and tracing is also never the first thing you
> want to reach for when you have a user asking you "hey, this thing went
> wonky, what's it doing?" - tracing automatically turns it into a multi
> step process of decide what you want to look at, run the workload more
> to collect data, iterate.
>
> Think more about "what would make code easier to debug" and less about
> "how do I shove this round peg through the square tracing/BPF slot".
> There's _way_ more we could be doing that would just make our lives
> easier.

Maybe it'd help to classify visibility into the following categories:

1. Current state introspection.
2. Dynamic behavior tracing.
3. Accumulative behavior profiling.

drgn is great for #1. Tracing and BPF stuff is great for #2 especially when
things get complicated. #3 is the trickiest. Static stuff is useful in a lot
of cases but BPF can also be useful in other cases.

I agree that it's all about using the right tool for the problem.

--
tejun

2024-04-03 19:33:38

by Kent Overstreet

Subject: Re: [PATCH v2 0/6] Improve visibility of writeback

On Wed, Apr 03, 2024 at 08:44:05AM -1000, Tejun Heo wrote:
> Hello,
>
> On Wed, Apr 03, 2024 at 06:27:16PM +0200, Jan Kara wrote:
> > Yeah, BPF is great and I use it but to fill in some cases from practice,
> > there are sysadmins refusing to install bcc or run your BPF scripts on
> > their systems due to company regulations, their personal fear, or whatever.
> > So debugging with what you can achieve from a shell is still the thing
> > quite often.
>
> Yeah, I mean, this happens with anything new. Tracing itself took quite a
> while to be adopted widely. BPF, bcc, bpftrace are all still pretty new and
> it's likely that the adoption line will keep shifting for quite a while.
> Besides, even with all the new gizmos there definitely are cases where good
> ol' cat interface makes sense.
>
> So, if the static interface makes sense, we add it but we should keep in
> mind that the trade-offs for adding such static infrastructure, especially
> for the ones which aren't *widely* useful, are rather quickly shfiting in
> the less favorable direction.

A lot of our static debug infrastructure isn't that useful because it
just sucks.

Every time I hit a sysfs or procfs file that's just a single integer,
and nothing else, when clearly there's internal structure and
description that needs to be there I die a little inside. It's lazy and
amateurish.

I regularly debug things in bcachefs over IRC in about 5-10 minutes of
asking to check various files and pastebin them - this is my normal
process, I pretty much never have to ssh and touch the actual machines.

That's how it should be if you just make a point of making your internal
state easy to view and introspect, but when I'm debugging issues that
run into the wider block layer, or memory reclaim, we often hit a wall.

Writeback throttling was buggy for _months_, no visibility or
introspection or concerns for debugging, and that's a small chunk of
code. io_uring - had to disable it. I _still_ have people bringing
issues to me that are clearly memory reclaim related but I don't have
the tools.

It's not like any of this code exports much in the way of useful
tracepoints either, but tracepoints often just aren't what you want;
what you want is just to be able to see internal state (_without_ having to
use a debugger, because that's completely impractical outside highly
controlled environments) - and tracing is also never the first thing you
want to reach for when you have a user asking you "hey, this thing went
wonky, what's it doing?" - tracing automatically turns it into a multi
step process of decide what you want to look at, run the workload more
to collect data, iterate.

Think more about "what would make code easier to debug" and less about
"how do I shove this round peg through the square tracing/BPF slot".
There's _way_ more we could be doing that would just make our lives
easier.

2024-04-03 22:25:24

by Kent Overstreet

Subject: Re: [PATCH v2 0/6] Improve visibility of writeback

On Wed, Apr 03, 2024 at 09:21:37AM -1000, Tejun Heo wrote:
> Hello,
>
> On Wed, Apr 03, 2024 at 03:06:56PM -0400, Kent Overstreet wrote:
> ...
> > That's how it should be if you just make a point of making your internal
> > state easy to view and introspect, but when I'm debugging issues that
> > run into the wider block layer, or memory reclaim, we often hit a wall.
>
> Try drgn:
>
> https://drgn.readthedocs.io/en/latest/
>
> I've been adding drgn scripts under tools/ directory for introspection.
> They're easy to write, deploy and ask users to run.

Which is still inferior to simply writing to_text() functions for all
your objects and exposing them under sysfs/debugfs.

Plus, it's a whole new language/system for both devs and users to
learn.

And having to_text() functions makes your log and error messages way
better.

"But what about code size/overhead?" - bullshit, we're talking about a
couple percent of .text for the code itself; we blow more memory on
permanent dentries/inodes due to the way our virtual filesystems work
but that's more of a problem for tracefs.

> > Writeback throttling was buggy for _months_, no visibility or
> > introspection or concerns for debugging, and that's a small chunk of
> > code. io_uring - had to disable it. I _still_ have people bringing
> > issues to me that are clearly memory reclaim related but I don't have
> > the tools.
> >
> > It's not like any of this code exports much in the way of useful
> > tracepoints either, but tracepoints often just aren't what you want;
> > what you want just to be able to see internal state (_without_ having to
> > use a debugger, because that's completely impractical outside highly
> > controlled environments) - and tracing is also never the first thing you
> > want to reach for when you have a user asking you "hey, this thing went
> > wonky, what's it doing?" - tracing automatically turns it into a multi
> > step process of decide what you want to look at, run the workload more
> > to collect data, iterate.
> >
> > Think more about "what would make code easier to debug" and less about
> > "how do I shove this round peg through the square tracing/BPF slot".
> > There's _way_ more we could be doing that would just make our lives
> > easier.
>
> Maybe it'd help classifying visibility into the the following categories:
>
> 1. Current state introspection.
> 2. Dynamic behavior tracing.
> 3. Accumluative behavior profiling.
>
> drgn is great for #1. Tracing and BPF stuff is great for #2 especially when
> things get complicated. #3 is the trickest. Static stuff is useful in a lot
> of cases but BPF can also be useful in other cases.
>
> I agree that it's all about using the right tool for the problem.

Yeah, and you guys are all about the nerdiest and most overengineered
tools and ignoring the basics. Get the simple stuff right, /then/ if
there's stuff you still can't do, that's when you start looking at the
more complicated stuff.