Hello,
I'm using BPF to do NFS operation accounting for user-space processes. I'd like
to include the number of bytes read from and written to each file a process
opens over NFS.
For write operations, I'm currently using an fexit probe on the
nfs_writeback_done function, and my program appears to be getting the
information I'm hoping for. But I can see that under some circumstances the
actual operations are being done by kworker threads, and so the PID reported by
the BPF program is for that kworker instead of the user-space process that
requested the write.
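For reference, here is a trimmed-down sketch of what the probe amounts to (not
the exact code; it assumes the current nfs_writeback_done() argument list and
that hdr->res.count holds the byte count the server acknowledged):
/* Simplified sketch of the fexit probe described above.  Assumptions:
 * nfs_writeback_done(struct rpc_task *, struct nfs_pgio_header *,
 * struct inode *), and hdr->res.count is the acknowledged byte count. */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_core_read.h>

char LICENSE[] SEC("license") = "GPL";

SEC("fexit/nfs_writeback_done")
int BPF_PROG(writeback_done, struct rpc_task *task,
             struct nfs_pgio_header *hdr, struct inode *inode)
{
        char comm[16];
        u32 pid = bpf_get_current_pid_tgid() >> 32;
        u64 count = BPF_CORE_READ(hdr, res.count);

        bpf_get_current_comm(&comm, sizeof(comm));
        /* Under asynchronous writeback this reports a kworker,
         * not the process that issued the write. */
        bpf_printk("nfs writeback: comm=%s pid=%u bytes=%llu",
                   comm, pid, count);
        return 0;
}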
Is there a more appropriate function to probe for this information if I only
want it triggered in the context of the user-space process that performed the
write? If not, I'm wondering whether a probe triggered in the kworker context
has enough information to track down the user-space PID that initiated the
writes.
I didn't find anything related in the kernel's Documentation directory, and I'm
not yet proficient enough with the vfs, nfs, and sunrpc code to find an
appropriate function myself.
If it matters, our infrastructure is all based on NFSv3.
Thanks for any leads or documentation pointers!
Lars
On a related note, I have always wondered whether there would be any interest
in having something like /proc/PID/io just for tracking NFS client throughput.
The problem is that if a process copies a file from NFS to a local filesystem,
those counters give you no way to tell whether it did an NFS read/write (or
any NFS IO at all).
It is useful to track per-PID network IO, and things like cgroups (v1) do not
provide an easy way to do that. In our case, 99.9% of all the network IO a
render blade does is NFS client traffic.
To your question: I can't say what the BPF equivalent is, but we used
systemtap to track per-process and per-file IO on each render node. However,
again, we are only interested in IO that results in actual network packets, so
we had to take reads served from the page cache into account.
We did it by watching vfs.add_to_page_cache and naively assuming every hit
resulted in 4k of NFS reads over the network: if a page is only now being
added to the page cache, its data must have come over the network. The
aggregate from all clients matched the network traffic on our NFS servers
pretty well, so this approach worked for us. We could track all client file IO
and correlate it with what the server was doing over the network.
The systemtap code was something like the following, where files were tracked
via nfs.fop.open:
probe nfs.fop.open {
    pid = pid()
    filename = sprintf("%s", d_path(&$filp->f_path))
    if (filename =~ "/net/.*/data") {
        files[pid, ino] = filename
        if (!([pid, ino] in procinfo))
            procinfo[pid, ino] = sprintf("%s", proc())
    }
}
probe vfs.add_to_page_cache {
    pid = pid()
    if ([pid, ino] in files) {
        readpage[pid, ino] += 4096
        files_store[pid, ino] = sprintf("%s", files[pid, ino])
    }
}
But I should say that this no longer works on newer kernels since the addition
of folios, and I have not figured out a better way to track NFS client reads
while excluding page-cache hits.
For the writes I was just using vfs.write and vfs.writev; I was not too
concerned about writeback delays.
probe vfs.write {
    pid = pid()
    if ([pid, ino] in files) {
        write[pid, ino] += bytes_to_write
        files_store[pid, ino] = sprintf("%s", files[pid, ino])
    }
}
I hope that helps. Being from the same industry, we obviously have
similar requirements... ;)
Daire
Hi Lars,
I tend to use the nfs:nfs_writeback_done and nfs:nfs_commit_done
tracepoints.
We make no attempt to track the PID that initiated the writes, because it is
often impossible to do so: for instance, the file may have been mmapped, or
multiple processes owned by the same user may be writing to the same page.
If you want to track I/O at that level, I suggest instead tracing the
sys_write()/sys_writev()/... system calls, since those are called from the
user's context.
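As a rough, untested sketch of that idea: rather than hooking each write
syscall separately, one option is an fexit probe on nfs_file_write() (the NFS
->write_iter implementation), which runs in the writing process's context and
only ever sees NFS files; its return value is the number of bytes accepted:
/* Untested sketch: per-process NFS write accounting in the caller's
 * context.  Hooks nfs_file_write() instead of the individual
 * write/writev syscalls; writes through mmap are still not attributed. */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char LICENSE[] SEC("license") = "GPL";

/* bytes written over NFS, keyed by tgid */
struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 16384);
        __type(key, u32);
        __type(value, u64);
} nfs_write_bytes SEC(".maps");

SEC("fexit/nfs_file_write")
int BPF_PROG(nfs_file_write_exit, struct kiocb *iocb, struct iov_iter *from,
             long ret)
{
        u32 tgid = bpf_get_current_pid_tgid() >> 32;
        u64 *val, zero = 0;

        if (ret <= 0)           /* error, or nothing written */
                return 0;

        val = bpf_map_lookup_elem(&nfs_write_bytes, &tgid);
        if (!val) {
                bpf_map_update_elem(&nfs_write_bytes, &tgid, &zero, BPF_NOEXIST);
                val = bpf_map_lookup_elem(&nfs_write_bytes, &tgid);
                if (!val)
                        return 0;
        }
        __sync_fetch_and_add(val, (u64)ret);
        return 0;
}
Reads could be counted the same way via nfs_file_read(); whether and when the
data actually hits the wire is a separate question (page cache, writeback),
which is what the nfs:* tracepoints above tell you.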
Cheers
Trond
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]