Date: Thu, 8 Apr 2021 12:48:49 -0700
From: Daniel Xu
To: Matthew Wilcox
Cc: bpf@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
    linux-kernel@vger.kernel.org, kernel-team@fb.com, jolsa@kernel.org,
    hannes@cmpxchg.org, yhs@fb.com
Subject: Re: [RFC bpf-next 1/1] bpf: Introduce iter_pagecache
Message-ID: <20210408194849.wmueo74qcxghhf2d@dlxu-fedora-R90QNFJV>
In-Reply-To: <20210408061401.GI2531743@casper.infradead.org>

On Thu, Apr 08, 2021 at 07:14:01AM +0100, Matthew Wilcox wrote:
> On Wed, Apr 07, 2021 at 02:46:11PM -0700, Daniel Xu wrote:
> > +struct bpf_iter_seq_pagecache_info {
> > +	struct mnt_namespace *ns;
> > +	struct radix_tree_root superblocks;
>
> Why are you adding a new radix tree? Use an XArray instead.

Ah right, sorry. Will do.

> > +static struct page *goto_next_page(struct bpf_iter_seq_pagecache_info *info)
> > +{
> > +	struct page *page, *ret = NULL;
> > +	unsigned long idx;
> > +
> > +	rcu_read_lock();
> > +retry:
> > +	BUG_ON(!info->cur_inode);
> > +	ret = NULL;
> > +	xa_for_each_start(&info->cur_inode->i_data.i_pages, idx, page,
> > +			  info->cur_page_idx) {
> > +		if (!page_cache_get_speculative(page))
> > +			continue;
>
> Why do you feel the need to poke around in i_pages directly? Is there
> something wrong with find_get_entries()?

No reason other than I didn't know about the latter. Thanks for the
hint. find_get_entries() seems to return a pagevec of entries which
would complicate the iteration (a 4th layer of things to iterate over).
But I did find find_get_pages_range() which I think can be used to find
1 page at a time. I'll look into it further.

> > +static int __pagecache_seq_show(struct seq_file *seq, struct page *page,
> > +				bool in_stop)
> > +{
> > +	struct bpf_iter_meta meta;
> > +	struct bpf_iter__pagecache ctx;
> > +	struct bpf_prog *prog;
> > +
> > +	meta.seq = seq;
> > +	prog = bpf_iter_get_info(&meta, in_stop);
> > +	if (!prog)
> > +		return 0;
> > +
> > +	meta.seq = seq;
> > +	ctx.meta = &meta;
> > +	ctx.page = page;
> > +	return bpf_iter_run_prog(prog, &ctx);
>
> I'm not really keen on the idea of random BPF programs being able to poke
> at pages in the page cache like this. From your initial description,
> it sounded like all you needed was a list of which pages are present.

Could you elaborate on what "list of which pages are present" implies?
The overall goal with this patch is to detect duplicate content in the
page cache. So anything that helps achieve that goal I would (in
theory) be OK with.

My understanding is the user would need to hash the contents of each
page in the page cache. And BPF provides the flexibility such that this
work could be reused for currently unanticipated use cases.

Furthermore, bpf programs could already look at all the pages in the
page cache by hooking into
tracepoint:filemap:mm_filemap_add_to_page_cache, albeit at a much
slower rate. I figure the downside of adding this page cache iterator
is we're explicitly condoning the behavior.
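Re find_get_pages_range() above, roughly what I have in mind is the
untested sketch below. It is not part of the patch; the struct and
field names (goto_next_page, bpf_iter_seq_pagecache_info, cur_inode,
cur_page_idx) are the ones the patch already uses, and my reading of
find_get_pages_range() -- that it takes a reference on the returned
page and advances the start index past it -- still needs verifying.

static struct page *goto_next_page(struct bpf_iter_seq_pagecache_info *info)
{
	struct address_space *mapping = info->cur_inode->i_mapping;
	struct page *page;
	unsigned int nr;

	/* Fetch a single page at or after cur_page_idx; the helper takes
	 * a reference on the returned page and advances cur_page_idx past
	 * it, so repeated calls walk the whole mapping.
	 */
	nr = find_get_pages_range(mapping, &info->cur_page_idx,
				  (pgoff_t)-1, 1, &page);
	if (!nr)
		return NULL;	/* no more pages in this inode */

	return page;
}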
> > +	INIT_RADIX_TREE(&info->superblocks, GFP_KERNEL);
> > +
> > +	spin_lock(&info->ns->ns_lock);
> > +	list_for_each_entry(mnt, &info->ns->list, mnt_list) {
> > +		sb = mnt->mnt.mnt_sb;
> > +
> > +		/* The same mount may be mounted in multiple places */
> > +		if (radix_tree_lookup(&info->superblocks, (unsigned long)sb))
> > +			continue;
> > +
> > +		err = radix_tree_insert(&info->superblocks,
> > +					(unsigned long)sb, (void *)1);
> > +		if (err)
> > +			goto out;
> > +	}
> > +
> > +	radix_tree_for_each_slot(slot, &info->superblocks, &iter, 0) {
> > +		sb = (struct super_block *)iter.index;
> > +		atomic_inc(&sb->s_active);
> > +	}
>
> Uh. What on earth made you think this was a good way to use the radix
> tree? And, no, the XArray doesn't change that.

The idea behind the radix tree was to deduplicate the mounts by
superblock, because a single filesystem may be mounted in different
locations. I didn't find a set data structure I could reuse, so I
figured a radix tree / xarray would work too. Happy to take any better
ideas. (A rough array-based alternative is sketched below, after the
sign-off.)

> If you don't understand why this is so bad, call xa_dump() on it after
> constructing it. I'll wait.

I did a dump and got the following results: http://ix.io/2VpY .

I received a hint that you may be referring to how the xarray / radix
tree would be as large as the largest pointer. To my uneducated eye it
doesn't look like that's the case in this dump. Could you please
clarify?

<...>

Thanks,
Daniel
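(Appendix: the array-based alternative mentioned above. This is an
untested sketch, not from the patch: it assumes the number of distinct
superblocks in a mount namespace is small enough for a linear scan, and
it reuses the fs/mount.h internals the iterator already touches. The
name collect_superblocks, the MAX_NS_SBS cap, and the error handling
are made up for the sketch.)

#define MAX_NS_SBS 256	/* arbitrary cap for the sketch */

static int collect_superblocks(struct mnt_namespace *ns,
			       struct super_block **sbs, unsigned int *nr)
{
	struct super_block *sb;
	struct mount *mnt;
	unsigned int i, n = 0;
	int err = 0;

	spin_lock(&ns->ns_lock);
	list_for_each_entry(mnt, &ns->list, mnt_list) {
		sb = mnt->mnt.mnt_sb;

		/* The same superblock may back several mounts; skip
		 * duplicates with a linear scan instead of keying a
		 * radix tree / XArray by pointer value.
		 */
		for (i = 0; i < n; i++)
			if (sbs[i] == sb)
				break;
		if (i < n)
			continue;

		if (n == MAX_NS_SBS) {
			err = -ENOSPC;
			break;
		}

		atomic_inc(&sb->s_active);	/* pin it, as the patch does */
		sbs[n++] = sb;
	}
	spin_unlock(&ns->ns_lock);

	/* *nr reflects how many superblocks were pinned, even on error,
	 * so the caller can drop the references on teardown.
	 */
	*nr = n;
	return err;
}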