Date: Tue, 28 May 2024 15:44:44 +0200
From: Mateusz Guzik
To: Hugh Dickins
Cc: Andrew Morton, Tim Chen, Dave Chinner, "Darrick J. Wong",
 Christian Brauner, Carlos Maiolino, Chuck Lever, Jan Kara,
 Matthew Wilcox, Johannes Weiner, Axel Rasmussen,
 linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
 linux-mm@kvack.org
Subject: Re: [PATCH 8/8] shmem,percpu_counter: add _limited_add(fbc, limit, amount)
Message-ID: <5eemkb4lo5eefp7ijgncgogwmadyzmvjfjmmmvfiki6cwdskfs@hi2z4drqeuz6>

On Sat, May 25, 2024 at 08:00:15AM +0200, Mateusz Guzik wrote:
> On Fri, Sep 29, 2023 at 08:42:45PM -0700, Hugh Dickins wrote:
> > Percpu counter's compare and add are separate functions: without locking
> > around them (which would defeat their purpose), it has been possible to
> > overflow the intended limit. Imagine all the other CPUs fallocating
> > tmpfs huge pages to the limit, in between this CPU's compare and its add.
> >
> > I have not seen reports of that happening; but tmpfs's recent addition
> > of dquot_alloc_block_nodirty() in between the compare and the add makes
> > it even more likely, and I'd be uncomfortable to leave it unfixed.
> >
> > Introduce percpu_counter_limited_add(fbc, limit, amount) to prevent it.
> >
>
> I think the posted (and by now landed) code is racy.
>
> I had seen there was a follow-up patch which further augmented the
> routine, but it did not alter the issue below, so I'm responding to this
> thread.
>
> > +/*
> > + * Compare counter, and add amount if the total is within limit.
> > + * Return true if amount was added, false if it would exceed limit.
> > + */
> > +bool __percpu_counter_limited_add(struct percpu_counter *fbc,
> > +				  s64 limit, s64 amount, s32 batch)
> > +{
> > +	s64 count;
> > +	s64 unknown;
> > +	unsigned long flags;
> > +	bool good;
> > +
> > +	if (amount > limit)
> > +		return false;
> > +
> > +	local_irq_save(flags);
> > +	unknown = batch * num_online_cpus();
> > +	count = __this_cpu_read(*fbc->counters);
> > +
> > +	/* Skip taking the lock when safe */
> > +	if (abs(count + amount) <= batch &&
> > +	    fbc->count + unknown <= limit) {
> > +		this_cpu_add(*fbc->counters, amount);
> > +		local_irq_restore(flags);
> > +		return true;
> > +	}
> > +
>
> Note the value of fbc->count is *not* stabilized.
>
> > +	raw_spin_lock(&fbc->lock);
> > +	count = fbc->count + amount;
> > +
> > +	/* Skip percpu_counter_sum() when safe */
> > +	if (count + unknown > limit) {
> > +		s32 *pcount;
> > +		int cpu;
> > +
> > +		for_each_cpu_or(cpu, cpu_online_mask, cpu_dying_mask) {
> > +			pcount = per_cpu_ptr(fbc->counters, cpu);
> > +			count += *pcount;
> > +		}
> > +	}
> > +
> > +	good = count <= limit;
> > +	if (good) {
> > +		count = __this_cpu_read(*fbc->counters);
> > +		fbc->count += count + amount;
> > +		__this_cpu_sub(*fbc->counters, count);
> > +	}
> > +
> > +	raw_spin_unlock(&fbc->lock);
> > +	local_irq_restore(flags);
> > +	return good;
> > +}
> > +
>
> Consider two CPUs executing the function at the same time, where one is
> updating fbc->count in the slow path while the other is testing it in
> the fast path.
>
> cpu0						cpu1
> 						if (abs(count + amount) <= batch &&
> 						    fbc->count + unknown <= limit)
> /* gets past the per-cpu traversal */
> /* at this point cpu0 decided to bump fbc->count, while cpu1 decided to
>  * bump the local counter */
> 						this_cpu_add(*fbc->counters, amount);
> fbc->count += count + amount;
>
> Suppose the fbc->count update puts it close enough to the limit that an
> addition from cpu1 would put the entire thing over said limit.
>
> Since the two operations are not synchronized, cpu1 can observe
> fbc->count prior to the bump and update its local counter, resulting in
> the aforementioned overflow.
>
> Am I misreading something here? Is this prevented?
>
> To my grep the only user is the quota support in shmem, and I wonder if
> that use is even justified. I am aware of the performance realities of
> atomics. However, it may well be that whatever codepaths exercise the
> counter are suffering multicore issues elsewhere, making an atomic (in a
> dedicated cacheline) a perfectly sane choice for the foreseeable future.
> Should this be true, there would be no rush to work out a fixed variant
> of the routine.

So I tried it out and I don't believe a per-cpu counter for the memory
usage of shmem is warranted at this time.

In a setting where usage keeps changing there is a massive bottleneck
around folio_lruvec_lock. The rare case where a bunch of memory is
populated one-off on a tmpfs mount can probably suffer the atomic; usage
can only go so far up before you are out of memory.
For the value to keep changing, some of the pages will have to be
released, and this is what I'm testing below by slapping ftruncate onto
a test doing 1MB file writes in a loop (in will-it-scale):

diff --git a/tests/write1.c b/tests/write1.c
index d860904..790ddb3 100644
--- a/tests/write1.c
+++ b/tests/write1.c
@@ -28,6 +28,7 @@ void testcase(unsigned long long *iterations, unsigned long nr)
 			size = 0;
 			lseek(fd, 0, SEEK_SET);
 		}
+		ftruncate(fd, 0);
 		(*iterations)++;
 	}

That is, this allocates pages for the 1MB file size and releases them
continuously.

When mounting tmpfs on /tmp and benching with "./write1_processes -t 20"
(20 workers) I see almost all of the time spent spinning in
__pv_queued_spin_lock_slowpath.

As such I don't believe the per-cpu split buys anything in terms of
scalability, and as I explained in the previous mail I think the routine
used here is buggy, while shmem is the only user.

Should this switch to a centralized atomic (either a cmpxchg loop or
xadd + backpedal; see the sketch at the end of this mail), there should
be no loss in scalability given the lruvec problem. Then the routine
could be commented out or whacked for the time being.

backtraces for the interested:
bpftrace -e 'kprobe:__pv_queued_spin_lock_slowpath { @[kstack()] = count(); }'

@[
    __pv_queued_spin_lock_slowpath+5
    _raw_spin_lock_irqsave+61
    folio_lruvec_lock_irqsave+95
    __page_cache_release+130
    folios_put_refs+139
    shmem_undo_range+702
    shmem_setattr+983
    notify_change+556
    do_truncate+131
    do_ftruncate+254
    __x64_sys_ftruncate+62
    do_syscall_64+87
    entry_SYSCALL_64_after_hwframe+118
]: 4750931
@[
    __pv_queued_spin_lock_slowpath+5
    _raw_spin_lock_irqsave+61
    folio_lruvec_lock_irqsave+95
    folio_batch_move_lru+139
    lru_add_drain_cpu+141
    __folio_batch_release+49
    shmem_undo_range+702
    shmem_setattr+983
    notify_change+556
    do_truncate+131
    do_ftruncate+254
    __x64_sys_ftruncate+62
    do_syscall_64+87
    entry_SYSCALL_64_after_hwframe+118
]: 4761483
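If it helps, here is a minimal, untested sketch of what I mean by the
cmpxchg loop (the helper name and the bare atomic64_t usage counter are
made up for illustration; the xadd + backpedal variant would instead add
unconditionally and subtract back on failure):

#include <linux/atomic.h>

/*
 * Illustrative only: the limit check and the add form a single atomic
 * transaction, so the compare-vs-add race described above cannot happen.
 */
static inline bool limited_add(atomic64_t *counter, s64 limit, s64 amount)
{
	s64 old = atomic64_read(counter);

	do {
		if (old + amount > limit)
			return false;
		/*
		 * On failure atomic64_try_cmpxchg() updates "old" to the
		 * current value, so the limit gets re-checked against
		 * whatever the other CPUs did in the meantime.
		 */
	} while (!atomic64_try_cmpxchg(counter, &old, old + amount));

	return true;
}

Annotating the counter ____cacheline_aligned should take care of false
sharing, and given that the real bottleneck is the lruvec lock I would
not expect this to show up in profiles anyway.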