Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18;
MIME-Version: 1.0
References: <20200724071959.3110510-1-irogers@google.com> <20200724144503.GD1180481@tassilo.jf.intel.com>
In-Reply-To: <20200724144503.GD1180481@tassilo.jf.intel.com>
From:   Ian Rogers <irogers@google.com>
Date:   Fri, 24 Jul 2020 11:13:29 -0700
Message-ID: <CAP-5=fW7FZXQ=KQsPCbO1LNnCPH8uvDisdi-kzV=vop3P7=duw@mail.gmail.com>
Subject: Re: [PATCH] perf bench: Add benchmark of find_next_bit
To:     Andi Kleen <ak@linux.intel.com>
Cc:     Peter Zijlstra <peterz@infradead.org>,
        Ingo Molnar <mingo@redhat.com>,
        Arnaldo Carvalho de Melo <acme@kernel.org>,
        Mark Rutland <mark.rutland@arm.com>,
        Alexander Shishkin <alexander.shishkin@linux.intel.com>,
        Jiri Olsa <jolsa@redhat.com>,
        Namhyung Kim <namhyung@kernel.org>,
        Thomas Gleixner <tglx@linutronix.de>,
        LKML <linux-kernel@vger.kernel.org>,
        Stephane Eranian <eranian@google.com>
Content-Type: text/plain; charset="UTF-8"
Sender: linux-kernel-owner@vger.kernel.org
Precedence: bulk

On Fri, Jul 24, 2020 at 7:45 AM Andi Kleen <ak@linux.intel.com> wrote:
>
> On Fri, Jul 24, 2020 at 12:19:59AM -0700, Ian Rogers wrote:
> > for_each_set_bit, or similar functions like for_each_cpu, may be hot
> > within the kernel. If many bits were set then one could imagine on
> > Intel a "bt" instruction with every bit may be faster than the function
> > call and word length find_next_bit logic. Add a benchmark to measure
> > this.
>
> > This benchmark on AMD rome and Intel skylakex shows "bt" is not a good
> > option except for very small bitmaps.
>
> Small bitmaps is a common case in the kernel (e.g. cpu bitmaps)
>
> But the current code isn't that great for small bitmaps. It always looks horrific
> when I look at PT traces or brstackinsn, especially since it was optimized
> purely for code size at some point.
>
> Probably would be better to have different implementations for
> different sizes.

Thanks Andi!

what I was kind of expecting was "bt [mem],reg; jae" to have a fixed
cost and the find_next_bit to be a big win when the majority of bits
in a bitmask were 0 but to lose if the majority of bits in the bitmask
were 1. So the ratio of 1s and 0s was going to matter, not so much the
bitmap size. For bitmaps smaller than 32 this was pretty much the
case:

1000000 operations 1 bits set of 1 bits
  Average for_each_set_bit took: 8368.640 usec (+- 103.009 usec)
  Average test_bit loop took:    4052.360 usec (+- 15.381 usec)
1000000 operations 1 bits set of 2 bits
  Average for_each_set_bit took: 8503.770 usec (+- 4.385 usec)
  Average test_bit loop took:    6103.570 usec (+- 0.514 usec)
1000000 operations 2 bits set of 2 bits
  Average for_each_set_bit took: 12754.485 usec (+- 301.342 usec)
  Average test_bit loop took:    6882.950 usec (+- 55.254 usec)
1000000 operations 1 bits set of 4 bits
  Average for_each_set_bit took: 8517.670 usec (+- 5.712 usec)
  Average test_bit loop took:    10485.580 usec (+- 1.386 usec)
1000000 operations 2 bits set of 4 bits
  Average for_each_set_bit took: 12931.060 usec (+- 312.882 usec)
  Average test_bit loop took:    11093.980 usec (+- 43.149 usec)
1000000 operations 4 bits set of 4 bits
  Average for_each_set_bit took: 19664.760 usec (+- 588.841 usec)
  Average test_bit loop took:    12217.583 usec (+- 96.287 usec)
1000000 operations 1 bits set of 8 bits
  Average for_each_set_bit took: 8501.810 usec (+- 5.250 usec)
  Average test_bit loop took:    18924.550 usec (+- 2.999 usec)
1000000 operations 2 bits set of 8 bits
  Average for_each_set_bit took: 12760.025 usec (+- 301.881 usec)
  Average test_bit loop took:    19726.540 usec (+- 56.878 usec)
1000000 operations 4 bits set of 8 bits
  Average for_each_set_bit took: 19151.163 usec (+- 560.053 usec)
  Average test_bit loop took:    20817.317 usec (+- 96.923 usec)
1000000 operations 8 bits set of 8 bits
  Average for_each_set_bit took: 29770.865 usec (+- 1012.050 usec)
  Average test_bit loop took:    22677.150 usec (+- 176.883 usec)
1000000 operations 1 bits set of 16 bits
  Average for_each_set_bit took: 8534.640 usec (+- 4.920 usec)
  Average test_bit loop took:    36451.930 usec (+- 9.763 usec)
1000000 operations 2 bits set of 16 bits
  Average for_each_set_bit took: 12797.065 usec (+- 302.176 usec)
  Average test_bit loop took:    37178.395 usec (+- 51.756 usec)
1000000 operations 4 bits set of 16 bits
  Average for_each_set_bit took: 18736.880 usec (+- 525.846 usec)
  Average test_bit loop took:    38306.897 usec (+- 98.527 usec)
1000000 operations 8 bits set of 16 bits
  Average for_each_set_bit took: 29257.765 usec (+- 993.813 usec)
  Average test_bit loop took:    40038.798 usec (+- 167.358 usec)
1000000 operations 16 bits set of 16 bits
  Average for_each_set_bit took: 46784.984 usec (+- 1759.075 usec)
  Average test_bit loop took:    43014.266 usec (+- 298.138 usec)

 but for larger bitmaps, for example, 2048 bits where all are set,
even though the bt was a single instruction and find_next_bit was
going to have to do a lot of work to just return true (function call,
bunch of shifts) I got:

1000000 operations 2048 bits set of 2048 bits
  Average for_each_set_bit took: 2218881.255 usec (+- 106485.020 usec)
  Average test_bit loop took:    4934851.515 usec (+- 18463.511 usec)

And so other than say for 4-bit and smaller bitmaps the current
find_next_bit approach works best. I don't see an obvious bug in the
benchmark so I am surprised at how bad bt performs.

Thanks,
Ian

> -Andi