MIME-Version: 1.0
References: <20211125193852.3617-1-goldstein.w.n@gmail.com>
In-Reply-To: <20211125193852.3617-1-goldstein.w.n@gmail.com>
From:   Eric Dumazet <edumazet@google.com>
Date:   Thu, 25 Nov 2021 17:50:31 -0800
Message-ID: <CANn89iLnH5B11CtzZ14nMFP7b--7aOfnQqgmsER+NYNzvnVurQ@mail.gmail.com>
Subject: Re: [PATCH v1] x86/lib: Optimize 8x loop and memory clobbers in csum_partial.c
To:     Noah Goldstein <goldstein.w.n@gmail.com>
Cc:     tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
        dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com,
        peterz@infradead.org, alexanderduyck@fb.com,
        linux-kernel@vger.kernel.org
Content-Type: text/plain; charset="UTF-8"
Precedence: bulk

On Thu, Nov 25, 2021 at 11:38 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> Modify the 8x loop so that it uses two independent
> accumulators. Despite adding more instructions the latency and
> throughput of the loop are improved because the `adc` chains can now
> take advantage of multiple execution units.

Nice!

Note that I get better results if I do a different split, because the
second chain gets shorter.

The first chain covers 5*8 bytes from the buffer, but the first 8 bytes
are a mere load, so that is really 4+1 additions (four from memory plus
the final adcq $0).

The second chain adds 3*8 bytes from the buffer plus the result coming
from the first chain, so it is also 4+1 additions.

asm("movq 0*8(%[src]),%[res_tmp]\n\t"
    "addq 1*8(%[src]),%[res_tmp]\n\t"
    "adcq 2*8(%[src]),%[res_tmp]\n\t"
    "adcq 3*8(%[src]),%[res_tmp]\n\t"
    "adcq 4*8(%[src]),%[res_tmp]\n\t"
    "adcq $0,%[res_tmp]\n\t"
    "addq 5*8(%[src]),%[res]\n\t"
    "adcq 6*8(%[src]),%[res]\n\t"
    "adcq 7*8(%[src]),%[res]\n\t"
    "adcq %[res_tmp],%[res]\n\t"
    "adcq $0,%[res]"
    : [res] "+r" (temp64), [res_tmp] "=&r"(temp_accum)
    : [src] "r" (buff)
    : "memory");
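
To see why any split of the eight words gives the same sum, here is a
small portable C model. It is only a sketch, not the kernel code: the
names add64_carry, load64, csum64_one_chain, csum64_split_5_3 and
csum64_selfcheck are made up for illustration, and the model folds the
carry back in after every addition instead of carrying it through an adc
chain. That yields the same 64-bit result, but it says nothing about
latency or throughput.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Add with end-around carry: models "add" followed by "adc $0". */
static uint64_t add64_carry(uint64_t a, uint64_t b)
{
	uint64_t sum = a + b;

	return sum + (sum < a);	/* fold the carry-out back in */
}

/* Load one 64-bit word; memcpy avoids alignment and aliasing issues. */
static uint64_t load64(const unsigned char *p)
{
	uint64_t v;

	memcpy(&v, p, sizeof(v));
	return v;
}

/* Reference: a single serial dependency chain over all 8 words. */
static uint64_t csum64_one_chain(const unsigned char *buff, uint64_t sum)
{
	for (int i = 0; i < 8; i++)
		sum = add64_carry(sum, load64(buff + 8 * i));
	return sum;
}

/* 5+3 split: words 0..4 go to a temporary, words 5..7 to the running sum. */
static uint64_t csum64_split_5_3(const unsigned char *buff, uint64_t sum)
{
	uint64_t tmp = load64(buff);		/* plain load, no addition */

	for (int i = 1; i < 5; i++)		/* 4 additions */
		tmp = add64_carry(tmp, load64(buff + 8 * i));
	for (int i = 5; i < 8; i++)		/* 3 additions */
		sum = add64_carry(sum, load64(buff + 8 * i));
	return add64_carry(sum, tmp);		/* merge the two chains */
}

/* Hypothetical self-check: both orderings must agree on a 64-byte block. */
static void csum64_selfcheck(void)
{
	unsigned char buf[64];

	memset(buf, 0xff, sizeof(buf));		/* worst case for carries */
	assert(csum64_split_5_3(buf, 1) == csum64_one_chain(buf, 1));
}
```

End-around-carry addition is commutative and associative, so how the
words are distributed between the two chains only changes the length of
each dependency chain, not the value.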


>
> Make the memory clobbers more precise. 'buff' is read only and we know
> the exact usage range. There is no reason to write-clobber all memory.

Not sure that matters in this function. Or do we expect it to be inlined?

Personally, I find the "memory" clobber more readable than these casts:
"m"(*(const char(*)[64])buff));

>
> Relative performance changes on Tigerlake:
>
> Time Unit: Ref Cycles
> Size Unit: Bytes
>
> size,   lat old,    lat new,    tput old,   tput new
>    0,     4.972,      5.054,       4.864,      4.870

What really matters in modern networking is the 40-byte case, and
possibly the 8-byte case.

Can you add these two cases to this nice table?

We hardly have to checksum anything with NICs that are not decades old.

Apparently making the 64-byte loop slightly longer incentivizes gcc to
move it out of the hot path (which was our intent with the unlikely() hint).

Anyway, I am thinking of providing a specialized inline version for
IPv6 header checksums (40 + x*8 bytes, x being 0 pretty much all the
time), so we will likely not use csum_partial() anymore.

Thanks!