Subject: Re: [PATCH 11/15] iio: buffer-dma: Boost performance using
 write-combine cache setting
To:     Jonathan Cameron <jic23@kernel.org>,
        Paul Cercueil <paul@crapouillou.net>
Cc:     Alexandru Ardelean <ardeleanalex@gmail.com>,
        Michael Hennerich <Michael.Hennerich@analog.com>,
        Sumit Semwal <sumit.semwal@linaro.org>,
        =?UTF-8?Q?Christian_K=c3=b6nig?= <christian.koenig@amd.com>,
        linux-iio@vger.kernel.org, linux-kernel@vger.kernel.org,
        linux-media@vger.kernel.org, dri-devel@lists.freedesktop.org,
        linaro-mm-sig@lists.linaro.org
References: <20211115141925.60164-1-paul@crapouillou.net>
 <20211115141925.60164-12-paul@crapouillou.net>
 <20211121150037.2a606be0@jic23-huawei> <8WNX2R.M4XE9MQC24W22@crapouillou.net>
 <YX153R.0PENWW3ING7F1@crapouillou.net> <20211127160533.5259f486@jic23-huawei>
From:   Lars-Peter Clausen <lars@metafoo.de>
Message-ID: <e46c6f2d-26d7-64d2-ebfc-ff6dc78aad2f@metafoo.de>
Date:   Sun, 28 Nov 2021 14:25:55 +0100
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101
 Thunderbird/78.14.0
MIME-Version: 1.0
In-Reply-To: <20211127160533.5259f486@jic23-huawei>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Content-Language: en-US
Precedence: bulk

On 11/27/21 5:05 PM, Jonathan Cameron wrote:
>> Non-coherent mapping with no cache sync:
>> - fileio:
>>      read:	156 MiB/s
>>      write:	123 MiB/s
>> - dmabuf:
>>      read:	234 MiB/s (capped by sample rate)
>>      write:	182 MiB/s
>>
>> Non-coherent reads with no cache sync + write-combine writes:
>> - fileio:
>>      read:	156 MiB/s
>>      write:	140 MiB/s
>> - dmabuf:
>>      read:	234 MiB/s (capped by sample rate)
>>      write:	210 MiB/s
>>
>>
>> A few things we can deduce from this:
>>
>> * Write-combine is not available on Zynq/ARM? If it was working, it
>> should give a better performance than the coherent mapping, but it
>> doesn't seem to do anything at all. At least it doesn't harm
>> performance.
> I'm not sure it's very relevant to this sort of streaming write.
> If you write a sequence of addresses then nothing stops them getting combined
> into a single write whether or not it is write-combining.

There is a difference at which point they can get combined. With 
write-combine they can be coalesced into a single transaction anywhere 
in the interconnect, as early as the CPU itself. Without write-cobmine 
the DDR controller might decide to combine them, but not earlier. This 
can make a difference especially if the write is a narrow write, i.e. 
the access size is smaller than the buswidth.

Lets say you do 32-bit writes, but your bus is 64 bits wide. With WC two 
32-bits can be combined into a 64-bit write. Without WC that is not 
possible and you are potentially not using the bus to its fullest 
capacity. This is especially true if the memory bus is wider than the 
widest access size of the CPU.