Hi Dmitry, Bogdan,
----- Original Message -----
From: "Dmitry Stogov"
Sent: Thursday, July 30, 2015
> Hi Bogdan,
>
> On Wed, Jul 29, 2015 at 5:22 PM, Andone, Bogdan <bogdan.andone@intel.com>
> wrote:
>
>> Hi Guys,
>>
>> My name is Bogdan Andone and I work for Intel in the area of SW
>> performance analysis and optimizations.
>> We would like to actively contribute to the Zend PHP project and to
>> involve ourselves in finding new performance improvement opportunities
>> based on available and/or new hardware features.
>> I am still in the source code digesting phase but I had a look at the
>> fast_memcpy() implementation in the opcache extension, which uses SSE
>> intrinsics:
>>
>> If I am not wrong, the fast_memcpy() function is not currently used, as
>> I didn't find the "-msse4.2" gcc flag in the Makefile. I assume you
>> probably didn't see any performance benefit, so you preserved generic
>> memcpy() usage.
>>
>
> This is not SSE4.2; this is SSE2.
> Any x86_64 target implements SSE2, so it's enabled by default on x86_64
> systems (at least on Linux).
> It may also be enabled on x86 targets by adding the "-msse2" option.
Right, I was going to say that I think that was a mistake, and all x86_64
builds should be using it at least.
Of course, using anything newer that needs special compiler options is
nearly useless, since I guess the vast majority of users aren't building
PHP themselves but are using lowest-common-denominator packages from
repos. I had been wondering about speeding up some other things, maybe by
taking advantage of SSE4.x (string stuff, I don't know), but the same
problem applies. Runtime checks would be awesome, but except with recent
GCC, the intrinsics aren't available unless the corresponding SSE option
is enabled (lame!). So that requires a separate compilation unit. :-/
Of course, I guess if the intrinsic maps directly to a single instruction,
we could just use inline asm if we wanted to do runtime CPU checking.
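To make the runtime-check idea concrete, here is a minimal sketch, assuming
GCC 4.9+ or Clang on x86, where a single function can be compiled for
SSE4.2 via a target attribute without passing -msse4.2 for the whole file.
The crc32c_* names are hypothetical and not from php-src, and CRC-32C is
used only because it maps to one SSE4.2 instruction and is easy to check
against a portable fallback.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Portable bitwise CRC-32C (Castagnoli), used when SSE4.2 is unavailable. */
static uint32_t crc32c_generic(const unsigned char *p, size_t n)
{
    uint32_t crc = 0xFFFFFFFFu;
    while (n--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return ~crc;
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <nmmintrin.h> /* SSE4.2 intrinsics */
#define HAVE_X86_DISPATCH 1

/* Only this function is compiled with SSE4.2 enabled. */
__attribute__((target("sse4.2")))
static uint32_t crc32c_sse42(const unsigned char *p, size_t n)
{
    uint32_t crc = 0xFFFFFFFFu;
    while (n--)
        crc = _mm_crc32_u8(crc, *p++);
    return ~crc;
}
#endif

/* Picks the implementation at run time based on what the CPU reports. */
uint32_t crc32c(const void *buf, size_t n)
{
    assert(buf != NULL || n == 0);
#ifdef HAVE_X86_DISPATCH
    if (__builtin_cpu_supports("sse4.2"))
        return crc32c_sse42(buf, n);
#endif
    return crc32c_generic(buf, n);
}
```

Both paths should return the same value for the same input, so a binary
built this way could still run on a CPU without SSE4.2.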
>> I would like to propose a slightly different implementation which uses
>> _mm_store_si128() instead of _mm_stream_si128(). This ensures that
>> copied memory is preserved in the data cache, which is not bad as the
>> interpreter will start to use this data without the need to go back one
>> more time to memory.
>> _mm_stream_si128() in the current implementation is intended to be used
>> for stores where we want to avoid reading data into the cache and the
>> cache pollution; in the opcache scenario it seems that preserving the
>> data in cache has a positive impact.
>>
>
> _mm_stream_si128() was used on purpose, to avoid CPU cache pollution,
> because data copied from SHM to process memory is not necessarily used
> before eviction.
> By the way, I'm not completely sure. Maybe _mm_store_si128() can provide
> a better result.
Interesting (that _stream was used on purpose). :-)
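For anyone following along, the difference under discussion can be
sketched like this. It is a simplified illustration, not the actual
fast_memcpy() from opcache: the function names are made up, and it assumes
an SSE2-capable x86 target, 16-byte-aligned pointers, and a size that is a
multiple of 16.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Regular aligned stores: the destination lines end up in the data cache. */
static void copy_cached(void *dst, const void *src, size_t size)
{
    __m128i *d = dst;
    const __m128i *s = src;

    assert(((uintptr_t)dst | (uintptr_t)src | size) % 16 == 0);
    for (size_t i = 0; i < size / 16; i++)
        _mm_store_si128(d + i, _mm_load_si128(s + i));
}

/* Non-temporal stores: written with a hint to bypass the cache. */
static void copy_streaming(void *dst, const void *src, size_t size)
{
    __m128i *d = dst;
    const __m128i *s = src;

    assert(((uintptr_t)dst | (uintptr_t)src | size) % 16 == 0);
    for (size_t i = 0; i < size / 16; i++)
        _mm_stream_si128(d + i, _mm_load_si128(s + i));
    _mm_sfence(); /* order the streaming stores before later stores */
}
```

Both variants produce the same bytes in the destination; which one is
faster depends on whether the copied data is read again soon, which is
exactly the open question here.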
>> Running php-cgi -T10000 on WordPress4.1/index.php I see a ~1%
>> performance increase for the new version of fast_memcpy() compared with
>> the generic memcpy(). Same result using a full load test with http_load
>> on a Haswell EP with 18 cores.
>>
>
> 1% is a really big improvement.
> I'll be able to check this only next week (when I'm back from vacation).
Well, it sounds like he was comparing against the *generic* memcpy(),
so...? But I'm not sure how that would have been accomplished.
BTW guys, I was wondering before why fast_memcpy() is only used in this
opcache area. Is it for the prefetch and/or cache pollution reasons?
Shouldn't the library functions in glibc, etc. already be using versions
optimized for the CPU at runtime? So isn't generic memcpy() already "fast"
(other than the overhead of a function call)?
>> Here is the proposed pull request:
>> https://github.com/php/php-src/pull/1446
>>
>> Related to the SW prefetching instructions in fast_memcpy()... they are
>> not really useful in this place. Their benefit is almost negligible as
>> the address requested for prefetch will be needed at the next iteration
>> (a few cycles later), while the time needed to get data from RAM is
>> usually >100 cycles. Nevertheless... they don't hurt and it seems they
>> still have a very small benefit, so I preserved the original instruction
>> and I added a new prefetch request for the destination pointer.
>>
>
> I also didn't see a significant difference from software prefetching.
So how about prefetching "further," i.e. more iterations ahead...?
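Roughly what I have in mind, as an untested-for-performance sketch: the
function name and the PREFETCH_AHEAD distance are hypothetical, the 512
bytes is a placeholder that would have to be tuned by measurement, and the
same assumptions apply as before (SSE2 on x86, 16-byte-aligned pointers,
size a multiple of 64).

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 intrinsics, also provides _mm_prefetch */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical prefetch distance in bytes: several 64-byte blocks ahead. */
#define PREFETCH_AHEAD 512

/* Copies 64 bytes per iteration and prefetches the source well ahead. */
static void copy_prefetch_ahead(void *dst, const void *src, size_t size)
{
    __m128i *d = dst;
    const __m128i *s = src;
    const char *bytes = src;

    assert(((uintptr_t)dst | (uintptr_t)src) % 16 == 0 && size % 64 == 0);
    for (size_t off = 0; off < size; off += 64, d += 4, s += 4) {
        /* Request data that is needed several iterations from now,
         * skipping the hint once it would point past the buffer. */
        if (off + PREFETCH_AHEAD < size)
            _mm_prefetch(bytes + off + PREFETCH_AHEAD, _MM_HINT_T0);
        for (int j = 0; j < 4; j++)
            _mm_store_si128(d + j, _mm_load_si128(s + j));
    }
}
```

The idea is only that the prefetched line has time to arrive before the
loop reaches it; whether that beats the hardware prefetcher on a plain
sequential copy would need to be measured.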
> Thanks. Dmitry.
>
>
>>
>> Hope it helps,
>> Bogdan
- Matt
--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php