Dash Santosh
ca2a88c1b3
swscale/output: Implement yuv2nv12cX neon assembly
...
yuv2nv12cX_2_512_accurate_c: 3540.1 ( 1.00x)
yuv2nv12cX_2_512_accurate_neon: 408.0 ( 8.68x)
yuv2nv12cX_2_512_approximate_c: 3521.4 ( 1.00x)
yuv2nv12cX_2_512_approximate_neon: 409.2 ( 8.61x)
yuv2nv12cX_4_512_accurate_c: 4740.0 ( 1.00x)
yuv2nv12cX_4_512_accurate_neon: 604.4 ( 7.84x)
yuv2nv12cX_4_512_approximate_c: 4681.9 ( 1.00x)
yuv2nv12cX_4_512_approximate_neon: 603.3 ( 7.76x)
yuv2nv12cX_8_512_accurate_c: 7273.1 ( 1.00x)
yuv2nv12cX_8_512_accurate_neon: 1012.2 ( 7.19x)
yuv2nv12cX_8_512_approximate_c: 7223.0 ( 1.00x)
yuv2nv12cX_8_512_approximate_neon: 1015.8 ( 7.11x)
yuv2nv12cX_16_512_accurate_c: 13762.0 ( 1.00x)
yuv2nv12cX_16_512_accurate_neon: 1761.4 ( 7.81x)
yuv2nv12cX_16_512_approximate_c: 13884.0 ( 1.00x)
yuv2nv12cX_16_512_approximate_neon: 1766.8 ( 7.86x)
Benchmarked on:
Snapdragon(R) X Elite - X1E80100 - Qualcomm(R) Oryon(TM) CPU
3417 MHz, 12 Core(s), 12 Logical Processor(s)
2025-08-12 09:05:00 +00:00
Logaprakash Ramajayam
49477972b7
swscale/aarch64/output: Implement neon assembly for yuv2planeX_10_c_template()
...
yuv2yuvX_8_2_0_512_accurate_c: 2213.4 ( 1.00x)
yuv2yuvX_8_2_0_512_accurate_neon: 147.5 (15.01x)
yuv2yuvX_8_2_0_512_approximate_c: 2203.9 ( 1.00x)
yuv2yuvX_8_2_0_512_approximate_neon: 154.1 (14.30x)
yuv2yuvX_8_2_16_512_accurate_c: 2147.2 ( 1.00x)
yuv2yuvX_8_2_16_512_accurate_neon: 150.8 (14.24x)
yuv2yuvX_8_2_16_512_approximate_c: 2149.7 ( 1.00x)
yuv2yuvX_8_2_16_512_approximate_neon: 146.8 (14.64x)
yuv2yuvX_8_2_32_512_accurate_c: 2078.9 ( 1.00x)
yuv2yuvX_8_2_32_512_accurate_neon: 139.0 (14.95x)
yuv2yuvX_8_2_32_512_approximate_c: 2083.7 ( 1.00x)
yuv2yuvX_8_2_32_512_approximate_neon: 140.5 (14.84x)
yuv2yuvX_8_2_48_512_accurate_c: 2010.7 ( 1.00x)
yuv2yuvX_8_2_48_512_accurate_neon: 138.2 (14.55x)
yuv2yuvX_8_2_48_512_approximate_c: 2012.6 ( 1.00x)
yuv2yuvX_8_2_48_512_approximate_neon: 141.2 (14.26x)
yuv2yuvX_10LE_16_0_512_accurate_c: 7874.1 ( 1.00x)
yuv2yuvX_10LE_16_0_512_accurate_neon: 831.6 ( 9.47x)
yuv2yuvX_10LE_16_0_512_approximate_c: 7918.1 ( 1.00x)
yuv2yuvX_10LE_16_0_512_approximate_neon: 836.1 ( 9.47x)
yuv2yuvX_10LE_16_16_512_accurate_c: 7630.9 ( 1.00x)
yuv2yuvX_10LE_16_16_512_accurate_neon: 804.5 ( 9.49x)
yuv2yuvX_10LE_16_16_512_approximate_c: 7724.7 ( 1.00x)
yuv2yuvX_10LE_16_16_512_approximate_neon: 808.6 ( 9.55x)
yuv2yuvX_10LE_16_32_512_accurate_c: 7436.4 ( 1.00x)
yuv2yuvX_10LE_16_32_512_accurate_neon: 780.4 ( 9.53x)
yuv2yuvX_10LE_16_32_512_approximate_c: 7366.7 ( 1.00x)
yuv2yuvX_10LE_16_32_512_approximate_neon: 780.5 ( 9.44x)
yuv2yuvX_10LE_16_48_512_accurate_c: 7099.9 ( 1.00x)
yuv2yuvX_10LE_16_48_512_accurate_neon: 761.0 ( 9.33x)
yuv2yuvX_10LE_16_48_512_approximate_c: 7097.6 ( 1.00x)
yuv2yuvX_10LE_16_48_512_approximate_neon: 754.6 ( 9.41x)
Benchmarked on:
Snapdragon(R) X Elite - X1E80100 - Qualcomm(R) Oryon(TM) CPU
3417 MHz, 12 Core(s), 12 Logical Processor(s)
2025-08-12 09:05:00 +00:00
Timo Rothenpieler
262d41c804
all: fix typos found by codespell
2025-08-03 13:48:47 +02:00
Martin Storsjö
73f4668ef8
swscale: aarch64: Simplify the assignment of lumToYV12
...
We normally don't need else statements here; the common pattern
is to assign lower-level SIMD implementations first, then
conditionally reassign higher-level ones afterwards, if supported.
Signed-off-by: Martin Storsjö <martin@martin.st>
2025-03-10 14:03:58 +02:00
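The assignment pattern described in the commit message above can be sketched in scalar C (function and flag names here are illustrative, not FFmpeg's actual identifiers): assign the baseline implementation first, then let each higher-level extension overwrite it, with no else chains.

```c
#include <assert.h>

/* Illustrative sketch of the init pattern: assign the lower-level SIMD
 * implementation first, then conditionally reassign a higher-level one.
 * FLAG_* and the lum_* functions are hypothetical stand-ins. */
#define FLAG_NEON    1
#define FLAG_DOTPROD 2

typedef int (*lum_fn)(void);

static int lum_c(void)       { return 0; }
static int lum_neon(void)    { return 1; }
static int lum_dotprod(void) { return 2; }

static lum_fn select_lum_to_yv12(int cpu_flags)
{
    lum_fn fn = lum_c;              /* portable fallback */
    if (cpu_flags & FLAG_NEON)
        fn = lum_neon;              /* baseline SIMD */
    if (cpu_flags & FLAG_DOTPROD)
        fn = lum_dotprod;           /* extension overrides if present */
    return fn;
}
```

The last matching assignment wins, so the most capable supported implementation is selected without nesting.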
Krzysztof Pyrkosz
d765e5f043
swscale/aarch64: dotprod implementation of rgba32_to_Y
...
The idea is to split the 16 bit coefficients into lower and upper half,
invoke udot for the lower half, shift by 8, and follow by udot for the
upper half.
Benchmark on A78:
bgra_to_y_128_c: 682.0 ( 1.00x)
bgra_to_y_128_neon: 181.2 ( 3.76x)
bgra_to_y_128_dotprod: 117.8 ( 5.79x)
bgra_to_y_1080_c: 5742.5 ( 1.00x)
bgra_to_y_1080_neon: 1472.5 ( 3.90x)
bgra_to_y_1080_dotprod: 906.5 ( 6.33x)
bgra_to_y_1920_c: 10194.0 ( 1.00x)
bgra_to_y_1920_neon: 2589.8 ( 3.94x)
bgra_to_y_1920_dotprod: 1573.8 ( 6.48x)
Signed-off-by: Martin Storsjö <martin@martin.st>
2025-03-04 10:16:44 +02:00
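The coefficient-splitting trick described above can be modelled in scalar C (an illustrative sketch, not the NEON code itself): since a 16-bit coefficient is lo + (hi << 8), a 16-bit dot product decomposes into two 8-bit dot products, which is exactly what a udot lane computes, recombined with a shift by 8.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of the udot split: take the 8-bit dot product with the low
 * coefficient bytes, then with the high bytes, and shift the latter by 8. */
static uint32_t dot_u8(const uint8_t px[4], const uint8_t co[4])
{
    uint32_t acc = 0;
    for (int i = 0; i < 4; i++)
        acc += (uint32_t)px[i] * co[i];
    return acc;
}

static uint32_t dot_u16_split(const uint8_t px[4], const uint16_t co[4])
{
    uint8_t lo[4], hi[4];
    for (int i = 0; i < 4; i++) {
        lo[i] = co[i] & 0xff;       /* lower half of each coefficient */
        hi[i] = co[i] >> 8;         /* upper half of each coefficient */
    }
    return dot_u8(px, lo) + (dot_u8(px, hi) << 8);
}
```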
Krzysztof Pyrkosz
38929b824b
swscale/aarch64: Refactor hscale_16_to_15__fs_4
...
This patch removes the use of the stack for temporary state and replaces
interleaved ld4 loads with ld1.
Before/after:
A78
hscale_16_to_15__fs_4_dstW_8_neon: 86.8 ( 1.72x)
hscale_16_to_15__fs_4_dstW_24_neon: 147.5 ( 2.73x)
hscale_16_to_15__fs_4_dstW_128_neon: 614.0 ( 3.14x)
hscale_16_to_15__fs_4_dstW_144_neon: 680.5 ( 3.18x)
hscale_16_to_15__fs_4_dstW_256_neon: 1193.2 ( 3.19x)
hscale_16_to_15__fs_4_dstW_512_neon: 2305.0 ( 3.27x)
hscale_16_to_15__fs_4_dstW_8_neon: 86.0 ( 1.74x)
hscale_16_to_15__fs_4_dstW_24_neon: 106.8 ( 3.78x)
hscale_16_to_15__fs_4_dstW_128_neon: 404.0 ( 4.81x)
hscale_16_to_15__fs_4_dstW_144_neon: 451.8 ( 4.80x)
hscale_16_to_15__fs_4_dstW_256_neon: 760.5 ( 5.06x)
hscale_16_to_15__fs_4_dstW_512_neon: 1520.0 ( 5.01x)
A72
hscale_16_to_15__fs_4_dstW_8_neon: 156.8 ( 1.52x)
hscale_16_to_15__fs_4_dstW_24_neon: 217.8 ( 2.52x)
hscale_16_to_15__fs_4_dstW_128_neon: 906.8 ( 2.90x)
hscale_16_to_15__fs_4_dstW_144_neon: 1014.5 ( 2.91x)
hscale_16_to_15__fs_4_dstW_256_neon: 1751.5 ( 2.96x)
hscale_16_to_15__fs_4_dstW_512_neon: 3469.3 ( 2.97x)
hscale_16_to_15__fs_4_dstW_8_neon: 151.2 ( 1.54x)
hscale_16_to_15__fs_4_dstW_24_neon: 173.4 ( 3.15x)
hscale_16_to_15__fs_4_dstW_128_neon: 660.0 ( 3.98x)
hscale_16_to_15__fs_4_dstW_144_neon: 735.7 ( 4.00x)
hscale_16_to_15__fs_4_dstW_256_neon: 1273.5 ( 4.09x)
hscale_16_to_15__fs_4_dstW_512_neon: 2488.2 ( 4.16x)
Signed-off-by: Martin Storsjö <martin@martin.st>
2025-03-02 01:17:29 +02:00
Martin Storsjö
b137347278
aarch64: Fix a few misindented lines
...
Signed-off-by: Martin Storsjö <martin@martin.st>
2025-02-28 23:23:09 +02:00
Krzysztof Pyrkosz
b92577405b
swscale/aarch64/rgb2rgb_neon: Implemented {yuyv, uyvy}toyuv{420, 422}
...
A78:
uyvytoyuv420_neon: 6112.5 ( 6.96x)
uyvytoyuv422_neon: 6696.0 ( 6.32x)
yuyvtoyuv420_neon: 6113.0 ( 6.95x)
yuyvtoyuv422_neon: 6695.2 ( 6.31x)
A72:
uyvytoyuv420_neon: 9512.1 ( 6.09x)
uyvytoyuv422_neon: 9766.8 ( 6.32x)
yuyvtoyuv420_neon: 9639.1 ( 6.00x)
yuyvtoyuv422_neon: 9779.0 ( 6.03x)
A53:
uyvytoyuv420_neon: 12720.1 ( 9.10x)
uyvytoyuv422_neon: 14282.9 ( 6.71x)
yuyvtoyuv420_neon: 12637.4 ( 9.15x)
yuyvtoyuv422_neon: 14127.6 ( 6.77x)
Signed-off-by: Martin Storsjö <martin@martin.st>
2025-02-17 11:39:42 +02:00
Krzysztof Pyrkosz
64107e22f5
swscale/aarch64/rgb24toyv12: skip early right shift by 2
...
It's a minor improvement that shaves 5-8% off the execution time.
Instead of shifting right by 2 right away and by 7 soon after, shift once
by 9.
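The identity behind the change above can be shown in scalar form: for an unsigned value, a right shift by 2 followed by a right shift by 7 equals a single right shift by 9, so the two instructions can be merged into one.

```c
#include <assert.h>
#include <stdint.h>

/* For unsigned values, (x >> 2) >> 7 == x >> 9 -- two shifts collapse
 * into one. (A sketch of the identity only, not the NEON routine.) */
static uint32_t two_shifts(uint32_t x) { return (x >> 2) >> 7; }
static uint32_t one_shift(uint32_t x)  { return x >> 9; }
```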
Times before and after:
A78:
rgb24toyv12_16_200_neon: 5366.8 ( 3.62x)
rgb24toyv12_128_60_neon: 13574.0 ( 3.34x)
rgb24toyv12_512_16_neon: 14463.8 ( 3.33x)
rgb24toyv12_1920_4_neon: 13508.2 ( 3.34x)
rgb24toyv12_1920_4_negstride_neon: 13525.0 ( 3.34x)
rgb24toyv12_16_200_neon: 5293.8 ( 3.66x)
rgb24toyv12_128_60_neon: 12955.0 ( 3.50x)
rgb24toyv12_512_16_neon: 13784.0 ( 3.50x)
rgb24toyv12_1920_4_neon: 12900.8 ( 3.49x)
rgb24toyv12_1920_4_negstride_neon: 12902.8 ( 3.49x)
A72:
rgb24toyv12_16_200_neon: 9695.8 ( 2.50x)
rgb24toyv12_128_60_neon: 20286.6 ( 2.70x)
rgb24toyv12_512_16_neon: 22276.6 ( 2.57x)
rgb24toyv12_1920_4_neon: 19154.1 ( 2.77x)
rgb24toyv12_1920_4_negstride_neon: 19055.1 ( 2.78x)
rgb24toyv12_16_200_neon: 9214.8 ( 2.65x)
rgb24toyv12_128_60_neon: 20731.5 ( 2.65x)
rgb24toyv12_512_16_neon: 21145.0 ( 2.70x)
rgb24toyv12_1920_4_neon: 17586.5 ( 2.99x)
rgb24toyv12_1920_4_negstride_neon: 17571.0 ( 2.98x)
A53:
rgb24toyv12_16_200_neon: 12880.4 ( 3.76x)
rgb24toyv12_128_60_neon: 27776.3 ( 3.94x)
rgb24toyv12_512_16_neon: 29411.3 ( 3.94x)
rgb24toyv12_1920_4_neon: 27253.1 ( 3.98x)
rgb24toyv12_1920_4_negstride_neon: 27474.3 ( 3.95x)
rgb24toyv12_16_200_neon: 12196.3 ( 3.95x)
rgb24toyv12_128_60_neon: 26943.1 ( 4.07x)
rgb24toyv12_512_16_neon: 28642.3 ( 4.07x)
rgb24toyv12_1920_4_neon: 26676.6 ( 4.08x)
rgb24toyv12_1920_4_negstride_neon: 26713.8 ( 4.07x)
Signed-off-by: Martin Storsjö <martin@martin.st>
2025-02-17 10:49:41 +02:00
Krzysztof Pyrkosz
c85a748979
swscale/aarch64/rgb2rgb: Implemented NEON shuf routines
...
The key idea is to pass the pre-generated tables to the TBL instruction
and churn through the data 16 bytes at a time. The remaining 4 elements
are handled with a specialized block located at the end of the routine.
The 3210 variant can be implemented using rev32; surprisingly, it is
slower than the generic TBL on A78, but much faster on A72.
There may be some room for improvement: instead of handling the last 8
and then 4 bytes separately, we could load those 4 bytes into {v0.s}[2]
and process them along with the last 8.
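The TBL-based shuffle described above can be modelled in scalar C (illustrative only): a small index table selects, for each output byte, which byte of the 4-byte input pixel to copy; the table {3,2,1,0} corresponds to the "3210" variant that reverses the byte order of every 32-bit pixel.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of a table-driven byte shuffle: tbl[j] names the source
 * byte within each 4-byte pixel for output position j. */
static void shuffle_bytes(const uint8_t *src, uint8_t *dst, int n_pixels,
                          const uint8_t tbl[4])
{
    for (int i = 0; i < n_pixels; i++)
        for (int j = 0; j < 4; j++)
            dst[4 * i + j] = src[4 * i + tbl[j]];
}
```

NEON's TBL performs this lookup 16 bytes at a time from a pre-generated table register.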
Speeds measured with checkasm --test=sw_rgb --bench --runs=10 | grep shuf
- A78
shuffle_bytes_0321_c: 75.5 ( 1.00x)
shuffle_bytes_0321_neon: 26.5 ( 2.85x)
shuffle_bytes_1203_c: 136.2 ( 1.00x)
shuffle_bytes_1203_neon: 27.2 ( 5.00x)
shuffle_bytes_1230_c: 135.5 ( 1.00x)
shuffle_bytes_1230_neon: 28.0 ( 4.84x)
shuffle_bytes_2013_c: 138.8 ( 1.00x)
shuffle_bytes_2013_neon: 22.0 ( 6.31x)
shuffle_bytes_2103_c: 76.5 ( 1.00x)
shuffle_bytes_2103_neon: 20.5 ( 3.73x)
shuffle_bytes_2130_c: 137.5 ( 1.00x)
shuffle_bytes_2130_neon: 28.0 ( 4.91x)
shuffle_bytes_3012_c: 138.2 ( 1.00x)
shuffle_bytes_3012_neon: 21.5 ( 6.43x)
shuffle_bytes_3102_c: 138.2 ( 1.00x)
shuffle_bytes_3102_neon: 27.2 ( 5.07x)
shuffle_bytes_3210_c: 138.0 ( 1.00x)
shuffle_bytes_3210_neon: 22.0 ( 6.27x)
shuf3210 using rev32
shuffle_bytes_3210_c: 139.0 ( 1.00x)
shuffle_bytes_3210_neon: 28.5 ( 4.88x)
- A72
shuffle_bytes_0321_c: 120.0 ( 1.00x)
shuffle_bytes_0321_neon: 36.0 ( 3.33x)
shuffle_bytes_1203_c: 188.2 ( 1.00x)
shuffle_bytes_1203_neon: 37.8 ( 4.99x)
shuffle_bytes_1230_c: 195.0 ( 1.00x)
shuffle_bytes_1230_neon: 36.0 ( 5.42x)
shuffle_bytes_2013_c: 195.8 ( 1.00x)
shuffle_bytes_2013_neon: 43.5 ( 4.50x)
shuffle_bytes_2103_c: 117.2 ( 1.00x)
shuffle_bytes_2103_neon: 53.5 ( 2.19x)
shuffle_bytes_2130_c: 203.2 ( 1.00x)
shuffle_bytes_2130_neon: 37.8 ( 5.38x)
shuffle_bytes_3012_c: 183.8 ( 1.00x)
shuffle_bytes_3012_neon: 46.8 ( 3.93x)
shuffle_bytes_3102_c: 180.8 ( 1.00x)
shuffle_bytes_3102_neon: 37.8 ( 4.79x)
shuffle_bytes_3210_c: 195.8 ( 1.00x)
shuffle_bytes_3210_neon: 37.8 ( 5.19x)
shuf3210 using rev32
shuffle_bytes_3210_c: 194.8 ( 1.00x)
shuffle_bytes_3210_neon: 30.8 ( 6.33x)
- x13s:
shuffle_bytes_0321_c: 49.4 ( 1.00x)
shuffle_bytes_0321_neon: 18.1 ( 2.72x)
shuffle_bytes_1203_c: 98.4 ( 1.00x)
shuffle_bytes_1203_neon: 18.4 ( 5.35x)
shuffle_bytes_1230_c: 97.4 ( 1.00x)
shuffle_bytes_1230_neon: 19.1 ( 5.09x)
shuffle_bytes_2013_c: 101.4 ( 1.00x)
shuffle_bytes_2013_neon: 16.9 ( 6.01x)
shuffle_bytes_2103_c: 53.9 ( 1.00x)
shuffle_bytes_2103_neon: 13.9 ( 3.88x)
shuffle_bytes_2130_c: 100.9 ( 1.00x)
shuffle_bytes_2130_neon: 19.1 ( 5.27x)
shuffle_bytes_3012_c: 97.4 ( 1.00x)
shuffle_bytes_3012_neon: 17.1 ( 5.69x)
shuffle_bytes_3102_c: 100.9 ( 1.00x)
shuffle_bytes_3102_neon: 19.1 ( 5.27x)
shuffle_bytes_3210_c: 100.6 ( 1.00x)
shuffle_bytes_3210_neon: 16.9 ( 5.96x)
shuf3210 using rev32
shuffle_bytes_3210_c: 100.6 ( 1.00x)
shuffle_bytes_3210_neon: 18.6 ( 5.40x)
Signed-off-by: Martin Storsjö <martin@martin.st>
2025-02-07 12:54:55 +02:00
Krzysztof Pyrkosz
e25a19fc7c
swscale/aarch64/output.S: refactor ff_yuv2plane1_8_neon
...
The benchmarks (before vs after) were gathered using
./tests/checkasm/checkasm --test=sw_scale --bench --runs=6 | grep yuv2yuv1
A78 before:
yuv2yuv1_0_512_accurate_c: 2039.5 ( 1.00x)
yuv2yuv1_0_512_accurate_neon: 385.5 ( 5.29x)
yuv2yuv1_0_512_approximate_c: 2110.5 ( 1.00x)
yuv2yuv1_0_512_approximate_neon: 385.5 ( 5.47x)
yuv2yuv1_3_512_accurate_c: 2061.2 ( 1.00x)
yuv2yuv1_3_512_accurate_neon: 381.2 ( 5.41x)
yuv2yuv1_3_512_approximate_c: 2099.2 ( 1.00x)
yuv2yuv1_3_512_approximate_neon: 381.2 ( 5.51x)
yuv2yuv1_8_512_accurate_c: 2054.2 ( 1.00x)
yuv2yuv1_8_512_accurate_neon: 385.5 ( 5.33x)
yuv2yuv1_8_512_approximate_c: 2112.2 ( 1.00x)
yuv2yuv1_8_512_approximate_neon: 385.5 ( 5.48x)
yuv2yuv1_11_512_accurate_c: 2036.0 ( 1.00x)
yuv2yuv1_11_512_accurate_neon: 381.2 ( 5.34x)
yuv2yuv1_11_512_approximate_c: 2115.0 ( 1.00x)
yuv2yuv1_11_512_approximate_neon: 381.2 ( 5.55x)
yuv2yuv1_16_512_accurate_c: 2066.5 ( 1.00x)
yuv2yuv1_16_512_accurate_neon: 385.5 ( 5.36x)
yuv2yuv1_16_512_approximate_c: 2100.8 ( 1.00x)
yuv2yuv1_16_512_approximate_neon: 385.5 ( 5.45x)
yuv2yuv1_19_512_accurate_c: 2059.8 ( 1.00x)
yuv2yuv1_19_512_accurate_neon: 381.2 ( 5.40x)
yuv2yuv1_19_512_approximate_c: 2102.8 ( 1.00x)
yuv2yuv1_19_512_approximate_neon: 381.2 ( 5.52x)
After:
yuv2yuv1_0_512_accurate_c: 2206.0 ( 1.00x)
yuv2yuv1_0_512_accurate_neon: 139.2 (15.84x)
yuv2yuv1_0_512_approximate_c: 2050.0 ( 1.00x)
yuv2yuv1_0_512_approximate_neon: 139.2 (14.72x)
yuv2yuv1_3_512_accurate_c: 2205.2 ( 1.00x)
yuv2yuv1_3_512_accurate_neon: 138.0 (15.98x)
yuv2yuv1_3_512_approximate_c: 2052.5 ( 1.00x)
yuv2yuv1_3_512_approximate_neon: 138.0 (14.87x)
yuv2yuv1_8_512_accurate_c: 2171.0 ( 1.00x)
yuv2yuv1_8_512_accurate_neon: 139.2 (15.59x)
yuv2yuv1_8_512_approximate_c: 2064.2 ( 1.00x)
yuv2yuv1_8_512_approximate_neon: 139.2 (14.82x)
yuv2yuv1_11_512_accurate_c: 2164.8 ( 1.00x)
yuv2yuv1_11_512_accurate_neon: 138.0 (15.69x)
yuv2yuv1_11_512_approximate_c: 2048.8 ( 1.00x)
yuv2yuv1_11_512_approximate_neon: 138.0 (14.85x)
yuv2yuv1_16_512_accurate_c: 2154.5 ( 1.00x)
yuv2yuv1_16_512_accurate_neon: 139.2 (15.47x)
yuv2yuv1_16_512_approximate_c: 2047.2 ( 1.00x)
yuv2yuv1_16_512_approximate_neon: 139.2 (14.70x)
yuv2yuv1_19_512_accurate_c: 2144.5 ( 1.00x)
yuv2yuv1_19_512_accurate_neon: 138.0 (15.54x)
yuv2yuv1_19_512_approximate_c: 2046.0 ( 1.00x)
yuv2yuv1_19_512_approximate_neon: 138.0 (14.83x)
A72 before:
yuv2yuv1_0_512_accurate_c: 3779.8 ( 1.00x)
yuv2yuv1_0_512_accurate_neon: 527.8 ( 7.16x)
yuv2yuv1_0_512_approximate_c: 4128.2 ( 1.00x)
yuv2yuv1_0_512_approximate_neon: 528.2 ( 7.81x)
yuv2yuv1_3_512_accurate_c: 3836.2 ( 1.00x)
yuv2yuv1_3_512_accurate_neon: 527.0 ( 7.28x)
yuv2yuv1_3_512_approximate_c: 3991.0 ( 1.00x)
yuv2yuv1_3_512_approximate_neon: 526.8 ( 7.58x)
yuv2yuv1_8_512_accurate_c: 3732.8 ( 1.00x)
yuv2yuv1_8_512_accurate_neon: 525.5 ( 7.10x)
yuv2yuv1_8_512_approximate_c: 4060.0 ( 1.00x)
yuv2yuv1_8_512_approximate_neon: 527.0 ( 7.70x)
yuv2yuv1_11_512_accurate_c: 3836.2 ( 1.00x)
yuv2yuv1_11_512_accurate_neon: 530.0 ( 7.24x)
yuv2yuv1_11_512_approximate_c: 4014.0 ( 1.00x)
yuv2yuv1_11_512_approximate_neon: 530.0 ( 7.57x)
yuv2yuv1_16_512_accurate_c: 3726.2 ( 1.00x)
yuv2yuv1_16_512_accurate_neon: 525.5 ( 7.09x)
yuv2yuv1_16_512_approximate_c: 4114.2 ( 1.00x)
yuv2yuv1_16_512_approximate_neon: 526.2 ( 7.82x)
yuv2yuv1_19_512_accurate_c: 3812.2 ( 1.00x)
yuv2yuv1_19_512_accurate_neon: 530.0 ( 7.19x)
yuv2yuv1_19_512_approximate_c: 4012.2 ( 1.00x)
yuv2yuv1_19_512_approximate_neon: 530.0 ( 7.57x)
After:
yuv2yuv1_0_512_accurate_c: 3716.8 ( 1.00x)
yuv2yuv1_0_512_accurate_neon: 215.1 (17.28x)
yuv2yuv1_0_512_approximate_c: 3877.8 ( 1.00x)
yuv2yuv1_0_512_approximate_neon: 222.8 (17.40x)
yuv2yuv1_3_512_accurate_c: 3717.1 ( 1.00x)
yuv2yuv1_3_512_accurate_neon: 217.8 (17.06x)
yuv2yuv1_3_512_approximate_c: 3801.6 ( 1.00x)
yuv2yuv1_3_512_approximate_neon: 220.3 (17.25x)
yuv2yuv1_8_512_accurate_c: 3716.6 ( 1.00x)
yuv2yuv1_8_512_accurate_neon: 213.8 (17.38x)
yuv2yuv1_8_512_approximate_c: 3831.8 ( 1.00x)
yuv2yuv1_8_512_approximate_neon: 218.1 (17.57x)
yuv2yuv1_11_512_accurate_c: 3717.1 ( 1.00x)
yuv2yuv1_11_512_accurate_neon: 219.1 (16.97x)
yuv2yuv1_11_512_approximate_c: 3801.6 ( 1.00x)
yuv2yuv1_11_512_approximate_neon: 216.1 (17.59x)
yuv2yuv1_16_512_accurate_c: 3716.6 ( 1.00x)
yuv2yuv1_16_512_accurate_neon: 213.6 (17.40x)
yuv2yuv1_16_512_approximate_c: 3831.6 ( 1.00x)
yuv2yuv1_16_512_approximate_neon: 215.1 (17.82x)
yuv2yuv1_19_512_accurate_c: 3717.1 ( 1.00x)
yuv2yuv1_19_512_accurate_neon: 223.8 (16.61x)
yuv2yuv1_19_512_approximate_c: 3801.6 ( 1.00x)
yuv2yuv1_19_512_approximate_neon: 219.1 (17.35x)
x13s before:
yuv2yuv1_0_512_accurate_c: 1435.1 ( 1.00x)
yuv2yuv1_0_512_accurate_neon: 221.1 ( 6.49x)
yuv2yuv1_0_512_approximate_c: 1405.4 ( 1.00x)
yuv2yuv1_0_512_approximate_neon: 219.1 ( 6.41x)
yuv2yuv1_3_512_accurate_c: 1418.6 ( 1.00x)
yuv2yuv1_3_512_accurate_neon: 215.9 ( 6.57x)
yuv2yuv1_3_512_approximate_c: 1405.9 ( 1.00x)
yuv2yuv1_3_512_approximate_neon: 224.1 ( 6.27x)
yuv2yuv1_8_512_accurate_c: 1433.9 ( 1.00x)
yuv2yuv1_8_512_accurate_neon: 218.6 ( 6.56x)
yuv2yuv1_8_512_approximate_c: 1412.9 ( 1.00x)
yuv2yuv1_8_512_approximate_neon: 218.9 ( 6.46x)
yuv2yuv1_11_512_accurate_c: 1449.1 ( 1.00x)
yuv2yuv1_11_512_accurate_neon: 217.6 ( 6.66x)
yuv2yuv1_11_512_approximate_c: 1410.9 ( 1.00x)
yuv2yuv1_11_512_approximate_neon: 221.1 ( 6.38x)
yuv2yuv1_16_512_accurate_c: 1402.1 ( 1.00x)
yuv2yuv1_16_512_accurate_neon: 214.6 ( 6.53x)
yuv2yuv1_16_512_approximate_c: 1422.4 ( 1.00x)
yuv2yuv1_16_512_approximate_neon: 222.9 ( 6.38x)
yuv2yuv1_19_512_accurate_c: 1421.6 ( 1.00x)
yuv2yuv1_19_512_accurate_neon: 217.4 ( 6.54x)
yuv2yuv1_19_512_approximate_c: 1421.6 ( 1.00x)
yuv2yuv1_19_512_approximate_neon: 221.4 ( 6.42x)
After:
yuv2yuv1_0_512_accurate_c: 1413.6 ( 1.00x)
yuv2yuv1_0_512_accurate_neon: 80.6 (17.53x)
yuv2yuv1_0_512_approximate_c: 1455.6 ( 1.00x)
yuv2yuv1_0_512_approximate_neon: 80.6 (18.05x)
yuv2yuv1_3_512_accurate_c: 1429.1 ( 1.00x)
yuv2yuv1_3_512_accurate_neon: 77.4 (18.47x)
yuv2yuv1_3_512_approximate_c: 1462.6 ( 1.00x)
yuv2yuv1_3_512_approximate_neon: 80.6 (18.14x)
yuv2yuv1_8_512_accurate_c: 1425.4 ( 1.00x)
yuv2yuv1_8_512_accurate_neon: 77.9 (18.30x)
yuv2yuv1_8_512_approximate_c: 1436.6 ( 1.00x)
yuv2yuv1_8_512_approximate_neon: 80.9 (17.76x)
yuv2yuv1_11_512_accurate_c: 1429.4 ( 1.00x)
yuv2yuv1_11_512_accurate_neon: 76.1 (18.78x)
yuv2yuv1_11_512_approximate_c: 1447.1 ( 1.00x)
yuv2yuv1_11_512_approximate_neon: 78.4 (18.46x)
yuv2yuv1_16_512_accurate_c: 1439.9 ( 1.00x)
yuv2yuv1_16_512_accurate_neon: 77.6 (18.55x)
yuv2yuv1_16_512_approximate_c: 1422.1 ( 1.00x)
yuv2yuv1_16_512_approximate_neon: 78.1 (18.20x)
yuv2yuv1_19_512_accurate_c: 1447.1 ( 1.00x)
yuv2yuv1_19_512_accurate_neon: 78.1 (18.52x)
yuv2yuv1_19_512_approximate_c: 1474.4 ( 1.00x)
yuv2yuv1_19_512_approximate_neon: 78.1 (18.87x)
Signed-off-by: Martin Storsjö <martin@martin.st>
2025-02-07 12:05:06 +02:00
Ramiro Polla
ca889b1328
swscale/aarch64: add neon {lum,chr}ConvertRange16
...
aarch64 A55:
chrRangeFromJpeg16_1920_c: 32684.2
chrRangeFromJpeg16_1920_neon: 8431.2 (3.88x)
chrRangeToJpeg16_1920_c: 24996.8
chrRangeToJpeg16_1920_neon: 9395.0 (2.66x)
lumRangeFromJpeg16_1920_c: 17305.2
lumRangeFromJpeg16_1920_neon: 4586.5 (3.77x)
lumRangeToJpeg16_1920_c: 21144.8
lumRangeToJpeg16_1920_neon: 5069.8 (4.17x)
aarch64 A76:
chrRangeFromJpeg16_1920_c: 11523.8
chrRangeFromJpeg16_1920_neon: 3367.5 (3.42x)
chrRangeToJpeg16_1920_c: 11655.2
chrRangeToJpeg16_1920_neon: 4087.2 (2.85x)
lumRangeFromJpeg16_1920_c: 5762.0
lumRangeFromJpeg16_1920_neon: 1815.8 (3.17x)
lumRangeToJpeg16_1920_c: 5946.2
lumRangeToJpeg16_1920_neon: 2148.2 (2.77x)
2024-12-05 21:10:29 +01:00
Ramiro Polla
6fe4a4ffb6
swscale/aarch64/range_convert: update neon range_convert functions to new API
...
aarch64 A55:
chrRangeFromJpeg8_1920_c: 28835.2 (1.00x)
chrRangeFromJpeg8_1920_neon: 5313.9 (5.43x) 5308.4 (5.43x)
chrRangeToJpeg8_1920_c: 23074.7 (1.00x)
chrRangeToJpeg8_1920_neon: 5551.3 (4.16x) 5549.2 (4.16x)
lumRangeFromJpeg8_1920_c: 15389.7 (1.00x)
lumRangeFromJpeg8_1920_neon: 3152.3 (4.88x) 3147.7 (4.89x)
lumRangeToJpeg8_1920_c: 19227.8 (1.00x)
lumRangeToJpeg8_1920_neon: 3628.7 (5.30x) 3630.2 (5.30x)
aarch64 A76:
chrRangeFromJpeg8_1920_c: 6324.4 (1.00x)
chrRangeFromJpeg8_1920_neon: 2344.5 (2.70x) 2304.2 (2.74x)
chrRangeToJpeg8_1920_c: 9656.0 (1.00x)
chrRangeToJpeg8_1920_neon: 2824.2 (3.42x) 2794.2 (3.46x)
lumRangeFromJpeg8_1920_c: 4422.0 (1.00x)
lumRangeFromJpeg8_1920_neon: 1104.5 (4.00x) 1106.2 (4.00x)
lumRangeToJpeg8_1920_c: 5949.1 (1.00x)
lumRangeToJpeg8_1920_neon: 1329.8 (4.47x) 1328.2 (4.48x)
2024-12-05 21:10:29 +01:00
Ramiro Polla
384fe39623
swscale/range_convert: fix mpeg ranges in yuv range conversion for non-8-bit pixel formats
...
There is an issue with the constants used in YUV to YUV range conversion,
where the upper bound is not respected when converting to mpeg range.
With this commit, the constants are calculated at runtime, depending on
the bit depth. This approach also allows us to more easily understand how
the constants are derived.
For bit depths <= 14, the number of fixed point bits has been set to 14
for all conversions, to simplify the code.
For bit depths > 14, the number of fixed point bits has been raised to
18, to allow the conversion to be accurate enough for the mpeg range to
be respected.
The convert functions now take the conversion constants (coeff and offset)
as function arguments.
For bit depths <= 14, coeff is unsigned 16-bit and offset is 32-bit.
For bit depths > 14, coeff is unsigned 32-bit and offset is 64-bit.
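The scheme described above can be sketched for the 8-bit luma limited-to-full ("ToJpeg") case; the constants and rounding below are illustrative derivations, not necessarily libswscale's exact values. The scale factor 255/219 and the offset mapping y = 16 to 0 are expressed in 14 fixed-point bits, the depth-<=14 configuration the message describes.

```c
#include <assert.h>
#include <stdint.h>

enum { SHIFT = 14 };  /* fixed point bits for bit depths <= 14 */

/* One multiply-add-shift, taking coeff and offset as arguments as the
 * commit describes (illustrative sketch). */
static int32_t lum_to_jpeg(int32_t y, uint32_t coeff, int32_t offset)
{
    return (int32_t)(((int64_t)y * coeff + offset) >> SHIFT);
}

/* Derive the constants at runtime: scale 255/219 in fixed point, offset
 * chosen so that y = 16 maps to 0, plus half a step for rounding. */
static void derive_lum_to_jpeg(uint32_t *coeff, int32_t *offset)
{
    *coeff  = (255u * (1 << SHIFT) + 219 / 2) / 219;
    *offset = (1 << (SHIFT - 1)) - 16 * (int32_t)*coeff;
}
```

With these constants the nominal bounds land exactly on 0 and 255, which is the upper-bound property the commit fixes for the reverse (to-mpeg) direction.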
x86_64:
chrRangeFromJpeg8_1920_c: 2127.4 2125.0 (1.00x)
chrRangeFromJpeg16_1920_c: 2325.2 2127.2 (1.09x)
chrRangeToJpeg8_1920_c: 3166.9 3168.7 (1.00x)
chrRangeToJpeg16_1920_c: 2152.4 3164.8 (0.68x)
lumRangeFromJpeg8_1920_c: 1263.0 1302.5 (0.97x)
lumRangeFromJpeg16_1920_c: 1080.5 1299.2 (0.83x)
lumRangeToJpeg8_1920_c: 1886.8 2112.2 (0.89x)
lumRangeToJpeg16_1920_c: 1077.0 1906.5 (0.56x)
aarch64 A55:
chrRangeFromJpeg8_1920_c: 28835.2 28835.6 (1.00x)
chrRangeFromJpeg16_1920_c: 28839.8 32680.8 (0.88x)
chrRangeToJpeg8_1920_c: 23074.7 23075.4 (1.00x)
chrRangeToJpeg16_1920_c: 17318.9 24996.0 (0.69x)
lumRangeFromJpeg8_1920_c: 15389.7 15384.5 (1.00x)
lumRangeFromJpeg16_1920_c: 15388.2 17306.7 (0.89x)
lumRangeToJpeg8_1920_c: 19227.8 19226.6 (1.00x)
lumRangeToJpeg16_1920_c: 15387.0 21146.3 (0.73x)
aarch64 A76:
chrRangeFromJpeg8_1920_c: 6324.4 6268.1 (1.01x)
chrRangeFromJpeg16_1920_c: 6339.9 11521.5 (0.55x)
chrRangeToJpeg8_1920_c: 9656.0 9612.8 (1.00x)
chrRangeToJpeg16_1920_c: 6340.4 11651.8 (0.54x)
lumRangeFromJpeg8_1920_c: 4422.0 4420.8 (1.00x)
lumRangeFromJpeg16_1920_c: 4420.9 5762.0 (0.77x)
lumRangeToJpeg8_1920_c: 5949.1 5977.5 (1.00x)
lumRangeToJpeg16_1920_c: 4446.8 5946.2 (0.75x)
NOTE: all SIMD optimizations for range_convert have been disabled.
They will be re-enabled when they are fixed for each architecture.
NOTE2: the same issue still exists in rgb2yuv conversions, which is not
addressed in this commit.
2024-12-05 21:10:29 +01:00
Ramiro Polla
58bcdeb742
swscale/aarch64/range_convert: saturate output instead of limiting input
...
aarch64 A55:
chrRangeFromJpeg8_1920_c: 28836.2 (1.00x)
chrRangeFromJpeg8_1920_neon: 5312.6 (5.43x) 5313.9 (5.43x)
chrRangeToJpeg8_1920_c: 44196.2 (1.00x)
chrRangeToJpeg8_1920_neon: 6034.6 (7.32x) 5551.3 (7.96x)
lumRangeFromJpeg8_1920_c: 15388.5 (1.00x)
lumRangeFromJpeg8_1920_neon: 3150.7 (4.88x) 3152.3 (4.88x)
lumRangeToJpeg8_1920_c: 23069.7 (1.00x)
lumRangeToJpeg8_1920_neon: 3873.2 (5.96x) 3628.7 (6.36x)
aarch64 A76:
chrRangeFromJpeg8_1920_c: 6334.7 (1.00x)
chrRangeFromJpeg8_1920_neon: 2264.5 (2.80x) 2344.5 (2.70x)
chrRangeToJpeg8_1920_c: 11474.5 (1.00x)
chrRangeToJpeg8_1920_neon: 2646.5 (4.34x) 2824.2 (4.06x)
lumRangeFromJpeg8_1920_c: 4453.2 (1.00x)
lumRangeFromJpeg8_1920_neon: 1104.8 (4.03x) 1104.5 (4.03x)
lumRangeToJpeg8_1920_c: 6645.0 (1.00x)
lumRangeToJpeg8_1920_neon: 1310.5 (5.07x) 1329.8 (5.00x)
2024-12-05 21:10:29 +01:00
Ramiro Polla
2d1358a84d
swscale/range_convert: saturate output instead of limiting input
...
For bit depths <= 14, the result is saturated to 15 bits.
For bit depths > 14, the result is saturated to 19 bits.
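The behaviour described above can be sketched as a post-conversion clamp (illustrative): instead of limiting the input before the multiply-add, the result is saturated afterwards, to 15 bits for bit depths <= 14 and to 19 bits for larger depths.

```c
#include <assert.h>
#include <stdint.h>

/* Saturate a converted value to an unsigned range of `bits` bits --
 * a sketch of "saturate output instead of limiting input". */
static int32_t saturate(int64_t v, int bits)
{
    int64_t max = ((int64_t)1 << bits) - 1;
    return (int32_t)(v < 0 ? 0 : v > max ? max : v);
}
```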
x86_64:
chrRangeFromJpeg8_1920_c: 2126.5 2127.4 (1.00x)
chrRangeFromJpeg16_1920_c: 2331.4 2325.2 (1.00x)
chrRangeToJpeg8_1920_c: 3163.0 3166.9 (1.00x)
chrRangeToJpeg16_1920_c: 3163.7 2152.4 (1.47x)
lumRangeFromJpeg8_1920_c: 1262.2 1263.0 (1.00x)
lumRangeFromJpeg16_1920_c: 1079.5 1080.5 (1.00x)
lumRangeToJpeg8_1920_c: 1860.5 1886.8 (0.99x)
lumRangeToJpeg16_1920_c: 1910.2 1077.0 (1.77x)
aarch64 A55:
chrRangeFromJpeg8_1920_c: 28836.2 28835.2 (1.00x)
chrRangeFromJpeg16_1920_c: 28840.1 28839.8 (1.00x)
chrRangeToJpeg8_1920_c: 44196.2 23074.7 (1.92x)
chrRangeToJpeg16_1920_c: 36527.3 17318.9 (2.11x)
lumRangeFromJpeg8_1920_c: 15388.5 15389.7 (1.00x)
lumRangeFromJpeg16_1920_c: 15389.3 15388.2 (1.00x)
lumRangeToJpeg8_1920_c: 23069.7 19227.8 (1.20x)
lumRangeToJpeg16_1920_c: 19227.8 15387.0 (1.25x)
aarch64 A76:
chrRangeFromJpeg8_1920_c: 6334.7 6324.4 (1.00x)
chrRangeFromJpeg16_1920_c: 6336.0 6339.9 (1.00x)
chrRangeToJpeg8_1920_c: 11474.5 9656.0 (1.19x)
chrRangeToJpeg16_1920_c: 9640.5 6340.4 (1.52x)
lumRangeFromJpeg8_1920_c: 4453.2 4422.0 (1.01x)
lumRangeFromJpeg16_1920_c: 4414.2 4420.9 (1.00x)
lumRangeToJpeg8_1920_c: 6645.0 5949.1 (1.12x)
lumRangeToJpeg16_1920_c: 6005.2 4446.8 (1.35x)
NOTE: all SIMD optimizations for range_convert have been disabled
except for x86, which already had the same behaviour.
They will be re-enabled when they are fixed for each architecture.
2024-12-05 21:10:29 +01:00
Niklas Haas
2d077f9acd
swscale/internal: group user-facing options together
...
This is a preliminary step to separating these into a new struct. This
commit contains no functional changes; it is a pure search-and-replace.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
2024-11-21 12:49:56 +01:00
Ramiro Polla
f7ee0195df
swscale/range_convert: drop redundant conditionals from arch-specific init functions
...
These conditions are already checked for in the main init function.
2024-10-27 13:20:56 +01:00
Ramiro Polla
7728b3357d
swscale/range_convert: call arch-specific init functions from main init function
...
This commit also fixes the issue that the call to ff_sws_init_range_convert()
from sws_init_swscale() was not setting up the arch-specific optimizations.
2024-10-27 13:20:56 +01:00
Niklas Haas
67adb30322
swscale: rename SwsContext to SwsInternal
...
And preserve the public SwsContext as separate name. The motivation here
is that I want to turn SwsContext into a public struct, while keeping the
internal implementation hidden. Additionally, I also want to be able to
use multiple internal implementations, e.g. for GPU devices.
This commit does not include any functional changes. For the most part, it is
a simple rename. The only complications arise from the public facing API
functions, which preserve their current type (and hence require an additional
unwrapping step internally), and the checkasm test framework, which directly
accesses SwsInternal.
For consistency, the affected functions that need to maintain a distinction
have generally been changed to refer to the SwsContext as *sws, and the
SwsInternal as *c.
In an upcoming commit, I will provide a backing definition for the public
SwsContext, and update `sws_internal()` to dereference the internal struct
instead of merely casting it.
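The split described above can be illustrated with a minimal layout (hypothetical field names; the real definitions differ): at this stage the public SwsContext is the leading part of the internal struct, so sws_internal() recovers the hidden state with a cast.

```c
#include <assert.h>

/* Illustrative sketch of a public/internal struct split. */
typedef struct SwsContext {
    int flags;                 /* user-facing options live here */
} SwsContext;

typedef struct SwsInternal {
    SwsContext opts;           /* public part embedded as first member */
    int private_state;         /* hidden implementation detail */
} SwsInternal;

static SwsInternal *sws_internal(SwsContext *sws)
{
    /* valid because opts is the first member of SwsInternal */
    return (SwsInternal *)sws;
}
```

The commit notes that a later change replaces this cast with a proper dereference once the public struct gets its own backing definition.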
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
2024-10-24 22:50:00 +02:00
Martin Storsjö
b9145fcab2
swscale: Fix aarch64 and i386 compilation failures
...
This unbreaks builds after c1a0e65763,
which broke with errors like
src/libswscale/aarch64/rgb2rgb.c:66:25: error: incompatible function pointer types assigning to 'void (*)(const uint8_t *, uint8_t *, uint8_t *, uint8_t *, int, int, int, int, int, const int32_t *)' (aka 'void (*)(const unsigned char *, unsigned char *, unsigned char *, unsigned char *, int, int, int, int, int, const int *)') from 'void (const uint8_t *, uint8_t *, uint8_t *, uint8_t *, int, int, int, int, int, int32_t *)' (aka 'void (const unsigned char *, unsigned char *, unsigned char *, unsigned char *, int, int, int, int, int, int *)') [-Wincompatible-function-pointer-types]
66 | ff_rgb24toyv12 = rgb24toyv12;
| ^ ~~~~~~~~~~~
and
src/libswscale/aarch64/swscale_unscaled.c:213:29: error: incompatible function pointer types assigning to 'SwsFunc' (aka 'int (*)(struct SwsContext *, const unsigned char *const *, const int *, int, int, unsigned char *const *, const int *)') from 'int (SwsContext *, const uint8_t *const *, const int *, int, int, const uint8_t **, const int *)' (aka 'int (struct SwsContext *, const unsigned char *const *, const int *, int, int, const unsigned char **, const int *)') [-Wincompatible-function-pointer-types]
213 | c->convert_unscaled = nv24_to_yuv420p_neon_wrapper;
| ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Signed-off-by: Martin Storsjö <martin@martin.st>
2024-10-08 09:29:07 +03:00
Niklas Haas
c1a0e65763
swscale/internal: constify SwsFunc
...
I want to move away from having random leaf processing functions mutate
plane pointers, and while we're at it, we might as well make the strides
and tables const as well.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
2024-10-07 19:51:34 +02:00
Zhao Zhili
e18b46d95f
swscale/aarch64: Fix rgb24toyv12 only works with aligned width
...
Since c0666d8b, rgb24toyv12 has been broken for widths not aligned to 16.
Add a simple wrapper to handle the non-aligned part.
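The wrapper pattern described above can be sketched with stub functions (illustrative, not the real converters): the NEON routine covers the largest 16-pixel-aligned prefix and the scalar fallback handles the remaining tail.

```c
#include <assert.h>

/* Counters stand in for actual pixel processing in this sketch. */
static int pixels_fast, pixels_tail;

static void neon_part(int n) { pixels_fast += n; }  /* stands in for NEON */
static void c_part(int n)    { pixels_tail += n; }  /* stands in for C */

static void convert_row(int width)
{
    int aligned = width & ~15;        /* largest multiple of 16 <= width */
    if (aligned)
        neon_part(aligned);
    if (width > aligned)
        c_part(width - aligned);      /* non-aligned remainder */
}
```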
Co-authored-by: johzzy <hellojinqiang@gmail.com>
Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>
2024-09-24 10:24:14 +08:00
Ramiro Polla
c0666d8bed
swscale/aarch64/rgb2rgb: add neon implementation for rgb24toyv12
...
A55 A76
rgb24toyv12_16_200_c: 36890.6 17275.5
rgb24toyv12_16_200_neon: 12460.1 ( 2.96x) 5360.8 ( 3.22x)
rgb24toyv12_128_60_c: 83205.1 39884.8
rgb24toyv12_128_60_neon: 27468.4 ( 3.03x) 13552.5 ( 2.94x)
rgb24toyv12_512_16_c: 88111.6 42346.8
rgb24toyv12_512_16_neon: 29126.6 ( 3.03x) 14411.2 ( 2.94x)
rgb24toyv12_1920_4_c: 82068.1 39620.0
rgb24toyv12_1920_4_neon: 27011.6 ( 3.04x) 13492.2 ( 2.94x)
2024-09-06 23:11:13 +02:00
Ramiro Polla
d8848325a6
swscale/aarch64/rgb2rgb: add deinterleaveBytes neon implementation
...
A55 A76
deinterleave_bytes_c: 70342.0 34497.5
deinterleave_bytes_neon: 21594.5 ( 3.26x) 5535.2 ( 6.23x)
deinterleave_bytes_aligned_c: 71340.8 34651.2
deinterleave_bytes_aligned_neon: 8616.8 ( 8.28x) 3996.2 ( 8.67x)
2024-09-06 23:05:09 +02:00
Ramiro Polla
420d443600
swscale/aarch64: cosmetics fix (spaces inside curly braces)
2024-08-26 11:07:49 +02:00
Ramiro Polla
52887683e9
swscale/aarch64: add nv24/nv42 to yuv420p unscaled converter
...
A55 A76
nv24_yuv420p_128_c: 4956.1 1267.0
nv24_yuv420p_128_neon: 3109.1 ( 1.59x) 640.0 ( 1.98x)
nv24_yuv420p_1920_c: 35728.4 11736.2
nv24_yuv420p_1920_neon: 8011.1 ( 4.46x) 2436.0 ( 4.82x)
nv42_yuv420p_128_c: 4956.4 1270.5
nv42_yuv420p_128_neon: 3074.6 ( 1.61x) 639.5 ( 1.99x)
nv42_yuv420p_1920_c: 35685.9 11732.5
nv42_yuv420p_1920_neon: 7995.1 ( 4.46x) 2437.2 ( 4.81x)
2024-08-26 11:04:46 +02:00
Martin Storsjö
cfe0a36352
libswscale: aarch64: Fix the indentation of some macro invocations
...
Signed-off-by: Martin Storsjö <martin@martin.st>
2024-08-22 14:40:30 +03:00
Ramiro Polla
181cd260db
swscale/aarch64/yuv2rgb: add neon yuv42{0,2}p -> gbrp unscaled colorspace converters
...
checkasm --bench on a Raspberry Pi 5 Model B Rev 1.0:
yuv420p_gbrp_128_c: 1243.0
yuv420p_gbrp_128_neon: 453.5
yuv420p_gbrp_1920_c: 18165.5
yuv420p_gbrp_1920_neon: 6700.0
yuv422p_gbrp_128_c: 1463.5
yuv422p_gbrp_128_neon: 471.5
yuv422p_gbrp_1920_c: 21343.7
yuv422p_gbrp_1920_neon: 6743.5
2024-08-18 22:26:17 +02:00
Zhao Zhili
4d90a76986
swscale/aarch64: Add argb/abgr to yuv
...
Test on Apple M1 with kperf:
: -O3 : -O3 -fno-vectorize
abgr_to_uv_8_c : 19.4 : 26.1
abgr_to_uv_8_neon : 29.9 : 51.1
abgr_to_uv_128_c : 146.4 : 558.9
abgr_to_uv_128_neon : 85.1 : 83.4
abgr_to_uv_1080_c : 1162.6 : 4786.4
abgr_to_uv_1080_neon : 819.6 : 826.6
abgr_to_uv_1920_c : 2063.6 : 8492.1
abgr_to_uv_1920_neon : 1435.1 : 1447.1
abgr_to_uv_half_8_c : 16.4 : 11.4
abgr_to_uv_half_8_neon : 35.6 : 20.4
abgr_to_uv_half_128_c : 108.6 : 359.4
abgr_to_uv_half_128_neon : 75.4 : 42.6
abgr_to_uv_half_1080_c : 883.4 : 2885.6
abgr_to_uv_half_1080_neon : 460.6 : 481.1
abgr_to_uv_half_1920_c : 1553.6 : 5106.9
abgr_to_uv_half_1920_neon : 817.6 : 820.4
abgr_to_y_8_c : 6.1 : 26.4
abgr_to_y_8_neon : 40.6 : 6.4
abgr_to_y_128_c : 99.9 : 390.1
abgr_to_y_128_neon : 67.4 : 55.9
abgr_to_y_1080_c : 735.9 : 3170.4
abgr_to_y_1080_neon : 534.6 : 536.6
abgr_to_y_1920_c : 1279.4 : 6016.4
abgr_to_y_1920_neon : 932.6 : 927.6
Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>
2024-07-05 16:32:31 +08:00
Zhao Zhili
52422133ae
swscale/aarch64: Add bgra/rgba to yuv
...
Test on Apple M1 with kperf
: -O3 : -O3 -fno-vectorize
bgra_to_uv_8_c : 13.4 : 27.5
bgra_to_uv_8_neon : 37.4 : 41.7
bgra_to_uv_128_c : 155.9 : 550.2
bgra_to_uv_128_neon : 91.7 : 92.7
bgra_to_uv_1080_c : 1173.2 : 4558.2
bgra_to_uv_1080_neon : 822.7 : 809.5
bgra_to_uv_1920_c : 2078.2 : 8115.2
bgra_to_uv_1920_neon : 1437.7 : 1438.7
bgra_to_uv_half_8_c : 17.9 : 14.2
bgra_to_uv_half_8_neon : 37.4 : 10.5
bgra_to_uv_half_128_c : 103.9 : 326.0
bgra_to_uv_half_128_neon : 73.9 : 68.7
bgra_to_uv_half_1080_c : 850.2 : 3732.0
bgra_to_uv_half_1080_neon : 484.2 : 490.0
bgra_to_uv_half_1920_c : 1479.2 : 4942.7
bgra_to_uv_half_1920_neon : 824.2 : 824.7
bgra_to_y_8_c : 8.2 : 29.5
bgra_to_y_8_neon : 18.2 : 32.7
bgra_to_y_128_c : 101.4 : 361.5
bgra_to_y_128_neon : 74.9 : 73.7
bgra_to_y_1080_c : 739.4 : 3018.0
bgra_to_y_1080_neon : 613.4 : 544.2
bgra_to_y_1920_c : 1298.7 : 5326.0
bgra_to_y_1920_neon : 918.7 : 934.2
Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>
2024-07-05 16:32:31 +08:00
Zhao Zhili
b8b71be07a
swscale/aarch64: Add bgr24 to yuv
...
Test on Apple M1 with kperf
: -O3 : -O3 -fno-vectorize
bgr24_to_uv_8_c : 28.5 : 52.5
bgr24_to_uv_8_neon : 54.5 : 59.7
bgr24_to_uv_128_c : 294.0 : 830.7
bgr24_to_uv_128_neon : 99.7 : 112.0
bgr24_to_uv_1080_c : 965.0 : 6624.0
bgr24_to_uv_1080_neon : 751.5 : 754.7
bgr24_to_uv_1920_c : 1693.2 : 11554.5
bgr24_to_uv_1920_neon : 1292.5 : 1307.5
bgr24_to_uv_half_8_c : 54.2 : 37.0
bgr24_to_uv_half_8_neon : 27.2 : 22.5
bgr24_to_uv_half_128_c : 127.2 : 392.5
bgr24_to_uv_half_128_neon : 63.0 : 52.0
bgr24_to_uv_half_1080_c : 880.2 : 3329.0
bgr24_to_uv_half_1080_neon : 401.5 : 390.7
bgr24_to_uv_half_1920_c : 1585.7 : 6390.7
bgr24_to_uv_half_1920_neon : 694.7 : 698.7
bgr24_to_y_8_c : 21.7 : 22.5
bgr24_to_y_8_neon : 797.2 : 25.5
bgr24_to_y_128_c : 88.0 : 280.5
bgr24_to_y_128_neon : 63.7 : 55.0
bgr24_to_y_1080_c : 616.7 : 2208.7
bgr24_to_y_1080_neon : 900.0 : 452.0
bgr24_to_y_1920_c : 1093.2 : 3894.7
bgr24_to_y_1920_neon : 777.2 : 767.5
Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>
2024-07-05 16:32:31 +08:00
Ramiro Polla
75f1a8e071
swscale/aarch64: add neon {lum,chr}ConvertRange
...
chrRangeFromJpeg_8_c: 29.2
chrRangeFromJpeg_8_neon: 19.5
chrRangeFromJpeg_24_c: 80.5
chrRangeFromJpeg_24_neon: 34.0
chrRangeFromJpeg_128_c: 413.7
chrRangeFromJpeg_128_neon: 156.0
chrRangeFromJpeg_144_c: 471.0
chrRangeFromJpeg_144_neon: 174.2
chrRangeFromJpeg_256_c: 842.0
chrRangeFromJpeg_256_neon: 305.5
chrRangeFromJpeg_512_c: 1699.0
chrRangeFromJpeg_512_neon: 608.0
chrRangeToJpeg_8_c: 51.7
chrRangeToJpeg_8_neon: 22.7
chrRangeToJpeg_24_c: 149.7
chrRangeToJpeg_24_neon: 38.0
chrRangeToJpeg_128_c: 761.7
chrRangeToJpeg_128_neon: 176.7
chrRangeToJpeg_144_c: 866.2
chrRangeToJpeg_144_neon: 198.7
chrRangeToJpeg_256_c: 1516.5
chrRangeToJpeg_256_neon: 348.7
chrRangeToJpeg_512_c: 3067.2
chrRangeToJpeg_512_neon: 692.7
lumRangeFromJpeg_8_c: 24.0
lumRangeFromJpeg_8_neon: 17.0
lumRangeFromJpeg_24_c: 56.7
lumRangeFromJpeg_24_neon: 21.0
lumRangeFromJpeg_128_c: 294.5
lumRangeFromJpeg_128_neon: 76.7
lumRangeFromJpeg_144_c: 332.5
lumRangeFromJpeg_144_neon: 86.7
lumRangeFromJpeg_256_c: 586.0
lumRangeFromJpeg_256_neon: 152.2
lumRangeFromJpeg_512_c: 1190.0
lumRangeFromJpeg_512_neon: 298.0
lumRangeToJpeg_8_c: 31.7
lumRangeToJpeg_8_neon: 19.5
lumRangeToJpeg_24_c: 83.5
lumRangeToJpeg_24_neon: 24.2
lumRangeToJpeg_128_c: 440.5
lumRangeToJpeg_128_neon: 91.0
lumRangeToJpeg_144_c: 504.2
lumRangeToJpeg_144_neon: 101.0
lumRangeToJpeg_256_c: 879.7
lumRangeToJpeg_256_neon: 177.2
lumRangeToJpeg_512_c: 1794.2
lumRangeToJpeg_512_neon: 354.0
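The {lum,chr}RangeFromJpeg/ToJpeg functions benchmarked above convert samples between full (JPEG) and limited (MPEG) range. FFmpeg's C templates do this with fixed-point multiplies; the arithmetic they implement can be sketched in Python for the luma case (a simplified model of the range mapping, not the library's exact fixed-point constants):

```python
def lum_range_to_jpeg(y, bits=8):
    # Expand limited-range luma [16..235] (scaled for bit depth)
    # to full range [0..2^bits - 1].
    lo, hi = 16 << (bits - 8), 235 << (bits - 8)
    full_max = (1 << bits) - 1
    y = min(max(y, lo), hi)          # clamp to the legal limited range
    return round((y - lo) * full_max / (hi - lo))

def lum_range_from_jpeg(y, bits=8):
    # Compress full-range luma back down to limited range.
    lo, hi = 16 << (bits - 8), 235 << (bits - 8)
    full_max = (1 << bits) - 1
    return round(y * (hi - lo) / full_max) + lo
```

Chroma uses the analogous [16..240] limited range centered at 128; the NEON code vectorizes these affine maps across whole rows.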
2024-06-18 23:12:41 +02:00
Zhao Zhili
9dac8495b0
swscale/aarch64: Add rgb24 to yuv implementation
...
Test on Apple M1:
rgb24_to_uv_8_c: 0.0
rgb24_to_uv_8_neon: 0.2
rgb24_to_uv_128_c: 1.0
rgb24_to_uv_128_neon: 0.5
rgb24_to_uv_1080_c: 7.0
rgb24_to_uv_1080_neon: 5.7
rgb24_to_uv_1920_c: 12.5
rgb24_to_uv_1920_neon: 9.5
rgb24_to_uv_half_8_c: 0.2
rgb24_to_uv_half_8_neon: 0.2
rgb24_to_uv_half_128_c: 1.0
rgb24_to_uv_half_128_neon: 0.5
rgb24_to_uv_half_1080_c: 6.2
rgb24_to_uv_half_1080_neon: 3.0
rgb24_to_uv_half_1920_c: 11.2
rgb24_to_uv_half_1920_neon: 5.2
rgb24_to_y_8_c: 0.2
rgb24_to_y_8_neon: 0.0
rgb24_to_y_128_c: 0.5
rgb24_to_y_128_neon: 0.5
rgb24_to_y_1080_c: 4.7
rgb24_to_y_1080_neon: 3.2
rgb24_to_y_1920_c: 8.0
rgb24_to_y_1920_neon: 5.7
On Pixel 6:
rgb24_to_uv_8_c: 30.7
rgb24_to_uv_8_neon: 56.9
rgb24_to_uv_128_c: 213.9
rgb24_to_uv_128_neon: 173.2
rgb24_to_uv_1080_c: 1649.9
rgb24_to_uv_1080_neon: 1424.4
rgb24_to_uv_1920_c: 2907.9
rgb24_to_uv_1920_neon: 2480.7
rgb24_to_uv_half_8_c: 36.2
rgb24_to_uv_half_8_neon: 33.4
rgb24_to_uv_half_128_c: 167.9
rgb24_to_uv_half_128_neon: 99.4
rgb24_to_uv_half_1080_c: 1293.9
rgb24_to_uv_half_1080_neon: 778.7
rgb24_to_uv_half_1920_c: 2292.7
rgb24_to_uv_half_1920_neon: 1328.7
rgb24_to_y_8_c: 19.7
rgb24_to_y_8_neon: 27.7
rgb24_to_y_128_c: 129.9
rgb24_to_y_128_neon: 96.7
rgb24_to_y_1080_c: 995.4
rgb24_to_y_1080_neon: 767.7
rgb24_to_y_1920_c: 1747.4
rgb24_to_y_1920_neon: 1337.2
Note that both tests use clang as the compiler, which enables
vectorization by default at -O3.
Reviewed-by: Rémi Denis-Courmont <remi@remlab.net>
Reviewed-by: Martin Storsjö <martin@martin.st>
Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>
2024-06-11 01:12:09 +08:00
xufuji456
cc86343b96
lavc/hevcdsp_qpel_neon: using movi.16b instead of movi.2d
...
When building for iOS on arm64, the compiler emits a warning: "instruction movi.2d with immediate #0 may not function correctly on this CPU, converting to movi.16b"
Signed-off-by: xufuji456 <839789740@qq.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2023-11-28 15:54:49 +02:00
Martin Storsjö
a76b409dd0
aarch64: Reindent all assembly to 8/24 column indentation
...
libavcodec/aarch64/vc1dsp_neon.S is skipped here, as it intentionally
uses a layered indentation style to visually show how different
unrolled/interleaved phases fit together.
Signed-off-by: Martin Storsjö <martin@martin.st>
2023-10-21 23:25:54 +03:00
Martin Storsjö
93cda5a9c2
aarch64: Lowercase UXTW/SXTW and similar flags
...
Signed-off-by: Martin Storsjö <martin@martin.st>
2023-10-21 23:25:23 +03:00
Martin Storsjö
184103b310
aarch64: Consistently use lowercase for vector element specifiers
...
Signed-off-by: Martin Storsjö <martin@martin.st>
2023-10-21 23:25:18 +03:00
Hubert Mazur
2537fdc510
sw_scale: Add specializations for hscale 16 to 19
...
Provide arm64 neon optimized implementations for hscale16To19 with
filter sizes 4, 8 and X4.
The tests and benchmarks run on AWS Graviton 2 instances.
The results from a checkasm tool are shown below.
hscale_16_to_19__fs_4_dstW_512_c: 6216.0
hscale_16_to_19__fs_4_dstW_512_neon: 2257.0
hscale_16_to_19__fs_8_dstW_512_c: 10417.7
hscale_16_to_19__fs_8_dstW_512_neon: 3112.5
hscale_16_to_19__fs_12_dstW_512_c: 14890.5
hscale_16_to_19__fs_12_dstW_512_neon: 3899.0
hscale_16_to_19__fs_16_dstW_512_c: 19006.5
hscale_16_to_19__fs_16_dstW_512_neon: 5341.2
hscale_16_to_19__fs_32_dstW_512_c: 36629.5
hscale_16_to_19__fs_32_dstW_512_neon: 9502.7
hscale_16_to_19__fs_40_dstW_512_c: 45477.5
hscale_16_to_19__fs_40_dstW_512_neon: 11552.0
(Note, the checkasm tests for these functions haven't been
merged since they fail on x86.)
Signed-off-by: Hubert Mazur <hum@semihalf.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-11-01 15:24:58 +02:00
Hubert Mazur
9ccf8c5bfc
sw_scale: Add specializations for hscale 16 to 15
...
Add arm64 neon implementations for hscale 16 to 15 with filter
sizes 4, 8 and X4.
The tests and benchmarks run on AWS Graviton 2 instances.
The results from a checkasm tool are shown below.
hscale_16_to_15__fs_4_dstW_512_c: 6703.5
hscale_16_to_15__fs_4_dstW_512_neon: 2298.0
hscale_16_to_15__fs_8_dstW_512_c: 10983.0
hscale_16_to_15__fs_8_dstW_512_neon: 3216.5
hscale_16_to_15__fs_12_dstW_512_c: 15526.0
hscale_16_to_15__fs_12_dstW_512_neon: 3993.0
hscale_16_to_15__fs_16_dstW_512_c: 20183.5
hscale_16_to_15__fs_16_dstW_512_neon: 5369.7
hscale_16_to_15__fs_32_dstW_512_c: 39315.2
hscale_16_to_15__fs_32_dstW_512_neon: 9511.2
hscale_16_to_15__fs_40_dstW_512_c: 48995.7
hscale_16_to_15__fs_40_dstW_512_neon: 11570.0
(Note, the checkasm tests for these functions haven't been
merged since they fail on x86.)
Signed-off-by: Hubert Mazur <hum@semihalf.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-11-01 15:24:53 +02:00
Hubert Mazur
1e9cfa5bb0
sw_scale: Add specializations for hscale 8 to 19
...
Add arm64 neon implementations for hscale 8 to 19 with filter
sizes 4, 8 and X4. The implementations are based on very similar ones
dedicated to hscale 8 to 15. The major change is in how the result is
stored: it is written as int32_t instead of int16_t.
These functions are heavily inspired by patches provided by J. Swinney
and M. Storsjö for hscale8to15, slightly adapted for hscale8to19.
The tests and benchmarks run on AWS Graviton 2 instances. The results
from a checkasm tool are shown below.
hscale_8_to_19__fs_4_dstW_512_c: 5663.2
hscale_8_to_19__fs_4_dstW_512_neon: 1259.7
hscale_8_to_19__fs_8_dstW_512_c: 9306.0
hscale_8_to_19__fs_8_dstW_512_neon: 2020.2
hscale_8_to_19__fs_12_dstW_512_c: 12932.7
hscale_8_to_19__fs_12_dstW_512_neon: 2462.5
hscale_8_to_19__fs_16_dstW_512_c: 16844.2
hscale_8_to_19__fs_16_dstW_512_neon: 4671.2
hscale_8_to_19__fs_32_dstW_512_c: 32803.7
hscale_8_to_19__fs_32_dstW_512_neon: 5474.2
hscale_8_to_19__fs_40_dstW_512_c: 40948.0
hscale_8_to_19__fs_40_dstW_512_neon: 6669.7
Signed-off-by: Hubert Mazur <hum@semihalf.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-11-01 15:24:43 +02:00
Martin Storsjö
cb803a0072
swscale: aarch64: Fix yuv2rgb with negative strides
...
Treat the 32 bit stride registers as signed.
Alternatively, we could make the stride arguments ptrdiff_t instead
of int and change all of the assembly to operate on these registers
at their full 64 bit width, but that would be a much larger and more
intrusive change (and would risk missing some operation, which would
still clamp the intermediates to 32 bits).
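The underlying pitfall can be modelled outside of assembly: zero-extending a 32-bit negative stride (uxtw) produces a huge positive offset, while sign-extending it (sxtw) preserves the value. A small Python model of the two extensions (illustrative only):

```python
MASK32 = 0xFFFFFFFF

def uxtw(v):
    # Zero-extend a 32-bit value to 64 bits: a negative stride
    # becomes an offset of nearly 4 GiB forward.
    return v & MASK32

def sxtw(v):
    # Sign-extend: the negative stride keeps its value, so pointer
    # arithmetic steps backwards through the image as intended.
    v &= MASK32
    return v - (1 << 32) if v & 0x80000000 else v

stride = -1920                       # bottom-up image: one row back per line
assert sxtw(stride) == -1920         # correct: step backwards
assert uxtw(stride) == 2**32 - 1920  # wrong: far out of bounds
```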
Fixes: https://trac.ffmpeg.org/ticket/9985
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-10-27 21:49:26 +03:00
Swinney, Jonathan
0d7caa5b09
swscale/aarch64: add vscale specializations
...
This commit adds new code paths for vscale when filterSize is 2, 4, or
8. By using specialized code with unrolling to match the filterSize we
can improve performance.
On AWS c7g (Graviton 3, Neoverse V1) instances:
before after
yuv2yuvX_2_0_512_accurate_neon: 558.8 268.9
yuv2yuvX_4_0_512_accurate_neon: 637.5 434.9
yuv2yuvX_8_0_512_accurate_neon: 1144.8 806.2
yuv2yuvX_16_0_512_accurate_neon: 2080.5 1853.7
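The vertical scale (yuv2yuvX) being specialized here computes, for each output pixel, a dot product of filterSize coefficients with one pixel from each of filterSize intermediate rows; the specialized paths unroll that inner loop for filterSize 2, 4 and 8. A simplified Python reference of the operation (dithering omitted; the 1<<18 rounding and >>19 shift follow my reading of the 8-bit C template and should be treated as an assumption):

```python
def vscale(filter_coeffs, src_rows, dst_w):
    # Vertical scale: each output pixel is a weighted sum of the same
    # column across filterSize intermediate (15-bit) source rows.
    out = []
    for i in range(dst_w):
        val = 1 << 18                      # rounding before the shift
        for coeff, row in zip(filter_coeffs, src_rows):
            val += coeff * row[i]
        out.append(min(max(val >> 19, 0), 255))
    return out
```

With coefficients summing to 1<<12 and 15-bit intermediates (8-bit value << 7), the shift by 19 recovers 8-bit output; unrolling by filterSize lets the assembly keep all coefficients in registers.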
Signed-off-by: Jonathan Swinney <jswinney@amazon.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-08-16 13:40:42 +03:00
Swinney, Jonathan
3e708722a2
swscale/aarch64: vscale optimization
...
Use scalar times vector multiply-accumulate instructions instead of
vector times vector, removing the need for the slightly slower
replicating load instructions.
On AWS c7g (Graviton 3, Neoverse V1) instances:
yuv2yuvX_8_0_512_accurate_neon: 1144.8 987.4
yuv2yuvX_16_0_512_accurate_neon: 2080.5 1869.4
Signed-off-by: Jonathan Swinney <jswinney@amazon.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-08-16 13:40:42 +03:00
Swinney, Jonathan
75ffca7eef
libswscale/aarch64: add another hscale specialization
...
This specialization handles the case where filterSize is 4 mod 8, e.g.
12, 20, etc. AArch64 previously fell back to the C function for this
case; this implementation speeds it up significantly.
hscale_8_to_15__fs_12_dstW_512_c: 6234.1
hscale_8_to_15__fs_12_dstW_512_neon: 1505.6
Signed-off-by: Jonathan Swinney <jswinney@amazon.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-08-16 12:08:38 +03:00
Swinney, Jonathan
0ea61725b1
swscale/aarch64: add hscale specializations
...
This patch adds code to support specializations of the hscale function
and adds a specialization for filterSize == 4.
ff_hscale8to15_4_neon is a complete rewrite. Since the main bottleneck
here is loading the data from src, this data is loaded a whole block
ahead and stored back to the stack, to be loaded again with ld4. This
arranges the data for the most efficient use of the vector instructions
and removes the need for completion adds at the end. The number of C
iterations covered per iteration of the assembly is increased from 4 to
8, but because of the prefetching, a special section without
prefetching is needed when dstW < 16.
This improves speed on Graviton 2 (Neoverse N1) dramatically in the case
where previously fs=8 would have been required.
before: hscale_8_to_15__fs_8_dstW_512_neon: 1962.8
after : hscale_8_to_15__fs_4_dstW_512_neon: 1220.9
Signed-off-by: Jonathan Swinney <jswinney@amazon.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-05-28 01:09:05 +03:00
Martin Storsjö
70db14376c
swscale: aarch64: Optimize the final summation in the hscale routine
...
Before: Cortex A53 A72 A73 Graviton 2 Graviton 3
hscale_8_to_15_width8_neon: 8273.0 4602.5 4289.5 2429.7 1629.1
hscale_8_to_15_width16_neon: 12405.7 6803.0 6359.0 3549.0 2378.4
hscale_8_to_15_width32_neon: 21258.7 11491.7 11469.2 5797.2 3919.6
hscale_8_to_15_width40_neon: 25652.0 14173.7 12488.2 6893.5 4810.4
After:
hscale_8_to_15_width8_neon: 7633.0 3981.5 3350.2 1980.7 1261.1
hscale_8_to_15_width16_neon: 11666.7 5951.0 5512.0 3080.7 2131.4
hscale_8_to_15_width32_neon: 20900.7 10733.2 9481.7 5275.2 3862.1
hscale_8_to_15_width40_neon: 24826.0 13536.2 11502.0 6397.2 4731.9
Overall, this gives an 8-29% speedup for the smaller filter sizes and
around 1-8% for the larger ones.
Inspired by a patch by Jonathan Swinney <jswinney@amazon.com>.
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-04-22 10:49:46 +03:00
Anton Khirnov
1f80789bf7
sws: rename SwsContext.swscale to convert_unscaled
...
That function pointer is now used only for unscaled conversion.
2021-07-03 15:57:53 +02:00
Andreas Rheinhardt
f3c197b129
Include attributes.h directly
...
Some files currently rely on libavutil/cpu.h to include it for them;
yet said file won't include it any more after the currently deprecated
functions are removed, so include attributes.h directly.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2021-04-19 14:34:10 +02:00
Lynne
3e098cca6e
aarch64/yuv2rgb_neon: fix return value
...
We returned 0 for this particular architecture, but should instead
return the number of lines. This fixes callers that check that the
return value matches what they expect.
2020-07-09 10:33:14 +01:00