Commit graph

36921 commits

Author SHA1 Message Date
Michael Niedermayer
ad7a3f5294 avcodec/utils: Fix memleak with subtitles and sidedata
Fixes: 454/fuzz-3-ffmpeg_SUBTITLE_AV_CODEC_ID_MOV_TEXT_fuzzer

Found-by: continuous fuzzing process https://github.com/google/oss-fuzz/tree/master/targets/ffmpeg
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2017-02-03 00:27:28 +01:00
James Almer
c8467abbad x86/rv34dsp: add ff_rv34_idct_dc_add_sse2
Also disable ff_rv34_idct_dc_add_mmx on x86_64 as the presence of sse2
is guaranteed in such builds.

Signed-off-by: James Almer <jamrial@gmail.com>
2017-02-02 17:51:21 -03:00
James Almer
ab5c4d006d x86/vp8dsp: add ff_vp8_idct_dc_add_sse2
Also disable ff_vp8_idct_dc_add_mmx on x86_64 as the presence of sse2
is guaranteed in such builds.

Signed-off-by: James Almer <jamrial@gmail.com>
2017-02-02 17:18:58 -03:00
addr-see-the-website@aetey.se
712ad6b661 libavcodec/cinepakenc.c: comments cleanup (contents)
Change the encoding of the original developer name from ISO-8859-1 to UTF-8.
Remove the stale/completed TODO list.
Fix two small typos.

Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2017-02-02 16:31:13 +01:00
Michael Niedermayer
61f70416f8 avcodec/dca_lbr: Fix off by 1 error in freq check
Fixes out of array read
Fixes: 510/clusterfuzz-testcase-5737865715646464

Found-by: continuous fuzzing process https://github.com/google/oss-fuzz/tree/master/targets/ffmpeg
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2017-02-02 15:48:50 +01:00
Steinar H. Gunderson
08b098169b speedhq: fix out-of-bounds write
Certain alpha run lengths (for SHQ1/SHQ3/SHQ5) could be stored in
both long and short versions, and we would only accept the short version,
returning -1 (invalid code) for the others. This could cause an
out-of-bounds write on malicious input, as discovered by
Andreas Cadhalpun during fuzzing.

Fix by simply allowing both versions, leaving no invalid codes
in the alpha VLC.

Signed-off-by: Andreas Cadhalpun <Andreas.Cadhalpun@googlemail.com>
2017-02-02 01:12:07 +01:00
Michael Niedermayer
0126cd95cc avcodec/ituh263dec: Correct timestamp recovery for B frames
Improves u263_b-frames_5.avi

Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2017-02-01 22:01:34 +01:00
Paul B Mahol
71ca855c9d avcodec/wmalosslessdec: remove warning message as bug is fixed
Signed-off-by: Paul B Mahol <onemda@gmail.com>
2017-02-01 19:35:24 +01:00
bnnm
c61b28e042 avcodec/atrac3: Add multichannel joint stereo ATRAC3
Multichannel joint stereo simply interleaves stereo pairs (6ch: 2ch + 2ch + 2ch), so each pair is decoded separatedly.

***

To test my changes, I converted examples to wav with ffmpeg.exe (old and new), and compared them to see they are byte-exact.

Regular 2ch files (JS and normal) were straightforward to test.

For multichannel, to check each JS pair is correctly decoded separatedly I did:
- manually demux 6ch.msf into 3 pairs and convert them (2ch_1.wav + 2ch_2.wav + 2ch_3.wav)
- convert the 6ch.msf file to wav (with my changes)
- manually demux the 6ch.wav into 3 pairs (6ch_d1.wav + 6ch_d2.wav + 6ch_d3.wav)
- compare each pair (ex. 2ch_3.wav vs 6ch_d3.wav): all pairs are byte-exact.

The new code just processes each JS pair separatedly, there are no algorithm changes.
It could be improved a bit but I'm not sure about typical styles.
I've only seen 6ch .MSF (probably the AT3 spec only supports 2ch audio).

Signed-off-by: bnnm <bananaman255@gmail.com>
2017-02-01 19:14:12 +01:00
Michael Niedermayer
4f651c723b avcodec/h263: Remove disabled and wrong code from ff_h263_loop_filter()
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2017-02-01 19:09:27 +01:00
Michael Niedermayer
901c625494 avcodec/ituh263dec: Use correct error codes in ff_h263_decode_mb()
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2017-02-01 19:09:27 +01:00
Michael Niedermayer
e00c516d1e avcodec/ituh263dec: Correct indention
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2017-02-01 19:09:27 +01:00
Carl Eugen Hoyos
aecdb14ad9 lavc/error_resilience: Remove two unused variables. 2017-02-01 17:51:59 +01:00
Michael Niedermayer
b28ae1e09b avcodec/ituh263dec: Implement B frame support with UMV
Fixes: u263_b-frames_1.avi
Fixes part of Ticket1536

return -1 is used here as it is used in similar code in this function, I intend
to replace it by proper error codes in the whole function.

Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2017-02-01 17:35:54 +01:00
Clément Bœsch
26d5caf679 Merge commit 'a1f6a2dfda'
* commit 'a1f6a2dfda':
  ratecontrol: Reorder functions to avoid forward declarations

Merged, but this seems to break the clear separation of 1-pass vs
2-pass.

Merged-by: Clément Bœsch <u@pkh.me>
2017-02-01 14:50:21 +01:00
Clément Bœsch
566bfd59c9 Merge commit 'd639dcdae0'
* commit 'd639dcdae0':
  ratecontrol: Move Xvid-related functions to the place they are actually used

Merged-by: Clément Bœsch <u@pkh.me>
2017-02-01 14:21:36 +01:00
Andreas Cadhalpun
842e98b4d8 pgssubdec: reset rle_data_len/rle_remaining_len on allocation error
The code relies on their validity and otherwise can try to access a NULL
object->rle pointer, causing segmentation faults.

Reviewed-by: Michael Niedermayer <michael@niedermayer.cc>
Signed-off-by: Andreas Cadhalpun <Andreas.Cadhalpun@googlemail.com>
2017-02-01 02:21:28 +01:00
Michael Niedermayer
536ac72f46 Revert "Merge commit '0a39c9ac0b'"
The assumption this is based on is wrong, the code is not always run with bitexact flags

This reverts commit a956164e1e, reversing
changes made to f6005907fd.

Approved-by: James Almer <jamrial@gmail.com>
2017-02-01 02:01:07 +01:00
Michael Niedermayer
3782656631 avcodec/mjpegdec: Check for for the bitstream end in mjpeg_decode_scan_progressive_ac()
Fixes timeout
Fixes: 496/clusterfuzz-testcase-5805083497332736

Found-by: continuous fuzzing process https://github.com/google/oss-fuzz/tree/master/targets/ffmpeg
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2017-02-01 01:36:50 +01:00
James Almer
1df08cae82 Merge commit 'b4bb959383'
* commit 'b4bb959383':
  ratecontrol: Drop commented out cruft

Merged-by: James Almer <jamrial@gmail.com>
2017-01-31 15:50:56 -03:00
James Almer
ba5d089381 Merge commit 'd06dfaa5cb'
* commit 'd06dfaa5cb':
  x86: huffyuv: Use EXTERNAL_SSSE3_FAST convenience macro where appropriate

Merged-by: James Almer <jamrial@gmail.com>
2017-01-31 15:36:49 -03:00
James Almer
ac774cfa57 Merge commit '4efab89332'
* commit '4efab89332':
  x86: Use *_FAST/*_SLOW CPU feature detection macros where appropriate

Merged-by: James Almer <jamrial@gmail.com>
2017-01-31 15:08:19 -03:00
James Almer
a956164e1e Merge commit '0a39c9ac0b'
* commit '0a39c9ac0b':
  x86: hpeldsp: Don't check for bitexact flag when initializing VP3-specific code

Merged-by: James Almer <jamrial@gmail.com>
2017-01-31 14:59:29 -03:00
James Almer
f6005907fd Merge commit '95c1df929b'
* commit '95c1df929b':
  x86: hpeldsp: Drop unused function parameters

Merged-by: James Almer <jamrial@gmail.com>
2017-01-31 14:56:11 -03:00
James Almer
4d0e89ce27 Merge commit 'c3e83ad3b7'
* commit 'c3e83ad3b7':
  x86: hpeldsp: Use EXTERNAL_SSE2_FAST where appropriate

Merged-by: James Almer <jamrial@gmail.com>
2017-01-31 14:53:27 -03:00
James Almer
ca8a3978e5 Merge commit '1dfc3cf89d'
* commit '1dfc3cf89d':
  x86: hpeldsp: Split off VP3-specific bits into a separate file

Merged-by: James Almer <jamrial@gmail.com>
2017-01-31 14:49:29 -03:00
Clément Bœsch
4039076dc3 Merge commit '76f7e70aa0'
* commit '76f7e70aa0':
  h264dec: handle zero-sized NAL units in get_last_needed_nal()

See 641dccc2aa

Merged-by: Clément Bœsch <cboesch@gopro.com>
2017-01-31 17:23:14 +01:00
Clément Bœsch
7c300a8ed4 lavc/hevc: remove a few random spaces to reduce diff with libav 2017-01-31 17:02:24 +01:00
Clément Bœsch
78d16eb452 Merge commit 'fca3c3b619'
* commit 'fca3c3b619':
  hevc: Add AVX2 DC IDCT

Mostly noop as we already have that code.

In the ASM, code is merged with the exception of SECTION which is kept
uppercase for consistency with the rest of the codebase.

Still in the ASM, the prototype comment is fixed to honor the '_' added
from the original commit.

idct_dc_proto() is dropped as it's not used anymore here.

Merged-by: Clément Bœsch <cboesch@gopro.com>
2017-01-31 16:53:37 +01:00
Clément Bœsch
05018c2cda Merge commit 'cc16da75c2'
* commit 'cc16da75c2':
  hevc: Add coefficient limiting to speed up IDCT

Noop again as we have these changes already, only random spacing
changes.

Merged-by: Clément Bœsch <cboesch@gopro.com>
2017-01-31 16:05:06 +01:00
Clément Bœsch
a604115f72 Merge commit 'a92fd8a062'
* commit 'a92fd8a062':
  hevc: Add DC IDCT

Noop, only spacing adjusted.

Merged-by: Clément Bœsch <cboesch@gopro.com>
2017-01-31 15:55:44 +01:00
Clément Bœsch
2456efcc0f Merge commit '4f247de3b7'
* commit '4f247de3b7':
  hevcdsp_template: Templatize IDCT

This commit is a noop as we already have that code from a previous
commits (see 92cccb7bcd).

Spacing is adjusted to reduce the diff.

Merged-by: Clément Bœsch <cboesch@gopro.com>
2017-01-31 15:49:12 +01:00
Clément Bœsch
d0e132bab6 Merge commit '1bd890ad17'
* commit '1bd890ad17':
  hevc: Separate adding residual to prediction from IDCT

This commit should be a noop but isn't because of the following renames:

- transform_add  → add_residual
- transform_skip → dequant
- idct_4x4_luma  → transform_4x4_luma

Merged-by: Clément Bœsch <cboesch@gopro.com>
2017-01-31 15:31:34 +01:00
Carl Eugen Hoyos
12f7c091e8 lavc/alac: Export samplerate.
Fixes ticket #6096.
2017-01-31 10:49:40 +01:00
Clément Bœsch
3a3554871a lavc/hevcdsp: fix pretty printing mistake
"Issue" introduced in 83976e40e8.
2017-01-30 12:03:30 +01:00
Matthieu Bouron
2ae8278832 lavc/mjpegdec: consume SOS data even if the frame is discarded
Speeds up next marker search when a SOS marker is found but the frame is
discarded (which happens in avformat_find_stream_info).
2017-01-29 21:54:16 +01:00
Michael Niedermayer
f28299da8d avcodec/h264dec: Clear ref_count on slice header processing failure
Fixes using freed memory
Introduced in 7448019890
Fixes: 471/fuzz-1-ffmpeg_VIDEO_AV_CODEC_ID_H264_fuzzer

Found-by: continuous fuzzing process https://github.com/google/oss-fuzz/tree/master/targets/ffmpeg
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2017-01-27 00:52:49 +01:00
Paul B Mahol
b4a13d442a avcodecc/ccaption_dec: remove extra word from long codec description
Signed-off-by: Paul B Mahol <onemda@gmail.com>
2017-01-25 12:00:02 +01:00
Michael Niedermayer
2080bc3371 avcodec/utils: correct align value for interplay
Fixes out of array access
Fixes: 452/fuzz-1-ffmpeg_VIDEO_AV_CODEC_ID_INTERPLAY_VIDEO_fuzzer

Found-by: continuous fuzzing process https://github.com/google/oss-fuzz/tree/master/targets/ffmpeg
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2017-01-25 00:56:37 +01:00
Carl Eugen Hoyos
6d6faa2a2d lavc/svq3: Fail for media key encryption.
Tested-by: ami_stuff

Fixes a part of ticket #6094.
2017-01-24 23:40:13 +01:00
Michael Niedermayer
9e6a242755 avcodec/vp56: Check for the bitstream end, pass error codes on
Fixes timeout
Fixes: 446/fuzz-3-ffmpeg_VIDEO_AV_CODEC_ID_VP6_fuzzer

Found-by: continuous fuzzing process https://github.com/google/oss-fuzz/tree/master/targets/ffmpeg
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2017-01-24 23:01:04 +01:00
Martin Storsjö
9f10cff610 aarch64: Add NEON optimizations for 10 and 12 bit vp9 loop filter
This work is sponsored by, and copyright, Google.

This is similar to the arm version, but due to the larger registers
on aarch64, we can do 8 pixels at a time for all filter sizes.

Examples of runtimes vs the 32 bit version, on a Cortex A53:
                                             ARM AArch64
vp9_loop_filter_h_4_8_10bpp_neon:          213.2   172.6
vp9_loop_filter_h_8_8_10bpp_neon:          281.2   244.2
vp9_loop_filter_h_16_8_10bpp_neon:         657.0   444.5
vp9_loop_filter_h_16_16_10bpp_neon:       1280.4   877.7
vp9_loop_filter_mix2_h_44_16_10bpp_neon:   397.7   358.0
vp9_loop_filter_mix2_h_48_16_10bpp_neon:   465.7   429.0
vp9_loop_filter_mix2_h_84_16_10bpp_neon:   465.7   428.0
vp9_loop_filter_mix2_h_88_16_10bpp_neon:   533.7   499.0
vp9_loop_filter_mix2_v_44_16_10bpp_neon:   271.5   244.0
vp9_loop_filter_mix2_v_48_16_10bpp_neon:   330.0   305.0
vp9_loop_filter_mix2_v_84_16_10bpp_neon:   329.0   306.0
vp9_loop_filter_mix2_v_88_16_10bpp_neon:   386.0   365.0
vp9_loop_filter_v_4_8_10bpp_neon:          150.0   115.2
vp9_loop_filter_v_8_8_10bpp_neon:          209.0   175.5
vp9_loop_filter_v_16_8_10bpp_neon:         492.7   345.2
vp9_loop_filter_v_16_16_10bpp_neon:        951.0   682.7

This is significantly faster than the ARM version in almost
all cases except for the mix2 functions.

Based on START_TIMER/STOP_TIMER wrapping around a few individual
functions, the speedup vs C code is around 2-3x.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-01-24 22:36:11 +02:00
Martin Storsjö
ceb36b8178 aarch64: Add NEON optimizations for 10 and 12 bit vp9 itxfm
This work is sponsored by, and copyright, Google.

Compared to the arm version, on aarch64 we can keep the full 8x8
transform in registers, and for 16x16 and 32x32, we can process
it in slices of 4 pixels instead of 2.

Examples of runtimes vs the 32 bit version, on a Cortex A53:
                                                ARM  AArch64
vp9_inv_adst_adst_4x4_sub4_add_10_neon:       111.0    109.7
vp9_inv_adst_adst_8x8_sub8_add_10_neon:       914.0    733.5
vp9_inv_adst_adst_16x16_sub16_add_10_neon:   5184.0   3745.7
vp9_inv_dct_dct_4x4_sub1_add_10_neon:          65.0     65.7
vp9_inv_dct_dct_4x4_sub4_add_10_neon:         100.0     96.7
vp9_inv_dct_dct_8x8_sub1_add_10_neon:         111.0    119.7
vp9_inv_dct_dct_8x8_sub8_add_10_neon:         618.0    494.7
vp9_inv_dct_dct_16x16_sub1_add_10_neon:       295.1    284.6
vp9_inv_dct_dct_16x16_sub2_add_10_neon:      2303.2   1883.9
vp9_inv_dct_dct_16x16_sub8_add_10_neon:      2984.8   2189.3
vp9_inv_dct_dct_16x16_sub16_add_10_neon:     3890.0   2799.4
vp9_inv_dct_dct_32x32_sub1_add_10_neon:      1044.4   1012.7
vp9_inv_dct_dct_32x32_sub2_add_10_neon:     13333.7   9695.1
vp9_inv_dct_dct_32x32_sub16_add_10_neon:    18531.3  12459.8
vp9_inv_dct_dct_32x32_sub32_add_10_neon:    24470.7  16160.2
vp9_inv_wht_wht_4x4_sub4_add_10_neon:          83.0     79.7

The larger transforms are significantly faster than the corresponding
ARM versions.

The speedup vs C code is smaller than in 32 bit mode, probably
because the 64 bit intermediates in the C code can be expressed
more efficiently in aarch64.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-01-24 22:36:08 +02:00
Martin Storsjö
638eceed47 aarch64: Add NEON optimizations for 10 and 12 bit vp9 MC
This work is sponsored by, and copyright, Google.

This has mostly got the same differences to the 8 bit version as
in the arm version. For the horizontal filters, we do 16 pixels
in parallel as well. For the 8 pixel wide vertical filters, we can
accumulate 4 rows before storing, just as in the 8 bit version.

Examples of runtimes vs the 32 bit version, on a Cortex A53:
                                           ARM   AArch64
vp9_avg4_10bpp_neon:                      35.7      30.7
vp9_avg8_10bpp_neon:                      93.5      84.7
vp9_avg16_10bpp_neon:                    324.4     296.6
vp9_avg32_10bpp_neon:                   1236.5    1148.2
vp9_avg64_10bpp_neon:                   4639.6    4571.1
vp9_avg_8tap_smooth_4h_10bpp_neon:       130.0     128.0
vp9_avg_8tap_smooth_4hv_10bpp_neon:      440.0     440.5
vp9_avg_8tap_smooth_4v_10bpp_neon:       114.0     105.5
vp9_avg_8tap_smooth_8h_10bpp_neon:       327.0     314.0
vp9_avg_8tap_smooth_8hv_10bpp_neon:      918.7     865.4
vp9_avg_8tap_smooth_8v_10bpp_neon:       330.0     300.2
vp9_avg_8tap_smooth_16h_10bpp_neon:     1187.5    1155.5
vp9_avg_8tap_smooth_16hv_10bpp_neon:    2663.1    2591.0
vp9_avg_8tap_smooth_16v_10bpp_neon:     1107.4    1078.3
vp9_avg_8tap_smooth_64h_10bpp_neon:    17754.6   17454.7
vp9_avg_8tap_smooth_64hv_10bpp_neon:   33285.2   33001.5
vp9_avg_8tap_smooth_64v_10bpp_neon:    16066.9   16048.6
vp9_put4_10bpp_neon:                      25.5      21.7
vp9_put8_10bpp_neon:                      56.0      52.0
vp9_put16_10bpp_neon/armv8:              183.0     163.1
vp9_put32_10bpp_neon/armv8:              678.6     563.1
vp9_put64_10bpp_neon/armv8:             2679.9    2195.8
vp9_put_8tap_smooth_4h_10bpp_neon:       120.0     118.0
vp9_put_8tap_smooth_4hv_10bpp_neon:      435.2     435.0
vp9_put_8tap_smooth_4v_10bpp_neon:       107.0      98.2
vp9_put_8tap_smooth_8h_10bpp_neon:       303.0     290.0
vp9_put_8tap_smooth_8hv_10bpp_neon:      893.7     828.7
vp9_put_8tap_smooth_8v_10bpp_neon:       305.5     263.5
vp9_put_8tap_smooth_16h_10bpp_neon:     1089.1    1059.2
vp9_put_8tap_smooth_16hv_10bpp_neon:    2578.8    2452.4
vp9_put_8tap_smooth_16v_10bpp_neon:     1009.5     933.5
vp9_put_8tap_smooth_64h_10bpp_neon:    16223.4   15918.6
vp9_put_8tap_smooth_64hv_10bpp_neon:   32153.0   31016.2
vp9_put_8tap_smooth_64v_10bpp_neon:    14516.5   13748.1

These are generally about as fast as the corresponding ARM
routines on the same CPU (at least on the A53), in most cases
marginally faster.

The speedup vs C code is around 4-9x.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-01-24 22:36:05 +02:00
Martin Storsjö
48ad3fe1be aarch64: vp9dsp: Restructure the bpp checks
This work is sponsored by, and copyright, Google.

This is more in line with how it will be extended for more bitdepths.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-01-24 22:36:02 +02:00
Martin Storsjö
1e5d87eec3 arm: Add NEON optimizations for 10 and 12 bit vp9 loop filter
This work is sponsored by, and copyright, Google.

This is pretty much similar to the 8 bpp version, but in some senses
simpler. All input pixels are 16 bits, and all intermediates also fit
in 16 bits, so there's no lengthening/narrowing in the filter at all.

For the full 16 pixel wide filter, we can only process 4 pixels at a time
(using an implementation very much similar to the one for 8 bpp),
but we can do 8 pixels at a time for the 4 and 8 pixel wide filters with
a different implementation of the core filter.

Examples of relative speedup compared to the C version, from checkasm:
                                   Cortex    A7     A8     A9    A53
vp9_loop_filter_h_4_8_10bpp_neon:          1.83   2.16   1.40   2.09
vp9_loop_filter_h_8_8_10bpp_neon:          1.39   1.67   1.24   1.70
vp9_loop_filter_h_16_8_10bpp_neon:         1.56   1.47   1.10   1.81
vp9_loop_filter_h_16_16_10bpp_neon:        1.94   1.69   1.33   2.24
vp9_loop_filter_mix2_h_44_16_10bpp_neon:   2.01   2.27   1.67   2.39
vp9_loop_filter_mix2_h_48_16_10bpp_neon:   1.84   2.06   1.45   2.19
vp9_loop_filter_mix2_h_84_16_10bpp_neon:   1.89   2.20   1.47   2.29
vp9_loop_filter_mix2_h_88_16_10bpp_neon:   1.69   2.12   1.47   2.08
vp9_loop_filter_mix2_v_44_16_10bpp_neon:   3.16   3.98   2.50   4.05
vp9_loop_filter_mix2_v_48_16_10bpp_neon:   2.84   3.64   2.25   3.77
vp9_loop_filter_mix2_v_84_16_10bpp_neon:   2.65   3.45   2.16   3.54
vp9_loop_filter_mix2_v_88_16_10bpp_neon:   2.55   3.30   2.16   3.55
vp9_loop_filter_v_4_8_10bpp_neon:          2.85   3.97   2.24   3.68
vp9_loop_filter_v_8_8_10bpp_neon:          2.27   3.19   1.96   3.08
vp9_loop_filter_v_16_8_10bpp_neon:         3.42   2.74   2.26   4.40
vp9_loop_filter_v_16_16_10bpp_neon:        2.86   2.44   1.93   3.88

The speedup vs C code measured in checkasm is around 1.1-4x.
These numbers are quite inconclusive though, since the checkasm test
runs multiple filterings on top of each other, so later rounds might
end up with different codepaths (different decisions on which filter
to apply, based on input pixel differences).

Based on START_TIMER/STOP_TIMER wrapping around a few individual
functions, the speedup vs C code is around 2-4x.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-01-24 22:35:59 +02:00
Martin Storsjö
2ed67eba96 arm: Add NEON optimizations for 10 and 12 bit vp9 itxfm
This work is sponsored by, and copyright, Google.

This is structured similarly to the 8 bit version. In the 8 bit
version, the coefficients are 16 bits, and intermediates are 32 bits.

Here, the coefficients are 32 bit. For the 4x4 transforms for 10 bit
content, the intermediates also fit in 32 bits, but for all other
transforms (4x4 for 12 bit content, and 8x8 and larger for both 10
and 12 bit) the intermediates are 64 bit.

For the existing 8 bit case, the 8x8 transform fit all coefficients in
registers; for 10/12 bit, when the coefficients are 32 bit, the 8x8
transform also has to be done in slices of 4 pixels (just as 16x16 and
32x32 for 8 bit).

The slice width also shrinks from 4 elements to 2 elements in parallel
for the 16x16 and 32x32 cases.

The 16 bit coefficients from idct_coeffs and similar tables also need
to be lenghtened to 32 bit in order to be used in multiplication with
vectors with 32 bit elements. This leads to the fixed coefficient
vectors needing more space, leading to more cases where they have to
be reloaded within the transform (in iadst16).

This technically would need testing in checkasm for subpartitions
in increments of 2, but that slows down normal checkasm runs
excessively.

Examples of relative speedup compared to the C version, from checkasm:
                                     Cortex    A7     A8     A9    A53
vp9_inv_adst_adst_4x4_sub4_add_10_neon:      4.83  11.36   5.22   6.77
vp9_inv_adst_adst_8x8_sub8_add_10_neon:      4.12   7.60   4.06   4.84
vp9_inv_adst_adst_16x16_sub16_add_10_neon:   3.93   8.16   4.52   5.35
vp9_inv_dct_dct_4x4_sub1_add_10_neon:        1.36   2.57   1.41   1.61
vp9_inv_dct_dct_4x4_sub4_add_10_neon:        4.24   8.66   5.06   5.81
vp9_inv_dct_dct_8x8_sub1_add_10_neon:        2.63   4.18   1.68   2.87
vp9_inv_dct_dct_8x8_sub4_add_10_neon:        4.52   9.47   4.24   5.39
vp9_inv_dct_dct_8x8_sub8_add_10_neon:        3.45   7.34   3.45   4.30
vp9_inv_dct_dct_16x16_sub1_add_10_neon:      3.56   6.21   2.47   4.32
vp9_inv_dct_dct_16x16_sub2_add_10_neon:      5.68  12.73   5.28   7.07
vp9_inv_dct_dct_16x16_sub8_add_10_neon:      4.42   9.28   4.24   5.45
vp9_inv_dct_dct_16x16_sub16_add_10_neon:     3.41   7.29   3.35   4.19
vp9_inv_dct_dct_32x32_sub1_add_10_neon:      4.52   8.35   3.83   6.40
vp9_inv_dct_dct_32x32_sub2_add_10_neon:      5.86  13.19   6.14   7.04
vp9_inv_dct_dct_32x32_sub16_add_10_neon:     4.29   8.11   4.59   5.06
vp9_inv_dct_dct_32x32_sub32_add_10_neon:     3.31   5.70   3.56   3.84
vp9_inv_wht_wht_4x4_sub4_add_10_neon:        1.89   2.80   1.82   1.97

The speedup compared to the C functions is around 1.3 to 7x for the
full transforms, even higher for the smaller subpartitions.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-01-24 22:35:56 +02:00
Martin Storsjö
a4d4bad75c arm: Add NEON optimizations for 10 and 12 bit vp9 MC
This work is sponsored by, and copyright, Google.

The plain pixel put/copy functions are used from the 8 bit version,
for the double size (e.g. put16 uses ff_vp9_copy32_neon), and a new
copy128 is added.

Compared with the 8 bit version, the filters can no longer use the
trick to accumulate in 16 bit with only saturation at the end, but now
the accumulators need to be 32 bit. This avoids the need to keep track
of which filter index is the largest though, reducing the size of the
executable code for these filters.

For the horizontal filters, we only do 4 or 8 pixels wide in parallel
(while doing two rows at a time), since we don't have enough register
space to filter 16 pixels wide.

For the vertical filters, we still do 4 and 8 pixels in parallel just
as in the 8 bit case, but we need to store the output after every 2
rows instead of after every 4 rows.

Examples of relative speedup compared to the C version, from checkasm:
                               Cortex    A7     A8     A9    A53
vp9_avg4_10bpp_neon:                   2.25   2.44   3.05   2.16
vp9_avg8_10bpp_neon:                   3.66   8.48   3.86   3.50
vp9_avg16_10bpp_neon:                  3.39   8.26   3.37   2.72
vp9_avg32_10bpp_neon:                  4.03  10.20   4.07   3.42
vp9_avg64_10bpp_neon:                  4.15  10.01   4.13   3.70
vp9_avg_8tap_smooth_4h_10bpp_neon:     3.38   6.22   3.41   4.75
vp9_avg_8tap_smooth_4hv_10bpp_neon:    3.89   6.39   4.30   5.32
vp9_avg_8tap_smooth_4v_10bpp_neon:     5.32   9.73   6.34   7.31
vp9_avg_8tap_smooth_8h_10bpp_neon:     4.45   9.40   4.68   6.87
vp9_avg_8tap_smooth_8hv_10bpp_neon:    4.64   8.91   5.44   6.47
vp9_avg_8tap_smooth_8v_10bpp_neon:     6.44  13.42   8.68   8.79
vp9_avg_8tap_smooth_64h_10bpp_neon:    4.66   9.02   4.84   7.71
vp9_avg_8tap_smooth_64hv_10bpp_neon:   4.61   9.14   4.92   7.10
vp9_avg_8tap_smooth_64v_10bpp_neon:    6.90  14.13   9.57  10.41
vp9_put4_10bpp_neon:                   1.33   1.46   2.09   1.33
vp9_put8_10bpp_neon:                   1.57   3.42   1.83   1.84
vp9_put16_10bpp_neon:                  1.55   4.78   2.17   1.89
vp9_put32_10bpp_neon:                  2.06   5.35   2.14   2.30
vp9_put64_10bpp_neon:                  3.00   2.41   1.95   1.66
vp9_put_8tap_smooth_4h_10bpp_neon:     3.19   5.81   3.31   4.63
vp9_put_8tap_smooth_4hv_10bpp_neon:    3.86   6.22   4.32   5.21
vp9_put_8tap_smooth_4v_10bpp_neon:     5.40   9.77   6.08   7.21
vp9_put_8tap_smooth_8h_10bpp_neon:     4.22   8.41   4.46   6.63
vp9_put_8tap_smooth_8hv_10bpp_neon:    4.56   8.51   5.39   6.25
vp9_put_8tap_smooth_8v_10bpp_neon:     6.60  12.43   8.17   8.89
vp9_put_8tap_smooth_64h_10bpp_neon:    4.41   8.59   4.54   7.49
vp9_put_8tap_smooth_64hv_10bpp_neon:   4.43   8.58   5.34   6.63
vp9_put_8tap_smooth_64v_10bpp_neon:    7.26  13.92   9.27  10.92

For the larger 8tap filters, the speedup vs C code is around 4-14x.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-01-24 22:35:50 +02:00
Martin Storsjö
cda9a3e80b arm: vp9dsp: Restructure the bpp checks
This work is sponsored by, and copyright, Google.

This is more in line with how it will be extended for more bitdepths.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-01-24 22:35:44 +02:00
Michael Niedermayer
755933cb5c avcodec/mjpegdec: Check remaining bitstream in ljpeg_decode_yuv_scan()
Fixes timeout
Fixes: 445/fuzz-3-ffmpeg_VIDEO_AV_CODEC_ID_MJPEG_fuzzer
Fixes: 456/fuzz-2-ffmpeg_VIDEO_AV_CODEC_ID_JPEGLS_fuzzer

Found-by: continuous fuzzing process https://github.com/google/oss-fuzz/tree/master/targets/ffmpeg
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2017-01-24 17:50:03 +01:00