@another, @mstorsjo, @gramner
Introduction
I noticed a very clickbait bounty. I quickly realized that the company's real goal was not to get an implementation that overtakes dav1d, but to advertise that Rust is only 5% slower than C. Whether it actually pays out is another matter. The main thing for Prossimo was to make noise about the current rav1d implementation being only 5% slower, so that the general public would think the two languages are equally fast.
I also noticed the blog of a contributor who tried to optimize rav1d, but he didn't get beyond 1%. Actually, I solved his problem: he ended up at 0%. CHICKEN JOCKEY.
Well, the first thing I decided to do was look at how dav1d's data is laid out in CPU cache lines, and I noticed that dav1d really does use large structures. It is desirable to keep structures at 64 bytes or less, so that they fit in a single cache line on most CPUs. Since restructuring them is very difficult, I solved the problem in a simpler way.
At first, out of habit, I tried to align the fields, but I couldn't, because some of the data stuck out. Then I remembered that an enum can be restricted to a strict range of values so that it fits into 1 byte and is convenient to align manually. I set a maximum value for each enum and strictly specified a size of 1 byte for them. I also realized that int fields in these structures are wasteful and decided to shrink them to short (2 bytes).
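A minimal sketch of the idea, not the actual dav1d change: the enum, the structs, and the helper below are hypothetical names invented for illustration. It assumes that one portable way to get a 1-byte enum field is to store the value in a uint8_t, and that narrowing an int to a 16-bit field is only safe when the value range is checked.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical enum for illustration; not a dav1d type. */
enum DemoFilterMode {
    DEMO_FILTER_REGULAR,
    DEMO_FILTER_SMOOTH,
    DEMO_FILTER_SHARP,
    DEMO_FILTER_MODE_MAX = 255 /* explicit upper bound: every value fits in 1 byte */
};

/* Compile-time proof that the enum range fits the 1-byte storage used below. */
static_assert(DEMO_FILTER_MODE_MAX <= UINT8_MAX, "enum must fit in one byte");

/* Before: the enum is stored with its own type (commonly 4 bytes) next to plain ints. */
struct DemoBefore {
    enum DemoFilterMode mode;
    int width4;
    int height4;
};

/* After: the enum value lives in a uint8_t and the ints become 16-bit fields. */
struct DemoAfter {
    uint8_t mode;    /* holds an enum DemoFilterMode value */
    int16_t width4;
    int16_t height4;
};

/* Stores value into a 16-bit field; returns -1 instead of silently truncating. */
static inline int demo_store_i16(int16_t *dst, int value)
{
    if (value < INT16_MIN || value > INT16_MAX)
        return -1;
    *dst = (int16_t)value;
    return 0;
}
```

On a typical x86-64 Linux build, struct DemoBefore occupies 12 bytes and struct DemoAfter 6. The range check matters because a field that used to hold any int now overflows above 32767.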
If you know how to use pahole, you can inspect the object files of release, debug, unoptimized debug, and other builds. By default, C/C++ compilers do not change the size or layout of a structure unless the programmer explicitly specifies a packing or alignment attribute (keyword). The compiler also does not shrink an enum to 1 byte by itself unless the -fshort-enums flag is given, but I had to align manually to optimize the space anyway, so I did this optimization in advance. Also, don't expect the compiler to change int to short by itself.
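The point that the compiler keeps the declared field order, padding included, can be shown with a small sketch. The struct and field names are hypothetical, and the offsets in the comments assume a typical LP64 target such as x86-64 Linux.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout with holes; field names are illustrative only. */
struct DemoHoles {
    uint8_t flag;   /* offset 0, followed by a 7-byte hole */
    void *ptr;      /* offset 8 */
    int32_t count;  /* offset 16, followed by 4 bytes of tail padding */
};

/* Same fields ordered from largest to smallest alignment. */
struct DemoReordered {
    void *ptr;      /* offset 0 */
    int32_t count;  /* offset 8 */
    uint8_t flag;   /* offset 12, followed by 3 bytes of tail padding */
};

/* Reordering must never make the structure larger. */
static_assert(sizeof(struct DemoReordered) <= sizeof(struct DemoHoles),
              "reordered layout should not grow");

/* Size in bytes of the hole between flag and ptr in the unoptimized layout. */
static size_t demo_hole_after_flag(void)
{
    return offsetof(struct DemoHoles, ptr)
         - (offsetof(struct DemoHoles, flag) + sizeof(uint8_t));
}
```

Under those assumptions the first layout takes 24 bytes and the second 16. The 7-byte gap returned by demo_hole_after_flag() is the kind of hole that pahole reports.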

Brief summary of changes
- 1080p: performance up ~3%
- 4K: performance up ~1%
This PR reduces the cost of copying, moving, and creating structure objects on common 64-bit processors, thanks to 8-byte data alignment.
The smaller a structure or class is, the higher the chance that it fits in the CPU cache. Most processors are already 64-bit, so the change won't make things worse.
I described the steps in detail in the description of each commit.
Example pahole output for struct Dav1dFrameContext:
- The comment /* XXX {n} bytes hole, try to pack */ shows where the layout can be optimized by reordering the fields of a structure or class.
Master branch
struct Dav1dFrameContext {
Dav1dRef * seq_hdr_ref; /* 0 8 */
Dav1dSequenceHeader * seq_hdr; /* 8 8 */
Dav1dRef * frame_hdr_ref; /* 16 8 */
Dav1dFrameHeader * frame_hdr; /* 24 8 */
Dav1dThreadPicture refp[7]; /* 32 2072 */
/* --- cacheline 32 boundary (2048 bytes) was 56 bytes ago --- */
Dav1dPicture cur; /* 2104 272 */
/* --- cacheline 37 boundary (2368 bytes) was 8 bytes ago --- */
Dav1dThreadPicture sr_cur; /* 2376 296 */
/* --- cacheline 41 boundary (2624 bytes) was 48 bytes ago --- */
Dav1dRef * mvs_ref; /* 2672 8 */
refmvs_temporal_block * mvs; /* 2680 8 */
/* --- cacheline 42 boundary (2688 bytes) --- */
refmvs_temporal_block * ref_mvs[7]; /* 2688 56 */
Dav1dRef * ref_mvs_ref[7]; /* 2744 56 */
/* --- cacheline 43 boundary (2752 bytes) was 48 bytes ago --- */
Dav1dRef * cur_segmap_ref; /* 2800 8 */
Dav1dRef * prev_segmap_ref; /* 2808 8 */
/* --- cacheline 44 boundary (2816 bytes) --- */
uint8_t * cur_segmap; /* 2816 8 */
const uint8_t * prev_segmap; /* 2824 8 */
unsigned int refpoc[7]; /* 2832 28 */
unsigned int refrefpoc[7][7]; /* 2860 196 */
/* --- cacheline 47 boundary (3008 bytes) was 48 bytes ago --- */
uint8_t gmv_warp_allowed[7]; /* 3056 7 */
/* XXX 1 byte hole, try to pack */
CdfThreadContext in_cdf; /* 3064 24 */
/* --- cacheline 48 boundary (3072 bytes) was 16 bytes ago --- */
CdfThreadContext out_cdf; /* 3088 24 */
struct Dav1dTileGroup * tile; /* 3112 8 */
int n_tile_data_alloc; /* 3120 4 */
int n_tile_data; /* 3124 4 */
struct ScalableMotionParams svc[7][2]; /* 3128 112 */
/* --- cacheline 50 boundary (3200 bytes) was 40 bytes ago --- */
int resize_step[2]; /* 3240 8 */
int resize_start[2]; /* 3248 8 */
const Dav1dContext * c; /* 3256 8 */
/* --- cacheline 51 boundary (3264 bytes) --- */
Dav1dTileState * ts; /* 3264 8 */
int n_ts; /* 3272 4 */
/* XXX 4 bytes hole, try to pack */
const Dav1dDSPContext * dsp; /* 3280 8 */
struct {
recon_b_intra_fn recon_b_intra; /* 3288 8 */
recon_b_inter_fn recon_b_inter; /* 3296 8 */
filter_sbrow_fn filter_sbrow; /* 3304 8 */
filter_sbrow_fn filter_sbrow_deblock_cols; /* 3312 8 */
filter_sbrow_fn filter_sbrow_deblock_rows; /* 3320 8 */
/* --- cacheline 52 boundary (3328 bytes) --- */
void (*filter_sbrow_cdef)(Dav1dTaskContext *, int); /* 3328 8 */
filter_sbrow_fn filter_sbrow_resize; /* 3336 8 */
filter_sbrow_fn filter_sbrow_lr; /* 3344 8 */
backup_ipred_edge_fn backup_ipred_edge; /* 3352 8 */
read_coef_blocks_fn read_coef_blocks; /* 3360 8 */
copy_pal_block_fn copy_pal_block_y; /* 3368 8 */
copy_pal_block_fn copy_pal_block_uv; /* 3376 8 */
read_pal_plane_fn read_pal_plane; /* 3384 8 */
/* --- cacheline 53 boundary (3392 bytes) --- */
read_pal_uv_fn read_pal_uv; /* 3392 8 */
} bd_fn; /* 3288 112 */
int ipred_edge_sz; /* 3400 4 */
/* XXX 4 bytes hole, try to pack */
pixel * ipred_edge[3]; /* 3408 24 */
ptrdiff_t b4_stride; /* 3432 8 */
int w4; /* 3440 4 */
int h4; /* 3444 4 */
int bw; /* 3448 4 */
int bh; /* 3452 4 */
/* --- cacheline 54 boundary (3456 bytes) --- */
int sb128w; /* 3456 4 */
int sb128h; /* 3460 4 */
int sbh; /* 3464 4 */
int sb_shift; /* 3468 4 */
int sb_step; /* 3472 4 */
int sr_sb128w; /* 3476 4 */
uint16_t dq[8][3][2]; /* 3480 96 */
/* --- cacheline 55 boundary (3520 bytes) was 56 bytes ago --- */
const uint8_t * qm[19][3]; /* 3576 456 */
/* --- cacheline 63 boundary (4032 bytes) --- */
BlockContext * a; /* 4032 8 */
int a_sz; /* 4040 4 */
/* XXX 4 bytes hole, try to pack */
refmvs_frame rf; /* 4048 208 */
/* --- cacheline 66 boundary (4224 bytes) was 32 bytes ago --- */
uint8_t jnt_weights[7][7]; /* 4256 49 */
/* XXX 3 bytes hole, try to pack */
/* --- cacheline 67 boundary (4288 bytes) was 20 bytes ago --- */
int bitdepth_max; /* 4308 4 */
struct {
int next_tile_row[2]; /* 4312 8 */
atomic_int entropy_progress; /* 4320 4 */
atomic_int deblock_progress; /* 4324 4 */
atomic_uint * frame_progress; /* 4328 8 */
atomic_uint * copy_lpf_progress; /* 4336 8 */
Av1Block * b; /* 4344 8 */
/* --- cacheline 68 boundary (4352 bytes) --- */
int16_t * cbi; /* 4352 8 */
pixel * pal; /* 4360 8 */
uint8_t * pal_idx; /* 4368 8 */
coef * cf; /* 4376 8 */
int prog_sz; /* 4384 4 */
int cbi_sz; /* 4388 4 */
int pal_sz; /* 4392 4 */
int pal_idx_sz; /* 4396 4 */
int cf_sz; /* 4400 4 */
/* XXX 4 bytes hole, try to pack */
unsigned int * tile_start_off; /* 4408 8 */
} frame_thread; /* 4312 104 */
/* XXX last struct has 1 hole */
/* --- cacheline 69 boundary (4416 bytes) --- */
struct {
uint8_t * level; /* 4416 8 */
Av1Filter * mask; /* 4424 8 */
Av1Restoration * lr_mask; /* 4432 8 */
int mask_sz; /* 4440 4 */
int lr_mask_sz; /* 4444 4 */
int cdef_buf_plane_sz[2]; /* 4448 8 */
int cdef_buf_sbh; /* 4456 4 */
int lr_buf_plane_sz[2]; /* 4460 8 */
int re_sz; /* 4468 4 */
/* XXX 8 bytes hole, try to pack */
/* --- cacheline 70 boundary (4480 bytes) --- */
Av1FilterLUT lim_lut __attribute__((__aligned__(16))); /* 4480 144 */
/* --- cacheline 72 boundary (4608 bytes) was 16 bytes ago --- */
uint8_t lvl[8][4][8][2] __attribute__((__aligned__(16))); /* 4624 512 */
/* --- cacheline 80 boundary (5120 bytes) was 16 bytes ago --- */
int last_sharpness; /* 5136 4 */
/* XXX 4 bytes hole, try to pack */
uint8_t * tx_lpf_right_edge[2]; /* 5144 16 */
uint8_t * cdef_line_buf; /* 5160 8 */
uint8_t * lr_line_buf; /* 5168 8 */
pixel * cdef_line[2][3]; /* 5176 48 */
/* --- cacheline 81 boundary (5184 bytes) was 40 bytes ago --- */
pixel * cdef_lpf_line[3]; /* 5224 24 */
/* --- cacheline 82 boundary (5248 bytes) --- */
pixel * lr_lpf_line[3]; /* 5248 24 */
uint8_t * start_of_tile_row; /* 5272 8 */
int start_of_tile_row_sz; /* 5280 4 */
int need_cdef_lpf_copy; /* 5284 4 */
pixel * p[3]; /* 5288 24 */
/* --- cacheline 83 boundary (5312 bytes) --- */
pixel * sr_p[3]; /* 5312 24 */
int restore_planes; /* 5336 4 */
} __attribute__((__aligned__(16))) lf __attribute__((__aligned__(16))); /* 4416 928 */
/* XXX last struct has 4 bytes of padding, 2 holes */
struct {
pthread_mutex_t lock; /* 5344 40 */
/* --- cacheline 84 boundary (5376 bytes) was 8 bytes ago --- */
pthread_cond_t cond; /* 5384 48 */
struct TaskThreadData * ttd; /* 5432 8 */
/* --- cacheline 85 boundary (5440 bytes) --- */
struct Dav1dTask * tasks; /* 5440 8 */
struct Dav1dTask * tile_tasks[2]; /* 5448 16 */
struct Dav1dTask init_task; /* 5464 32 */
int num_tasks; /* 5496 4 */
int num_tile_tasks; /* 5500 4 */
/* --- cacheline 86 boundary (5504 bytes) --- */
atomic_int init_done; /* 5504 4 */
atomic_int done[2]; /* 5508 8 */
int retval; /* 5516 4 */
int update_set; /* 5520 4 */
atomic_int error; /* 5524 4 */
atomic_int task_counter; /* 5528 4 */
/* XXX 4 bytes hole, try to pack */
struct Dav1dTask * task_head; /* 5536 8 */
struct Dav1dTask * task_tail; /* 5544 8 */
struct Dav1dTask * task_cur_prev; /* 5552 8 */
struct {
atomic_int merge; /* 5560 4 */
/* XXX 4 bytes hole, try to pack */
/* --- cacheline 87 boundary (5568 bytes) --- */
pthread_mutex_t lock; /* 5568 40 */
Dav1dTask * head; /* 5608 8 */
Dav1dTask * tail; /* 5616 8 */
} pending_tasks; /* 5560 64 */
/* XXX last struct has 1 hole */
} task_thread; /* 5344 280 */
/* XXX last struct has 1 hole */
struct FrameTileThreadData tile_thread; /* 5624 16 */
/* XXX last struct has 4 bytes of padding */
/* size: 5648, cachelines: 89, members: 55 */
/* sum members: 5624, holes: 5, sum holes: 16 */
/* padding: 8 */
/* member types with holes: 3, total: 4 */
/* paddings: 2, sum paddings: 8 */
/* forced alignments: 1 */
/* last cacheline: 16 bytes */
} __attribute__((__aligned__(16)));
PR
struct Dav1dFrameContext {
Dav1dRef * seq_hdr_ref; /* 0 8 */
Dav1dSequenceHeader * seq_hdr; /* 8 8 */
Dav1dRef * frame_hdr_ref; /* 16 8 */
Dav1dFrameHeader * frame_hdr; /* 24 8 */
Dav1dThreadPicture refp[7]; /* 32 2016 */
/* --- cacheline 32 boundary (2048 bytes) --- */
Dav1dPicture cur; /* 2048 264 */
/* --- cacheline 36 boundary (2304 bytes) was 8 bytes ago --- */
Dav1dThreadPicture sr_cur; /* 2312 288 */
/* --- cacheline 40 boundary (2560 bytes) was 40 bytes ago --- */
Dav1dRef * mvs_ref; /* 2600 8 */
refmvs_temporal_block * mvs; /* 2608 8 */
refmvs_temporal_block * ref_mvs[7]; /* 2616 56 */
/* --- cacheline 41 boundary (2624 bytes) was 48 bytes ago --- */
Dav1dRef * ref_mvs_ref[7]; /* 2672 56 */
/* --- cacheline 42 boundary (2688 bytes) was 40 bytes ago --- */
Dav1dRef * cur_segmap_ref; /* 2728 8 */
Dav1dRef * prev_segmap_ref; /* 2736 8 */
uint8_t * cur_segmap; /* 2744 8 */
/* --- cacheline 43 boundary (2752 bytes) --- */
const uint8_t * prev_segmap; /* 2752 8 */
unsigned int refpoc[7]; /* 2760 28 */
unsigned int refrefpoc[7][7]; /* 2788 196 */
/* --- cacheline 46 boundary (2944 bytes) was 40 bytes ago --- */
CdfThreadContext in_cdf; /* 2984 24 */
/* --- cacheline 47 boundary (3008 bytes) --- */
CdfThreadContext out_cdf; /* 3008 24 */
struct Dav1dTileGroup * tile; /* 3032 8 */
short int n_tile_data_alloc; /* 3040 2 */
short int n_tile_data; /* 3042 2 */
struct ScalableMotionParams svc[7][2]; /* 3044 56 */
/* --- cacheline 48 boundary (3072 bytes) was 28 bytes ago --- */
short int resize_step[2]; /* 3100 4 */
short int resize_start[2]; /* 3104 4 */
short int ipred_edge_sz; /* 3108 2 */
short int n_ts; /* 3110 2 */
const Dav1dContext * c; /* 3112 8 */
Dav1dTileState * ts; /* 3120 8 */
const Dav1dDSPContext * dsp; /* 3128 8 */
/* --- cacheline 49 boundary (3136 bytes) --- */
struct {
recon_b_intra_fn recon_b_intra; /* 3136 8 */
recon_b_inter_fn recon_b_inter; /* 3144 8 */
filter_sbrow_fn filter_sbrow; /* 3152 8 */
filter_sbrow_fn filter_sbrow_deblock_cols; /* 3160 8 */
filter_sbrow_fn filter_sbrow_deblock_rows; /* 3168 8 */
void (*filter_sbrow_cdef)(Dav1dTaskContext *, int); /* 3176 8 */
filter_sbrow_fn filter_sbrow_resize; /* 3184 8 */
filter_sbrow_fn filter_sbrow_lr; /* 3192 8 */
/* --- cacheline 50 boundary (3200 bytes) --- */
backup_ipred_edge_fn backup_ipred_edge; /* 3200 8 */
read_coef_blocks_fn read_coef_blocks; /* 3208 8 */
copy_pal_block_fn copy_pal_block_y; /* 3216 8 */
copy_pal_block_fn copy_pal_block_uv; /* 3224 8 */
read_pal_plane_fn read_pal_plane; /* 3232 8 */
read_pal_uv_fn read_pal_uv; /* 3240 8 */
} bd_fn; /* 3136 112 */
pixel * ipred_edge[3]; /* 3248 24 */
/* --- cacheline 51 boundary (3264 bytes) was 8 bytes ago --- */
ptrdiff_t b4_stride; /* 3272 8 */
short int w4; /* 3280 2 */
short int h4; /* 3282 2 */
short int bw; /* 3284 2 */
short int bh; /* 3286 2 */
short int sb128w; /* 3288 2 */
short int sb128h; /* 3290 2 */
short int sbh; /* 3292 2 */
short int sb_shift; /* 3294 2 */
short int sb_step; /* 3296 2 */
short int sr_sb128w; /* 3298 2 */
short int a_sz; /* 3300 2 */
short int bitdepth_max; /* 3302 2 */
uint16_t dq[8][3][2]; /* 3304 96 */
/* --- cacheline 53 boundary (3392 bytes) was 8 bytes ago --- */
const uint8_t * qm[19][3]; /* 3400 456 */
/* --- cacheline 60 boundary (3840 bytes) was 16 bytes ago --- */
BlockContext * a; /* 3856 8 */
refmvs_frame rf; /* 3864 208 */
/* --- cacheline 63 boundary (4032 bytes) was 40 bytes ago --- */
uint8_t jnt_weights[7][7]; /* 4072 49 */
/* --- cacheline 64 boundary (4096 bytes) was 25 bytes ago --- */
uint8_t gmv_warp_allowed[7]; /* 4121 7 */
struct {
atomic_int entropy_progress; /* 4128 4 */
atomic_int deblock_progress; /* 4132 4 */
atomic_uint * frame_progress; /* 4136 8 */
atomic_uint * copy_lpf_progress; /* 4144 8 */
Av1Block * b; /* 4152 8 */
/* --- cacheline 65 boundary (4160 bytes) --- */
int16_t * cbi; /* 4160 8 */
pixel * pal; /* 4168 8 */
uint8_t * pal_idx; /* 4176 8 */
coef * cf; /* 4184 8 */
unsigned int * tile_start_off; /* 4192 8 */
short int next_tile_row[2]; /* 4200 4 */
short int prog_sz; /* 4204 2 */
short int cbi_sz; /* 4206 2 */
short int cf_sz; /* 4208 2 */
short int pal_sz; /* 4210 2 */
short int pal_idx_sz; /* 4212 2 */
} frame_thread; /* 4128 88 */
/* XXX last struct has 2 bytes of padding */
struct {
uint8_t * level; /* 4216 8 */
/* --- cacheline 66 boundary (4224 bytes) --- */
Av1Filter * mask; /* 4224 8 */
Av1Restoration * lr_mask; /* 4232 8 */
short int mask_sz; /* 4240 2 */
short int lr_mask_sz; /* 4242 2 */
short int cdef_buf_plane_sz[2]; /* 4244 4 */
short int cdef_buf_sbh; /* 4248 2 */
short int lr_buf_plane_sz[2]; /* 4250 4 */
short int re_sz; /* 4254 2 */
Av1FilterLUT lim_lut; /* 4256 144 */
/* --- cacheline 68 boundary (4352 bytes) was 48 bytes ago --- */
uint8_t lvl[8][4][8][2]; /* 4400 512 */
/* --- cacheline 76 boundary (4864 bytes) was 48 bytes ago --- */
uint8_t * tx_lpf_right_edge[2]; /* 4912 16 */
/* --- cacheline 77 boundary (4928 bytes) --- */
uint8_t * cdef_line_buf; /* 4928 8 */
uint8_t * lr_line_buf; /* 4936 8 */
pixel * cdef_line[2][3]; /* 4944 48 */
/* --- cacheline 78 boundary (4992 bytes) --- */
pixel * cdef_lpf_line[3]; /* 4992 24 */
pixel * lr_lpf_line[3]; /* 5016 24 */
short int last_sharpness; /* 5040 2 */
short int start_of_tile_row_sz; /* 5042 2 */
short int need_cdef_lpf_copy; /* 5044 2 */
uint8_t restore_planes; /* 5046 1 */
/* XXX 1 byte hole, try to pack */
uint8_t * start_of_tile_row; /* 5048 8 */
/* --- cacheline 79 boundary (5056 bytes) --- */
pixel * p[3]; /* 5056 24 */
pixel * sr_p[3]; /* 5080 24 */
} lf; /* 4216 888 */
/* XXX last struct has 1 hole */
struct {
pthread_mutex_t lock; /* 5104 40 */
/* --- cacheline 80 boundary (5120 bytes) was 24 bytes ago --- */
pthread_cond_t cond; /* 5144 48 */
/* --- cacheline 81 boundary (5184 bytes) was 8 bytes ago --- */
struct TaskThreadData * ttd; /* 5192 8 */
struct Dav1dTask * tasks; /* 5200 8 */
struct Dav1dTask * tile_tasks[2]; /* 5208 16 */
struct Dav1dTask init_task; /* 5224 24 */
/* XXX last struct has 2 holes */
/* --- cacheline 82 boundary (5248 bytes) --- */
atomic_int init_done; /* 5248 4 */
atomic_int done[2]; /* 5252 8 */
atomic_int error; /* 5260 4 */
atomic_int task_counter; /* 5264 4 */
short int num_tasks; /* 5268 2 */
short int num_tile_tasks; /* 5270 2 */
short int retval; /* 5272 2 */
short int update_set; /* 5274 2 */
/* XXX 4 bytes hole, try to pack */
struct Dav1dTask * task_head; /* 5280 8 */
struct Dav1dTask * task_tail; /* 5288 8 */
struct Dav1dTask * task_cur_prev; /* 5296 8 */
struct {
pthread_mutex_t lock; /* 5304 40 */
/* --- cacheline 83 boundary (5312 bytes) was 32 bytes ago --- */
Dav1dTask * head; /* 5344 8 */
Dav1dTask * tail; /* 5352 8 */
atomic_int merge; /* 5360 4 */
} pending_tasks; /* 5304 64 */
/* XXX last struct has 4 bytes of padding */
} task_thread; /* 5104 264 */
/* XXX last struct has 1 hole */
struct FrameTileThreadData tile_thread; /* 5368 16 */
/* XXX last struct has 6 bytes of padding */
/* size: 5384, cachelines: 85, members: 55 */
/* member types with holes: 2, total: 2 */
/* paddings: 2, sum paddings: 8 */
/* last cacheline: 8 bytes */
};
Results of aligning and shrinking the structures
/* size: 5648, cachelines: 89, members: 55 */ -----> /* size: 5384, cachelines: 85, members: 55 */
That is 4 cache lines saved on a single structure. I have given only one example, the Dav1dFrameContext structure, because it would take a long time to collect how the size and the number of cache lines changed for every structure between master and my pull request (merge request).
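The cache-line counts in the pahole summaries follow from a ceiling division, sketched below. The helper name and the macro are hypothetical. The sketch assumes 64-byte cache lines and a structure that starts on a cache-line boundary.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed cache-line size; 64 bytes is typical for x86-64 but not universal. */
#define DEMO_CACHELINE_BYTES 64u

/* Number of cache lines spanned by a structure of `size` bytes that
 * starts on a cache-line boundary (ceiling division). */
static size_t demo_cachelines(size_t size, size_t line_bytes)
{
    assert(line_bytes > 0);
    return (size + line_bytes - 1) / line_bytes;
}
```

With the sizes reported above, demo_cachelines(5648, DEMO_CACHELINE_BYTES) gives 89 and demo_cachelines(5384, DEMO_CACHELINE_BYTES) gives 85, which matches the two summary lines.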
Benchmark
I use the 1080p Chimera clip (old) and Stream2_AV1_4K_22.7mbps.webm, which I converted with ffmpeg to get summer_nature_4k.ivf.
I tested with gcc version 14.2.0 (Debian 14.2.0-19), the Meson release configuration, and the -O3 optimization level.
I have a very old server with 72 threads and two processors (NUMA), and a newer 1U Supermicro with 88 threads that also has two CPUs (NUMA). I did not test on the Supermicro.
To make the measurements accurate, I used the hyperfine package with 2 warm-up runs (--warmup 2) and its default of 10 timed runs of dav1d.
P.S. about hyperfine: building tools in Rust that make working with C/C++ projects more convenient (as is done in Python) is more useful than trying to replace an already working solution like dav1d (or any FOSS infrastructure) that has stood the test of time, has been tested by many contributors, and has had most of its vulnerabilities closed. New projects will have performance and vulnerability issues too. There is no need to waste time reinventing the wheel; there are much more useful things to do.
Master
debian@lenovo:~/GIT/dav1d/buildDir$ hyperfine --warmup 2 "tools/dav1d -q -i ~/GIT/dav1d/summer_nature_4k.ivf -o /dev/null --threads 72"
Benchmark 1: tools/dav1d -q -i ~/GIT/dav1d/summer_nature_4k.ivf -o /dev/null --threads 72
Time (mean ± σ): 30.603 s ± 0.445 s [User: 728.285 s, System: 18.614 s]
Range (min … max): 30.026 s … 31.233 s 10 runs
debian@lenovo:~/GIT/dav1d/buildDir$ hyperfine --warmup 2 "tools/dav1d -q -i ~/GIT/dav1d/Chimera-AV1-8bit-1920x1080-6736kbps.ivf -o /dev/null --threads 72"
Benchmark 1: tools/dav1d -q -i ~/GIT/dav1d/Chimera-AV1-8bit-1920x1080-6736kbps.ivf -o /dev/null --threads 72
Time (mean ± σ): 46.421 s ± 0.411 s [User: 562.544 s, System: 13.866 s]
Range (min … max): 45.373 s … 46.789 s 10 runs
PR
debian@lenovo:~/GIT/dav1d/buildDir$ hyperfine --warmup 2 "tools/dav1d -q -i ~/GIT/dav1d/summer_nature_4k.ivf -o /dev/null --threads 72"
Benchmark 1: tools/dav1d -q -i ~/GIT/dav1d/summer_nature_4k.ivf -o /dev/null --threads 72
Time (mean ± σ): 30.447 s ± 0.384 s [User: 725.203 s, System: 19.300 s]
Range (min … max): 29.819 s … 30.988 s 10 runs
debian@lenovo:~/GIT/dav1d/buildDir$ hyperfine --warmup 2 "tools/dav1d -q -i ~/GIT/dav1d/Chimera-AV1-8bit-1920x1080-6736kbps.ivf -o /dev/null --threads 72"
Benchmark 1: tools/dav1d -q -i ~/GIT/dav1d/Chimera-AV1-8bit-1920x1080-6736kbps.ivf -o /dev/null --threads 72
Time (mean ± σ): 45.103 s ± 0.268 s [User: 555.466 s, System: 14.086 s]
Range (min … max): 44.777 s … 45.498 s 10 runs
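The relative change between the two builds can be worked out from the hyperfine mean times above. The helper below is a hypothetical sketch of that arithmetic, not part of dav1d or hyperfine.

```c
#include <assert.h>

/* Relative wall-clock improvement in percent, given the mean time of the
 * baseline build and of the changed build (both in seconds). */
static double demo_improvement_percent(double before_s, double after_s)
{
    assert(before_s > 0.0);
    return (before_s - after_s) / before_s * 100.0;
}
```

For the Chimera 1080p clip, demo_improvement_percent(46.421, 45.103) is about 2.8%. For the 4K clip, demo_improvement_percent(30.603, 30.447) is about 0.5%. The 0.156 s difference there is smaller than the standard deviation reported for either run (0.445 s and 0.384 s), so the 4K figure should be read with that noise in mind.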
References:
https://hpc.rz.rptu.de/Tutorials/AVX/alignment.shtml
https://wr.informatik.uni-hamburg.de/_media/teaching/wintersemester_2013_2014/epc-14-haase-svenhendrik-alignmentinc-presentation.pdf
https://en.wikipedia.org/wiki/Data_structure_alignment
https://stackoverflow.com/a/20882083
https://zijishi.xyz/post/optimization-technique/learning-to-use-data-alignment/