-
Notifications
You must be signed in to change notification settings - Fork 11
Improve speed of JPEG reading #131
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
Comments suppressed due to low confidence (6)
src/jpeg/bit_reader.rs:62
- [nitpick] The change from returning u16 to u32 in the read function should be reviewed for consistency and correctness.
pub fn read(&mut self, bits_to_read: u32) -> Result<u32> {
src/jpeg/bit_reader.rs:62
- Ensure that the new behavior introduced in the read function is covered by tests.
pub fn read(&mut self, bits_to_read: u32) -> Result<u32> {
src/helpers.rs:105
- The condition
(value << 1) > shiftedmight not correctly handle edge cases wherevalueis close to the maximum value foru32. Consider usingvalue & (shifted >> 1) != 0as in the original code.
if (value << 1) > shifted {
src/jpeg/jpeg_read.rs:687
- The conversion of the result of
bit_reader.read(1)?tousizeshould be carefully reviewed to ensure it does not introduce any issues.
node = ctree.node[usize::from(node)][bit_reader.read(1)? as usize];
src/jpeg/jpeg_read.rs:903
- The change from
u16tou32in thedecode_eobrun_bitsfunction should be reviewed to ensure it does not introduce any issues.
fn decode_eobrun_bits(s: u8, n: u32) -> u32 {
src/jpeg/jpeg_read.rs:722
- The conversion of
literal_bitstou32in theread_dcfunction should be reviewed to ensure it does not introduce any issues.
let value = bit_reader.read(literal_bits)?;
| /// with as many bits as possible, leaving the possibility of unwinding via the undo_read_ahead | ||
| /// function in certain corner cases. | ||
| #[inline(never)] | ||
| #[cold] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this really helps with performance? By the rule of thumb refill should be a bit too frequent to being cold. I'll check on my box.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The tradeoff was with having more registers free for the inner loop, but it's possible that it's faster without the cold. fill_register is definitely cold since it gets called almost never.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With cold:
ivan@ivan-5950:~/lepton_jpeg_rust$ sudo rm images/img_52MP_7k2.lep; sudo perf stat -B -e cache-references,cache-misses,cycles,ic_fetch_stall.ic_stall_back_pressure,stalled-cycles-frontend,instructions,branch-instructions,branch-misses,ic_fetch_stall.ic_stall_any,ic_fetch_stall.ic_stall_dq_empty,l2_cache_misses_from_ic_miss,l2_latency.l2_cycles_waiting_on_fills,faults,migrations taskset -c 10 nice -n -20 target/release/lepton_jpeg_util images/img_52MP_7k.jpg images/img_52MP_7k2.lep
2025-01-12T14:26:43.217Z INFO [lepton_jpeg::structs::lepton_file_writer] compressing to Lepton format
2025-01-12T14:26:43.583Z INFO [lepton_jpeg::structs::lepton_file_writer] Number of threads: 8
2025-01-12T14:26:44.712Z INFO [lepton_jpeg::structs::lepton_file_writer] worker threads 7732ms of CPU time in 1127ms of wall time
2025-01-12T14:26:44.712Z INFO [lepton_jpeg::structs::lepton_file_writer] decompressing to verify contents
2025-01-12T14:26:46.304Z INFO [lepton_jpeg_util] compressed input 22171278, output 17324074 bytes (compression = 28.0%)
2025-01-12T14:26:46.304Z INFO [lepton_jpeg_util] Main thread CPU: 3095ms, Worker thread CPU: 18774 ms, walltime: 3095 ms
Performance counter stats for 'taskset -c 10 nice -n -20 target/release/lepton_jpeg_util images/img_52MP_7k.jpg images/img_52MP_7k2.lep':
857 527 770 cache-references (25,32%)
76 481 649 cache-misses # 8,92% of all cache refs (25,47%)
14 095 457 313 cycles (25,47%)
865 595 905 ic_fetch_stall.ic_stall_back_pressure (25,51%)
935 546 969 stalled-cycles-frontend # 6,64% frontend cycles idle (25,31%)
34 775 075 664 instructions # 2,47 insn per cycle
# 0,03 stalled cycles per insn (25,24%)
3 665 427 290 branch-instructions (25,05%)
140 713 890 branch-misses # 3,84% of all branches (25,07%)
5 325 860 720 ic_fetch_stall.ic_stall_any (25,28%)
39 528 616 ic_fetch_stall.ic_stall_dq_empty (25,38%)
62 527 706 l2_cache_misses_from_ic_miss (25,24%)
1 842 115 928 l2_latency.l2_cycles_waiting_on_fills (25,12%)
188 277 faults
1 migrations
3,129010978 seconds time elapsed
2,821312000 seconds user
0,302604000 seconds sys
Without cold:
ivan@ivan-5950:~/lepton_jpeg_rust$ sudo rm images/img_52MP_7k2.lep; sudo perf stat -B -e cache-references,cache-misses,cycles,ic_fetch_stall.ic_stall_back_pressure,stalled-cycles-frontend,instructions,branch-instructions,branch-misses,ic_fetch_stall.ic_stall_any,ic_fetch_stall.ic_stall_dq_empty,l2_cache_misses_from_ic_miss,l2_latency.l2_cycles_waiting_on_fills,faults,migrations taskset -c 10 nice -n -20 target/release/lepton_jpeg_util images/img_52MP_7k.jpg images/img_52MP_7k2.lep
2025-01-12T14:25:52.492Z INFO [lepton_jpeg::structs::lepton_file_writer] compressing to Lepton format
2025-01-12T14:25:52.974Z INFO [lepton_jpeg::structs::lepton_file_writer] Number of threads: 8
2025-01-12T14:25:54.453Z INFO [lepton_jpeg::structs::lepton_file_writer] worker threads 10251ms of CPU time in 1478ms of wall time
2025-01-12T14:25:54.453Z INFO [lepton_jpeg::structs::lepton_file_writer] decompressing to verify contents
2025-01-12T14:25:56.028Z INFO [lepton_jpeg_util] compressed input 22171278, output 17324074 bytes (compression = 28.0%)
2025-01-12T14:25:56.028Z INFO [lepton_jpeg_util] Main thread CPU: 3547ms, Worker thread CPU: 21139 ms, walltime: 3547 ms
Performance counter stats for 'taskset -c 10 nice -n -20 target/release/lepton_jpeg_util images/img_52MP_7k.jpg images/img_52MP_7k2.lep':
854 375 836 cache-references (25,26%)
74 116 401 cache-misses # 8,67% of all cache refs (25,55%)
14 048 593 475 cycles (25,55%)
487 211 314 ic_fetch_stall.ic_stall_back_pressure (25,48%)
869 359 514 stalled-cycles-frontend # 6,19% frontend cycles idle (25,31%)
35 076 159 662 instructions # 2,50 insn per cycle
# 0,02 stalled cycles per insn (25,18%)
3 638 127 413 branch-instructions (25,14%)
140 480 887 branch-misses # 3,86% of all branches (25,13%)
5 203 535 058 ic_fetch_stall.ic_stall_any (25,11%)
36 559 121 ic_fetch_stall.ic_stall_dq_empty (24,97%)
61 473 795 l2_cache_misses_from_ic_miss (24,92%)
1 770 575 776 l2_latency.l2_cycles_waiting_on_fills (24,98%)
188 278 faults
1 migrations
3,585066548 seconds time elapsed
3,233798000 seconds user
0,346549000 seconds sys
|
Nice work, congrats to speed up devli more! |
|
But main is faster (all compiled on Similar results I got on my Intel laptop. Maybe you use different RUST version? UPD: The same on |
Test Results8 tests - 163 7 ✅ - 164 1s ⏱️ - 19m 7s For more details on these failures, see this check. Results for commit 2c98bb2. ± Comparison against base commit 621555c. This pull request removes 163 tests. |
Improve speed of JPEG reading