Skip to content

Conversation

@mcroomp
Copy link
Collaborator

@mcroomp mcroomp commented Jan 5, 2025

Improve speed of JPEG reading

  • Best devli ever (sorry @Melirius ) (also helps that (1 << bits) - 1 is recycled from BitReader::read)
  • Consistently use 32 bit math to avoid unnecessary widen/narrow conversions
  • Optimistically fill 8 byte fill register so that branching generally only happens in one location
  • Use LeptonError in BitReader to avoid unnecessary error conversions, since LeptonError is as small as std::io::Error now

@mcroomp mcroomp requested review from Melirius and Copilot January 5, 2025 17:17
@mcroomp mcroomp changed the title Devli Improve speed of JPEG reading Jan 5, 2025
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

Comments suppressed due to low confidence (6)

src/jpeg/bit_reader.rs:62

  • [nitpick] The change from returning u16 to u32 in the read function should be reviewed for consistency and correctness.
pub fn read(&mut self, bits_to_read: u32) -> Result<u32> {

src/jpeg/bit_reader.rs:62

  • Ensure that the new behavior introduced in the read function is covered by tests.
pub fn read(&mut self, bits_to_read: u32) -> Result<u32> {

src/helpers.rs:105

  • The condition (value << 1) > shifted might not correctly handle edge cases where value is close to the maximum value for u32. Consider using value & (shifted >> 1) != 0 as in the original code.
if (value << 1) > shifted {

src/jpeg/jpeg_read.rs:687

  • The conversion of the result of bit_reader.read(1)? to usize should be carefully reviewed to ensure it does not introduce any issues.
node = ctree.node[usize::from(node)][bit_reader.read(1)? as usize];

src/jpeg/jpeg_read.rs:903

  • The change from u16 to u32 in the decode_eobrun_bits function should be reviewed to ensure it does not introduce any issues.
fn decode_eobrun_bits(s: u8, n: u32) -> u32 {

src/jpeg/jpeg_read.rs:722

  • The conversion of literal_bits to u32 in the read_dc function should be reviewed to ensure it does not introduce any issues.
let value = bit_reader.read(literal_bits)?;

/// with as many bits as possible, leaving the possibility of unwinding via the undo_read_ahead
/// function in certain corner cases.
#[inline(never)]
#[cold]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this really helps with performance? By the rule of thumb refill should be a bit too frequent to being cold. I'll check on my box.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tradeoff was with having more registers free for the inner loop, but it's possible that it's faster without the cold. fill_register is definitely cold since it gets called almost never.

Copy link
Collaborator

@Melirius Melirius Jan 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With cold:

ivan@ivan-5950:~/lepton_jpeg_rust$ sudo rm images/img_52MP_7k2.lep; sudo perf stat -B -e cache-references,cache-misses,cycles,ic_fetch_stall.ic_stall_back_pressure,stalled-cycles-frontend,instructions,branch-instructions,branch-misses,ic_fetch_stall.ic_stall_any,ic_fetch_stall.ic_stall_dq_empty,l2_cache_misses_from_ic_miss,l2_latency.l2_cycles_waiting_on_fills,faults,migrations taskset -c 10 nice -n -20 target/release/lepton_jpeg_util images/img_52MP_7k.jpg images/img_52MP_7k2.lep
2025-01-12T14:26:43.217Z INFO  [lepton_jpeg::structs::lepton_file_writer] compressing to Lepton format
2025-01-12T14:26:43.583Z INFO  [lepton_jpeg::structs::lepton_file_writer] Number of threads: 8
2025-01-12T14:26:44.712Z INFO  [lepton_jpeg::structs::lepton_file_writer] worker threads 7732ms of CPU time in 1127ms of wall time
2025-01-12T14:26:44.712Z INFO  [lepton_jpeg::structs::lepton_file_writer] decompressing to verify contents
2025-01-12T14:26:46.304Z INFO  [lepton_jpeg_util] compressed input 22171278, output 17324074 bytes (compression = 28.0%)
2025-01-12T14:26:46.304Z INFO  [lepton_jpeg_util] Main thread CPU: 3095ms, Worker thread CPU: 18774 ms, walltime: 3095 ms

 Performance counter stats for 'taskset -c 10 nice -n -20 target/release/lepton_jpeg_util images/img_52MP_7k.jpg images/img_52MP_7k2.lep':

       857 527 770      cache-references                                                        (25,32%)
        76 481 649      cache-misses                     #    8,92% of all cache refs           (25,47%)
    14 095 457 313      cycles                                                                  (25,47%)
       865 595 905      ic_fetch_stall.ic_stall_back_pressure                                        (25,51%)
       935 546 969      stalled-cycles-frontend          #    6,64% frontend cycles idle        (25,31%)
    34 775 075 664      instructions                     #    2,47  insn per cycle            
                                                  #    0,03  stalled cycles per insn     (25,24%)
     3 665 427 290      branch-instructions                                                     (25,05%)
       140 713 890      branch-misses                    #    3,84% of all branches             (25,07%)
     5 325 860 720      ic_fetch_stall.ic_stall_any                                             (25,28%)
        39 528 616      ic_fetch_stall.ic_stall_dq_empty                                        (25,38%)
        62 527 706      l2_cache_misses_from_ic_miss                                            (25,24%)
     1 842 115 928      l2_latency.l2_cycles_waiting_on_fills                                        (25,12%)
           188 277      faults                                                                
                 1      migrations                                                            

       3,129010978 seconds time elapsed

       2,821312000 seconds user
       0,302604000 seconds sys

Without cold:

ivan@ivan-5950:~/lepton_jpeg_rust$ sudo rm images/img_52MP_7k2.lep; sudo perf stat -B -e cache-references,cache-misses,cycles,ic_fetch_stall.ic_stall_back_pressure,stalled-cycles-frontend,instructions,branch-instructions,branch-misses,ic_fetch_stall.ic_stall_any,ic_fetch_stall.ic_stall_dq_empty,l2_cache_misses_from_ic_miss,l2_latency.l2_cycles_waiting_on_fills,faults,migrations taskset -c 10 nice -n -20 target/release/lepton_jpeg_util images/img_52MP_7k.jpg images/img_52MP_7k2.lep
2025-01-12T14:25:52.492Z INFO  [lepton_jpeg::structs::lepton_file_writer] compressing to Lepton format
2025-01-12T14:25:52.974Z INFO  [lepton_jpeg::structs::lepton_file_writer] Number of threads: 8
2025-01-12T14:25:54.453Z INFO  [lepton_jpeg::structs::lepton_file_writer] worker threads 10251ms of CPU time in 1478ms of wall time
2025-01-12T14:25:54.453Z INFO  [lepton_jpeg::structs::lepton_file_writer] decompressing to verify contents
2025-01-12T14:25:56.028Z INFO  [lepton_jpeg_util] compressed input 22171278, output 17324074 bytes (compression = 28.0%)
2025-01-12T14:25:56.028Z INFO  [lepton_jpeg_util] Main thread CPU: 3547ms, Worker thread CPU: 21139 ms, walltime: 3547 ms

 Performance counter stats for 'taskset -c 10 nice -n -20 target/release/lepton_jpeg_util images/img_52MP_7k.jpg images/img_52MP_7k2.lep':

       854 375 836      cache-references                                                        (25,26%)
        74 116 401      cache-misses                     #    8,67% of all cache refs           (25,55%)
    14 048 593 475      cycles                                                                  (25,55%)
       487 211 314      ic_fetch_stall.ic_stall_back_pressure                                        (25,48%)
       869 359 514      stalled-cycles-frontend          #    6,19% frontend cycles idle        (25,31%)
    35 076 159 662      instructions                     #    2,50  insn per cycle            
                                                  #    0,02  stalled cycles per insn     (25,18%)
     3 638 127 413      branch-instructions                                                     (25,14%)
       140 480 887      branch-misses                    #    3,86% of all branches             (25,13%)
     5 203 535 058      ic_fetch_stall.ic_stall_any                                             (25,11%)
        36 559 121      ic_fetch_stall.ic_stall_dq_empty                                        (24,97%)
        61 473 795      l2_cache_misses_from_ic_miss                                            (24,92%)
     1 770 575 776      l2_latency.l2_cycles_waiting_on_fills                                        (24,98%)
           188 278      faults                                                                
                 1      migrations                                                            

       3,585066548 seconds time elapsed

       3,233798000 seconds user
       0,346549000 seconds sys

@Melirius
Copy link
Collaborator

Melirius commented Jan 6, 2025

Nice work, congrats to speed up devli more!

@Melirius
Copy link
Collaborator

Melirius commented Jan 12, 2025

But main is faster (all compiled on rustc 1.81.0 (eeb90cda1 2024-09-04) with RUSTFLAGS='-C target-feature=+avx2 -C target-feature=+lzcnt' cargo build --release):

ivan@ivan-5950:~/lepton_jpeg_rust$ sudo rm images/img_52MP_7k2.lep; sudo perf stat -B -e cache-references,cache-misses,cycles,ic_fetch_stall.ic_stall_back_pressure,stalled-cycles-frontend,instructions,branch-instructions,branch-misses,ic_fetch_stall.ic_stall_any,ic_fetch_stall.ic_stall_dq_empty,l2_cache_misses_from_ic_miss,l2_latency.l2_cycles_waiting_on_fills,faults,migrations taskset -c 10 nice -n -20 target/release/lepton_jpeg_util images/img_52MP_7k.jpg images/img_52MP_7k2.lep
2025-01-12T14:30:24.018Z INFO  [lepton_jpeg::structs::lepton_file_writer] compressing to Lepton format
2025-01-12T14:30:24.455Z INFO  [lepton_jpeg::structs::lepton_file_writer] Number of threads: 8
2025-01-12T14:30:25.763Z INFO  [lepton_jpeg::structs::lepton_file_writer] worker threads 9133ms of CPU time in 1306ms of wall time
2025-01-12T14:30:25.763Z INFO  [lepton_jpeg::structs::lepton_file_writer] decompressing to verify contents
2025-01-12T14:30:27.342Z INFO  [lepton_jpeg_util] compressed input 22171278, output 17324074 bytes (compression = 28.0%)
2025-01-12T14:30:27.342Z INFO  [lepton_jpeg_util] Main thread CPU: 3335ms, Worker thread CPU: 20100 ms, walltime: 3335 ms

 Performance counter stats for 'taskset -c 10 nice -n -20 target/release/lepton_jpeg_util images/img_52MP_7k.jpg images/img_52MP_7k2.lep':

       858 748 386      cache-references                                                        (25,06%)
        74 830 355      cache-misses                     #    8,71% of all cache refs           (24,91%)
    13 836 453 808      cycles                                                                  (24,97%)
       777 193 113      ic_fetch_stall.ic_stall_back_pressure                                        (25,21%)
       920 473 108      stalled-cycles-frontend          #    6,65% frontend cycles idle        (25,37%)
    35 558 555 176      instructions                     #    2,57  insn per cycle            
                                                  #    0,03  stalled cycles per insn     (25,47%)
     3 646 640 572      branch-instructions                                                     (25,51%)
       129 001 386      branch-misses                    #    3,54% of all branches             (25,43%)
     4 922 193 512      ic_fetch_stall.ic_stall_any                                             (25,37%)
        33 547 664      ic_fetch_stall.ic_stall_dq_empty                                        (25,23%)
        64 219 569      l2_cache_misses_from_ic_miss                                            (25,19%)
     1 720 951 168      l2_latency.l2_cycles_waiting_on_fills                                        (25,14%)
           188 227      faults                                                                
                 1      migrations                                                            

       3,373385487 seconds time elapsed

       3,000880000 seconds user
       0,369615000 seconds sys

Similar results I got on my Intel laptop. Maybe you use different RUST version?

UPD: The same on rustc 1.84.0 (9fc6b4312 2025-01-07)

@github-actions
Copy link

Test Results

8 tests   - 163   7 ✅  - 164   1s ⏱️ - 19m 7s
1 suites  -   2   0 💤 ±  0 
1 files   ±  0   1 ❌ +  1 

For more details on these failures, see this check.

Results for commit 2c98bb2. ± Comparison against base commit 621555c.

This pull request removes 163 tests.
lepton_jpeg ‑ jpeg::jpeg_read::tests::test_benchmark_read_block
lepton_jpeg ‑ jpeg::jpeg_read::tests::test_benchmark_read_jpeg
lepton_jpeg ‑ jpeg::jpeg_write::tests::roundtrip_baseline_jpeg
lepton_jpeg ‑ jpeg::jpeg_write::tests::roundtrip_progressive_jpeg
lepton_jpeg ‑ jpeg::jpeg_write::tests::test_benchmark_write_block
lepton_jpeg ‑ jpeg::jpeg_write::tests::test_benchmark_write_jpeg
lepton_jpeg ‑ jpeg::jpeg_write::tests::test_encode_block_long_zero_cnt
lepton_jpeg ‑ jpeg::jpeg_write::tests::test_encode_block_magnitude
lepton_jpeg ‑ jpeg::jpeg_write::tests::test_encode_block_seq
lepton_jpeg ‑ jpeg::jpeg_write::tests::test_encode_block_seq_zero
…

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants