
Conversation

@shikharish
Contributor

Related to #226.

@shikharish
Contributor Author

shikharish commented Dec 22, 2025

First implementation.
The following points need to be addressed:

  • need to make sure std::memcpy() is always valid (currently it is UB)
  • the current code only works on little-endian targets (see the sketch below)
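
A minimal sketch of how the byte-order issue could be handled, assuming GCC/Clang-style __BYTE_ORDER__ macros (illustrative only, not the PR's code; the 4-readable-bytes requirement from the first point still applies):

  #include <cstdint>
  #include <cstring>

  // Hypothetical helper: load 4 bytes starting at p. The caller must still
  // guarantee that 4 bytes are readable, which is the UB concern above.
  inline std::uint32_t read4_le(const char *p) noexcept {
    std::uint32_t w;
    std::memcpy(&w, p, sizeof(w)); // well defined for trivially copyable types
  #if defined(__BYTE_ORDER__) && (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
    // Normalize so p[0] always lands in the least significant byte, making
    // the SWAR arithmetic independent of the host byte order.
    w = (w >> 24) | ((w >> 8) & 0x0000FF00u) | ((w << 8) & 0x00FF0000u) | (w << 24);
  #endif
    return w;
  }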

I have added a separate benchmark, which I will merge into the main benchmark later.
The benchmark results vary significantly with compiler and architecture.

Apple M1 (8) @ 3.20 GHz using clang (arm64)

loaded db: a14 (Apple A14/M1)
parse_ip_fromchars                       :   0.23 GB/s   15.9 Ma/s  62.78 ns/d   1.96 GHz  122.81 c/d  372.87 i/d   8.60 c/b  26.11 i/b   3.04 i/c 
parse_ip_fastswar                        :   0.38 GB/s   26.3 Ma/s  37.96 ns/d   1.96 GHz  74.25 c/d  186.00 i/d   5.20 c/b  13.02 i/b   2.51 i/c 
sink=3767029944

Apple M1 (8) @ 3.20 GHz using gcc (arm64)

loaded db: a14 (Apple A14/M1)
parse_ip_fromchars                       :   0.41 GB/s   28.6 Ma/s  34.95 ns/d   1.96 GHz  68.43 c/d  224.40 i/d   4.79 c/b  15.71 i/b   3.28 i/c 
parse_ip_fastswar                        :   0.36 GB/s   25.2 Ma/s  39.64 ns/d   1.96 GHz  77.54 c/d  228.47 i/d   5.43 c/b  16.00 i/b   2.95 i/c 
sink=3738477744

Intel(R) Core(TM) i7-5500U (4) @ 3.00 GHz using clang (x86_64)

parse_ip_fromchars                       :   0.40 GB/s   27.7 Ma/s  36.07 ns/d   2.68 GHz  96.82 c/d  222.65 i/d   6.78 c/b  15.59 i/b   2.30 i/c 
parse_ip_fastswar                        :   0.42 GB/s   29.1 Ma/s  34.34 ns/d   2.68 GHz  92.05 c/d  223.86 i/d   6.45 c/b  15.68 i/b   2.43 i/c 
sink=3738477744

Intel(R) Core(TM) i7-5500U (4) @ 3.00 GHz using gcc (x86_64)

parse_ip_fromchars                       :   0.38 GB/s   26.4 Ma/s  37.85 ns/d   2.68 GHz  101.25 c/d  231.36 i/d   7.09 c/b  16.20 i/b   2.29 i/c 
parse_ip_fastswar                        :   0.43 GB/s   30.1 Ma/s  33.23 ns/d   2.68 GHz  89.05 c/d  185.92 i/d   6.24 c/b  13.02 i/b   2.09 i/c 
sink=3738477744

@shikharish
Contributor Author

Requesting review from @lemire.

@shikharish
Contributor Author

A constexpr branch should be efficient and sufficient for handling uint8_t.
What could perhaps be improved are the separate branches for nd == 0 and nd > 3; this is the best I could come up with. When I tried to minimize branches, the code sometimes performed significantly worse.
Please review.
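
For reference, a rough sketch of the branch structure in question, assuming the digit values and the digit count nd have already been extracted from the SWAR word (names are illustrative, not the PR's code):

  #include <cstddef>
  #include <cstdint>

  // Hypothetical outline: combine up to 3 extracted decimal digits into a
  // uint8_t, rejecting nd == 0, nd > 3, and values above 255.
  inline bool combine_uint8(const std::uint8_t digits[3], std::size_t nd,
                            std::uint8_t &out) noexcept {
    if (nd == 0 || nd > 3) {
      return false; // empty field, or too many digits to fit in uint8_t
    }
    std::uint32_t v = 0;
    for (std::size_t i = 0; i < nd; ++i) {
      v = v * 10 + digits[i]; // digits are most-significant first
    }
    if (v > 255) {
      return false; // e.g. "999" is out of range
    }
    out = static_cast<std::uint8_t>(v);
    return true;
  }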

@lemire
Member

lemire commented Dec 22, 2025

I recommend adopting your benchmark immediately: #350. It does not use your SWAR approach; we just benchmark fast_float vs. the standard.

For the memcpy, it is not usable in a constexpr context, but bit_cast is.
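
A minimal sketch of the difference, assuming C++20 is available (illustrative names, not fast_float's internals):

  #include <array>
  #include <bit>
  #include <cstdint>
  #include <cstring>

  // std::memcpy cannot appear in a constant expression, but std::bit_cast
  // (C++20) can, so a constexpr byte-to-word load has to go through it.
  constexpr std::uint32_t word_from_bytes(std::array<char, 4> bytes) noexcept {
    return std::bit_cast<std::uint32_t>(bytes); // usable at compile time
  }

  inline std::uint32_t word_from_bytes_runtime(const char *p) noexcept {
    std::uint32_t w;
    std::memcpy(&w, p, sizeof(w)); // fine at runtime, not in constexpr
    return w;
  }

  static_assert(word_from_bytes({'1', '2', '7', '.'}) != 0);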

@shikharish
Contributor Author

I recommend adopting your benchmark immediately: #350. It does not use your SWAR approach; we just benchmark fast_float vs. the standard.

Alright. I will rebase this branch after the PR gets merged.

For the memcpy, it is not usable in a constexpr context, but bit_cast is.

Ah, we only choose the branch at compile time. I should instead do:

  constexpr bool is_uint8 = std::is_same_v<T, std::uint8_t>;

  if (is_uint8) {
    const size_t len = (size_t)(pend - p);
...

About the memcpy UB: I cannot find a fast (and generally applicable) way to fix this. Adding a branch on len measurably slows the hot path, and we cannot assume the buffer will be padded properly.

Do we document a precondition that at least 4 bytes are readable from p, otherwise it is not safe for uint8_t?
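
For illustration, the kind of bounded load that would avoid such a precondition, at the cost of the length-dependent work mentioned above (a sketch only, not a proposal that was adopted):

  #include <algorithm>
  #include <cstddef>
  #include <cstdint>
  #include <cstring>

  // Hypothetical bounded load: copy only the bytes that exist and zero-fill
  // the rest. It never reads past pend, but it reintroduces the
  // length-dependent work that was measured to slow the hot path.
  inline std::uint32_t read_up_to_4(const char *p, const char *pend) noexcept {
    std::uint32_t w = 0;
    const std::size_t len =
        std::min<std::size_t>(4, static_cast<std::size_t>(pend - p));
    std::memcpy(&w, p, len); // stays inside [p, pend)
    return w;
  }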

@lemire
Member

lemire commented Dec 22, 2025

@shikharish

Ah, we only choose the branch at compile time. I should instead do

I think that this should be

if constexpr (is_uint8) {

although some care is needed not to break backward compatibility with earlier versions of C++.
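
A minimal sketch of one way to guard it, using the standard __cpp_if_constexpr feature-test macro (illustrative; fast_float may use its own macros):

  #include <cstdint>
  #include <type_traits>

  template <typename T>
  void parse_unsigned(const char *p, const char *pend) {
    constexpr bool is_uint8 = std::is_same<T, std::uint8_t>::value;
  #if defined(__cpp_if_constexpr)
    if constexpr (is_uint8) {
      // uint8_t fast path; the other branch is never instantiated
    } else {
      // generic path
    }
  #else
    // Pre-C++17: a plain if on a constexpr condition. The optimizer drops the
    // dead branch, but both branches must still compile for every T.
    if (is_uint8) {
      // uint8_t fast path
    } else {
      // generic path
    }
  #endif
    (void)p;
    (void)pend;
  }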

About the memcpy UB: I cannot find a fast (and generally applicable) way to fix this. Adding a branch on len measurably slows the hot path, and we cannot assume the buffer will be padded properly. Do we document a precondition that at least 4 bytes are readable from p, otherwise it is not safe for uint8_t?

No. We don't do that. We can get creative, but we don't want to read beyond the buffer in general as this might cause a fatal crash in some instances.

@shikharish
Contributor Author

shikharish commented Dec 23, 2025

There is an issue with the benchmarking library: it is adding too much overhead (due to templating, I think).

❯ sudo ./build/benchmarks/bench_ip
parse_ip_std_fromchars                   :   0.21 GB/s   14.6 Ma/s  68.67 ns/d   1.96 GHz  134.33 c/d  488.66 i/d   9.41 c/b  34.21 i/b   3.64 i/c 
parse_ip_fastfloat                       :   0.22 GB/s   15.5 Ma/s  64.54 ns/d   1.96 GHz  126.19 c/d  358.92 i/d   8.84 c/b  25.13 i/b   2.84 i/c 
sink=3749209294

This is inaccurate compared to the actual raw speed, so I wrote a simple benchmark that just measures throughput:

❯ ./simple_bench
std::from_chars                :  0.25 GB/s   56.8 ns/d
fast_float::from_chars         :  0.38 GB/s   37.9 ns/d
sink=3763786764
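
For context, a throughput-only loop along these lines (a hedged sketch with an illustrative interface, not the actual simple_bench):

  #include <chrono>
  #include <cstddef>
  #include <cstdint>
  #include <cstdio>
  #include <string>
  #include <vector>

  // Minimal throughput measurement: parse every input string many times and
  // report GB/s over the total number of bytes fed to the parser. parse_one
  // stands in for the function under test and returns the parsed value so the
  // sink prevents the work from being optimized away.
  template <typename Parser>
  double gigabytes_per_second(const std::vector<std::string> &inputs,
                              Parser parse_one) {
    std::size_t bytes = 0;
    std::uint64_t sink = 0;
    const auto start = std::chrono::steady_clock::now();
    for (int rep = 0; rep < 1000; ++rep) {
      for (const std::string &s : inputs) {
        sink += parse_one(s);
        bytes += s.size();
      }
    }
    const auto stop = std::chrono::steady_clock::now();
    const double seconds = std::chrono::duration<double>(stop - start).count();
    std::printf("sink=%llu\n", static_cast<unsigned long long>(sink));
    return (static_cast<double>(bytes) / 1e9) / seconds;
  }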

shikharish marked this pull request as ready for review December 23, 2025 00:44
@lemire
Member

lemire commented Dec 23, 2025

@shikharish Please see #351
