Potential speedups for HTML::Entities::encode_entities

I think we could get some speedups in `encode_entities` by caching some common operations.  The examples below would most benefit users encoding lots of data that is heavy on non-named, numeric entities. We may have similar benefits elsewhere.

The first is to cache the results of the `sprintf` in `num_entity`. This could be done with no effect on behavior, in exchange for some hash entries. Here are my experiments.

```
use Benchmark ':all';
use HTML::Entities;

timethese( 300_000,
    {
        original => sub {
            map { HTML::Entities::num_entity($_) } 1..100
        },
        cached => sub {
            map { cached_num_entity($_) } 1..100
        },
    }
);


my %cache;
sub cached_num_entity {
    return $cache{$_[0]} ||= sprintf("&#x%X;", ord($_[0]));
}
```

gives these results:

```
    cached: 10 wallclock secs (10.52 usr +  0.00 sys = 10.52 CPU) @ 28517.11/s (n=300000)
  original: 15 wallclock secs (14.28 usr +  0.00 sys = 14.28 CPU) @ 21008.40/s (n=300000)
```

Bottom line: Hash lookup is faster than the `sprintf`, so let's cache it.

---

The other tweak would be to cache the call to `num_entity` inside the main regex in `encode_entities`.  Swap this:

    $$ref =~ s/([^\n\r\t !\#\$%\(-;=?-~])/$char2entity{$1} || num_entity($1)/ge;

for this

    $$ref =~ s/([^\n\r\t !\#\$%\(-;=?-~])/$char2entity{$1} ||= num_entity($1)/ge;

This would have the side effect of modifying the `%char2entity` hash, which is visible to the outside world. If that wasn't OK, we could have a private copy of the hash specifically so it would be modifiable. The potential downside (or upside?) of that would be that if someone outside the module modified `%char2entity`, it would have no effect on `encode_entities`.

For benchmarking `encode_entities`, I used this:

```
my $ent = chr(8855);
my $num = chr(8854);
my $unencoded = "$ent$num" x 10;

my $text = <<"HTML";
text in "$unencoded"
HTML

timethese( 1_000_000,
    {
        encode => sub { my $x = encode_entities($text) },
    }
);
```

Results:

42,281/s for the original unmodified `encode_entities`.

52,746/s if the `encode_entities` used the caching `num_entity` first mentioned, but the main regex is unchanged.

64,769/s if the main conversion regex caches the results of calls to `num_entity` in `%char2entity`. Changing this to call the caching `num_entity` gave no noticeable improvement.

----

I hope these give some ideas. `encode_entities` is an absolute workhorse at my job (we generate everything with Template Toolkit), and I'm sure for many many others. Any speedup would have wide-ranging benefits.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Potential speedups for HTML::Entities::encode_entities #30

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Potential speedups for HTML::Entities::encode_entities #30

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions