I think we could get some speedups in encode_entities by caching some common operations. The examples below would most benefit users encoding lots of data that is heavy on non-named, numeric entities. We may have similar benefits elsewhere.
The first is to cache the results of the sprintf in num_entity. This could be done with no effect on behavior, in exchange for some hash entries. Here are my experiments.
use Benchmark ':all';
use HTML::Entities;

timethese( 300_000,
    {
        original => sub {
            map { HTML::Entities::num_entity($_) } 1..100
        },
        cached => sub {
            map { cached_num_entity($_) } 1..100
        },
    }
);

# Memoizes the sprintf result, keyed by the argument
my %cache;
sub cached_num_entity {
    return $cache{$_[0]} ||= sprintf("&#x%X;", ord($_[0]));
}
gives these results:
cached: 10 wallclock secs (10.52 usr + 0.00 sys = 10.52 CPU) @ 28517.11/s (n=300000)
original: 15 wallclock secs (14.28 usr + 0.00 sys = 14.28 CPU) @ 21008.40/s (n=300000)
Bottom line: Hash lookup is faster than the sprintf, so let's cache it.
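To back up the "no effect on behavior" claim, here is a minimal sketch of an equivalence check. It repeats the cached_num_entity definition from above so it runs on its own; mismatched_code_points is a hypothetical helper name, and the code point range is an arbitrary choice for illustration.

```perl
use strict;
use warnings;
use HTML::Entities ();

# Same caching wrapper as in the benchmark above
my %cache;
sub cached_num_entity {
    return $cache{$_[0]} ||= sprintf("&#x%X;", ord($_[0]));
}

# Hypothetical helper: returns the code points in [$first, $last] where
# the cached version disagrees with HTML::Entities::num_entity.
sub mismatched_code_points {
    my ($first, $last) = @_;
    return grep {
        my $char     = chr($_);
        my $expected = HTML::Entities::num_entity($char);
        my $miss     = cached_num_entity($char);   # first call fills the cache
        my $hit      = cached_num_entity($char);   # second call reads it back
        $miss ne $expected || $hit ne $expected;
    } $first .. $last;
}

my @bad = mismatched_code_points(0, 0x2FFF);
print @bad ? "mismatches at: @bad\n" : "no mismatches\n";
```

The `||=` is safe here because every entity string is non-empty and therefore true, so a cached value is never recomputed.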
The other tweak would be to cache the call to num_entity inside the main regex in encode_entities. Swap this:
$$ref =~ s/([^\n\r\t !\#\$%\(-;=?-~])/$char2entity{$1} || num_entity($1)/ge;
for this:
$$ref =~ s/([^\n\r\t !\#\$%\(-;=?-~])/$char2entity{$1} ||= num_entity($1)/ge;
This would have the side effect of modifying the %char2entity hash, which is visible to the outside world. If that wasn't OK, we could have a private copy of the hash specifically so it would be modifiable. The potential downside (or upside?) of that would be that if someone outside the module modified %char2entity, it would have no effect on encode_entities.
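The private-copy alternative could look roughly like the sketch below. This is an illustration under stated assumptions, not a patch against the module: %private_char2entity and encode_with_private_cache are hypothetical names, and the function stands in for only the default-character-set branch of encode_entities.

```perl
use strict;
use warnings;
use HTML::Entities ();

# Hypothetical private table: seeded once from the public hash, then
# grown lazily. Later outside changes to %char2entity are not seen here.
my %private_char2entity = %HTML::Entities::char2entity;

# Hypothetical stand-in for the default branch of encode_entities.
# Uses the same regex as above, but caches into the private copy.
sub encode_with_private_cache {
    my ($text) = @_;
    $text =~ s/([^\n\r\t !\#\$%\(-;=?-~])/$private_char2entity{$1} ||= HTML::Entities::num_entity($1)/ge;
    return $text;
}

my $before = keys %HTML::Entities::char2entity;
print encode_with_private_cache( "text in " . chr(8855) . chr(8854) ), "\n";
my $after = keys %HTML::Entities::char2entity;
print "public table unchanged\n" if $before == $after;
```

The trade-off is the one described above: the public hash stays untouched, at the cost of one hash copy at load time and of ignoring any outside edits to %char2entity.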
For benchmarking encode_entities, I used this:
my $ent = chr(8855);
my $num = chr(8854);
my $unencoded = "$ent$num" x 10;
my $text = <<"HTML";
text in "$unencoded"
HTML

timethese( 1_000_000,
    {
        encode => sub { my $x = encode_entities($text) },
    }
);
Results:
42,281/s for the original unmodified encode_entities.
52,746/s if the encode_entities used the caching num_entity first mentioned, but the main regex is unchanged.
64,769/s if the main conversion regex caches the results of calls to num_entity in %char2entity. Changing this to call the caching num_entity gave no noticeable improvement.
I hope these give some ideas. encode_entities is an absolute workhorse at my job (we generate everything with Template Toolkit), and I'm sure it is for many, many others. Any speedup would have wide-ranging benefits.