Skip to content

Potential speedups for HTML::Entities::encode_entities #30

@petdance

Description

@petdance

I think we could get some speedups in encode_entities by caching some common operations. The examples below would most benefit users encoding lots of data that is heavy on non-named, numeric entities. We may have similar benefits elsewhere.

The first is to cache the results of the sprintf in num_entity. This could be done with no effect on behavior, in exchange for some hash entries. Here are my experiments.

use Benchmark ':all';
use HTML::Entities;

timethese( 300_000,
    {
        original => sub {
            map { HTML::Entities::num_entity($_) } 1..100
        },
        cached => sub {
            map { cached_num_entity($_) } 1..100
        },
    }
);


my %cache;
sub cached_num_entity {
    return $cache{$_[0]} ||= sprintf("&#x%X;", ord($_[0]));
}

gives these results:

    cached: 10 wallclock secs (10.52 usr +  0.00 sys = 10.52 CPU) @ 28517.11/s (n=300000)
  original: 15 wallclock secs (14.28 usr +  0.00 sys = 14.28 CPU) @ 21008.40/s (n=300000)

Bottom line: Hash lookup is faster than the sprintf, so let's cache it.


The other tweak would be to cache the call to num_entity inside the main regex in encode_entities. Swap this:

$$ref =~ s/([^\n\r\t !\#\$%\(-;=?-~])/$char2entity{$1} || num_entity($1)/ge;

for this

$$ref =~ s/([^\n\r\t !\#\$%\(-;=?-~])/$char2entity{$1} ||= num_entity($1)/ge;

This would have the side effect of modifying the %char2entity hash, which is visible to the outside world. If that wasn't OK, we could have a private copy of the hash specifically so it would be modifiable. The potential downside (or upside?) of that would be that if someone outside the module modified %char2entity, it would have no effect on encode_entities.

For benchmarking encode_entities, I used this:

my $ent = chr(8855);
my $num = chr(8854);
my $unencoded = "$ent$num" x 10;

my $text = <<"HTML";
text in "$unencoded"
HTML

timethese( 1_000_000,
    {
        encode => sub { my $x = encode_entities($text) },
    }
);

Results:

42,281/s for the original unmodified encode_entities.

52,746/s if the encode_entities used the caching num_entity first mentioned, but the main regex is unchanged.

64,769/s if the main conversion regex caches the results of calls to num_entity in %char2entity. Changing this to call the caching num_entity gave no noticeable improvement.


I hope these give some ideas. encode_entities is an absolute workhorse at my job (we generate everything with Template Toolkit), and I'm sure for many many others. Any speedup would have wide-ranging benefits.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions