undocumented string hosts entries are parsed unsafely with strtok on a Python Unicode buffer #1057

Description

@mpszn

Python client 19.2.1: undocumented string hosts entries are parsed unsafely with strtok on a Python Unicode buffer

Summary

The Python client documentation shows hosts as tuples like [("127.0.0.1", 3000)], but the constructor also accepts bare string entries such as "host:3000". That string path is parsed unsafely: it casts away const from PyUnicode_AsUTF8(py_host) and passes the resulting pointer to strtok().

This means the client silently accepts an undocumented config shape and parses it by mutating Python-owned string storage. Even when parsing appears to work, that path relies on undefined behavior and is not safe.

Environment

  • Aerospike Python client 19.2.1
  • Reproduced while investigating startup behavior on Debian Bookworm with Python 3.11.2
  • Also relevant on Debian Trixie with Python 3.13.5 because the parser logic is in the shared Python extension code

Minimal Reproduction

from aerospike import Client

conf = {
    "hosts": ["seed1.example:3000"],
}

client = Client(conf)

Expected Behavior

One of these should happen:

  • only the documented tuple form should be accepted, with a clear validation error for strings
  • or string entries should be explicitly documented and parsed safely without mutating Python-owned buffers

Actual Behavior

  • The undocumented string form is silently accepted.
  • The parser calls strtok((char *)PyUnicode_AsUTF8(py_host), ":").
  • It then tokenizes a duplicated copy of the string a second time and converts the port with atoi(), which performs no syntax or range validation.
  • IPv6-style addresses are inherently broken by colon-splitting.
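The practical effect of colon-splitting plus atoi() can be illustrated with a small Python emulation. This is only a sketch of the general strtok()/atoi() semantics, not the client's actual code: the helper names are hypothetical, and it assumes the first token is taken as the host and the second as the port, with a missing port treated as 0.

```python
import re


def strtok_like_split(text, delim=":"):
    """Mimic repeated strtok() calls: split on the delimiter, dropping empty tokens."""
    return [tok for tok in text.split(delim) if tok]


def atoi_like(text):
    """Mimic C atoi(): skip leading whitespace, read an optional sign and digits,
    ignore any trailing characters, and return 0 if no digits are found."""
    match = re.match(r"[ \t\n\v\f\r]*([+-]?[0-9]+)", text)
    return int(match.group(1)) if match else 0


def lenient_parse(entry):
    """Assumption: first token is the host, second token is the port."""
    tokens = strtok_like_split(entry)
    host = tokens[0] if tokens else None
    port = atoi_like(tokens[1]) if len(tokens) > 1 else 0
    return host, port


print(lenient_parse("seed1.example:3000"))     # ('seed1.example', 3000)
print(lenient_parse("seed1.example:3000abc"))  # ('seed1.example', 3000), trailing junk ignored
print(lenient_parse("seed1.example:abc"))      # ('seed1.example', 0), invalid port becomes 0
print(lenient_parse("[2001:db8::1]:3000"))     # ('[2001', 0), IPv6 literal is mangled
```

Under these assumptions, malformed ports and IPv6 literals produce a plausible-looking result instead of an error, which is what makes the failures hard to diagnose.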

Impact

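  • Undefined behavior: strtok() writes NUL bytes over the delimiters in the buffer returned by PyUnicode_AsUTF8(), which is owned by the Python string object and must not be modified.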
  • Silent acceptance of an undocumented config shape.
  • Weak validation of the port component.
  • Hard-to-debug, environment-sensitive behavior when users rely on "host:port" shorthand.

Technical Analysis

The documented, tuple-based path is straightforward: the constructor reads a tuple, copies the address string, and requires the port to be an integer.

The string path is materially different:

  • it accepts a bare Unicode string instead of a tuple
  • it casts away const from PyUnicode_AsUTF8()
  • it tokenizes that pointer with strtok()
  • it then tokenizes a duplicated copy again
  • it uses atoi() for the port without strict syntax checking

That makes the string shorthand qualitatively less safe than the documented tuple form.

Relevant Source Locations

Verified against the extracted 19.2.1 source tree.

  • src/main/aerospike.c:51-57 documents hosts as tuple entries like [("127.0.0.1", 3000)].
  • src/main/client/type.c:758-784 shows the tuple path, including integer validation for the port.
  • src/main/client/type.c:787-794 shows the undocumented string path and the strtok((char *)PyUnicode_AsUTF8(py_host), ":") call.
  • src/main/client/type.c:795-806 adds the parsed host to the config if any address string was produced.

Suggested Fixes

  • Reject bare string entries in hosts with a clear error message, keeping only the documented tuple form.
  • Or, if string entries are intended to be supported, parse them from a copied buffer without mutating Python-owned storage.
  • Replace atoi() with strict validation.
  • Explicitly define whether IPv6 literals are supported in this input form.
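If string entries are kept, the parsing rules could look roughly like the following. It is written in Python only to make the rules easy to read; the real fix would live in the C extension. The function name parse_host_entry, the bracketed "[addr]:port" convention for IPv6, and the 1 to 65535 port range are assumptions made for this sketch, not existing client behavior.

```python
def parse_host_entry(entry):
    """Strictly parse 'host:port' or '[ipv6]:port' into a (host, port) tuple.

    Works on the immutable Python string itself; nothing is modified in place.
    """
    if not isinstance(entry, str):
        raise TypeError(f"expected str, got {type(entry).__name__}")

    if entry.startswith("["):
        # Assumed convention: IPv6 literals must be bracketed, e.g. "[::1]:3000".
        end = entry.find("]")
        if end == -1 or entry[end + 1:end + 2] != ":":
            raise ValueError(f"malformed bracketed address: {entry!r}")
        host, port_text = entry[1:end], entry[end + 2:]
    else:
        host, sep, port_text = entry.rpartition(":")
        if not sep:
            raise ValueError(f"missing ':port' in {entry!r}")
        if ":" in host:
            raise ValueError(f"unbracketed IPv6 literal is ambiguous: {entry!r}")

    if not host:
        raise ValueError(f"empty host in {entry!r}")

    # Strict port syntax: ASCII digits only, no sign, whitespace, or trailing junk.
    if not (port_text.isascii() and port_text.isdigit()):
        raise ValueError(f"invalid port {port_text!r} in {entry!r}")
    port = int(port_text)
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range in {entry!r}")
    return host, port


print(parse_host_entry("seed1.example:3000"))  # ('seed1.example', 3000)
print(parse_host_entry("[::1]:3000"))          # ('::1', 3000)
```

The same function also works as a caller-side workaround: convert any "host:port" strings to tuples before building the config, so that only the documented tuple form reaches the client.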

Likely Fix Scope

  • Primary fix surface is the Python wrapper constructor in src/main/client/type.c.
  • The lowest-risk fix is to reject bare string hosts entries outright and enforce the documented tuple form.
  • If compatibility requires keeping string support, the parser should be rewritten locally in that same constructor path to operate on copied buffers, validate ports strictly, and define IPv6 behavior explicitly.
  • Risk is low if the undocumented shorthand is rejected, and low to medium if compatibility parsing is retained.
  • The most useful regression tests would cover documented tuple inputs, invalid string inputs, port validation failures, and any explicitly supported IPv6 forms.
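The lowest-risk option, rejecting bare strings, together with a few of the regression cases listed above, might be sketched as follows. This is an illustration in Python rather than the C code that would actually change. The validate_hosts name is hypothetical, and the sketch checks only the two-element (address, port) form shown in this issue; any other tuple forms the client may document are out of scope here.

```python
def validate_hosts(hosts):
    """Accept only (address, port) tuples; reject strings and anything else."""
    validated = []
    for index, entry in enumerate(hosts):
        if isinstance(entry, str):
            raise TypeError(
                f"hosts[{index}]: expected an (address, port) tuple, "
                f"got string {entry!r}"
            )
        if not isinstance(entry, tuple) or len(entry) != 2:
            raise TypeError(f"hosts[{index}]: expected an (address, port) tuple")
        address, port = entry
        if not isinstance(address, str):
            raise TypeError(f"hosts[{index}]: address must be a string")
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(port, bool) or not isinstance(port, int):
            raise TypeError(f"hosts[{index}]: port must be an integer")
        validated.append((address, port))
    return validated


# Regression-style cases: documented tuples pass, everything else is rejected.
assert validate_hosts([("127.0.0.1", 3000)]) == [("127.0.0.1", 3000)]
for bad in (["seed1.example:3000"], [("seed1.example", "3000")], [("seed1.example",)]):
    try:
        validate_hosts(bad)
    except TypeError as exc:
        print("rejected:", exc)
    else:
        raise AssertionError(f"accepted {bad!r}")
```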

Notes

We found this bug while searching for the cause of an unexplained slowdown on client startup, which we eventually traced to problems with the shared memory feature. I will file separate issues for those.
