S3DataStore shards published without endpoint in ATProto records #81

@maxine-at-forecast

Summary

When using Index.write_samples() with an S3DataStore and an atmosphere target (@handle/name), the resulting ATProto record uses StorageHttp with raw s3://bucket/path URLs instead of StorageS3 with the bucket, keys, and endpoint.

This means consumers of the published dataset record have no way to resolve the shard URLs — they lack the S3 endpoint needed to connect.

Reproduction

import atdata
from atdata.stores._s3 import S3DataStore
from atdata.atmosphere.client import Atmosphere

store = S3DataStore(
    credentials={
        "AWS_ENDPOINT": "https://account.r2.cloudflarestorage.com",
        "AWS_ACCESS_KEY_ID": "...",
        "AWS_SECRET_ACCESS_KEY": "...",
    },
    bucket="my-bucket",
)
atmo = Atmosphere.login("handle", "password", base_url="https://my-pds.example.com")
index = atdata.Index(atmosphere=atmo, data_store=store)

entry = index.write_samples(
    samples,
    name="@handle/my-dataset",
    data_store=store,
)
# entry.data_urls contains: ["s3://my-bucket/my-dataset/data--uuid--000000.tar"]
# ATProto record uses StorageHttp with these as URLs — no endpoint info

Expected behavior

When data_store is an S3DataStore, the published ATProto record should use StorageS3 with:

  • bucket: from the S3DataStore
  • endpoint: from the S3DataStore credentials
  • shards[].key: the key portion of each shard URL

This is already supported by DatasetPublisher.publish_with_s3(); it just isn't wired up in the Index.write_samples → insert_dataset path.
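
To make the expected mapping concrete, here is a minimal sketch of how the s3:// shard URLs could be split into the bucket, endpoint, and per-shard keys listed above. The helper names (split_s3_url, storage_s3_fields) and the returned dict layout are hypothetical and only mirror the field names used in this issue; they are not atdata's actual API or the real StorageS3 record schema.

```python
def split_s3_url(url):
    """Split 's3://bucket/some/key' into (bucket, key)."""
    prefix = "s3://"
    if not url.startswith(prefix):
        raise ValueError(f"not an s3:// URL: {url!r}")
    # Plain string splitting keeps characters like '#' or '?' inside the key.
    bucket, _, key = url[len(prefix):].partition("/")
    if not bucket or not key:
        raise ValueError(f"missing bucket or key in {url!r}")
    return bucket, key


def storage_s3_fields(data_urls, bucket, endpoint):
    """Build a StorageS3-like dict (hypothetical layout) from shard URLs.

    `bucket` and `endpoint` are assumed to come from the S3DataStore and its
    credentials; every URL must point into that same bucket.
    """
    shards = []
    for url in data_urls:
        url_bucket, key = split_s3_url(url)
        if url_bucket != bucket:
            raise ValueError(f"{url!r} is not in bucket {bucket!r}")
        shards.append({"key": key})
    return {"bucket": bucket, "endpoint": endpoint, "shards": shards}


fields = storage_s3_fields(
    ["s3://my-bucket/my-dataset/data--uuid--000000.tar"],
    bucket="my-bucket",
    endpoint="https://account.r2.cloudflarestorage.com",
)
```

With the values from the reproduction, fields["shards"][0]["key"] would be "my-dataset/data--uuid--000000.tar", and the endpoint travels with the record so a consumer has what it needs to connect.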

Workaround

Manually call S3DataStore.write_shards() then DatasetPublisher.publish_with_s3() with explicit bucket/keys/endpoint.
