
Metadata 1 #16

Draft
toschoosqd wants to merge 24 commits into main from metadata-1

Conversation

@toschoosqd
Contributor

First iteration for metadata with some preliminary ideas.
Highly relevant is #3

Updated metadata documentation with corrections and clarifications.
Updated terminology and corrected size units in metadata documentation. Added sections on conversions and statistics, and improved clarity in various explanations.
Expanded the metadata document to include detailed discussions on metadata purpose, goals, and formats, along with specific short-term and long-term objectives.
Revised the metadata document to improve clarity and consistency in language, including updates to the purpose, schema definitions, and type system descriptions.
Corrected spelling of 'modelling' to 'modeling' throughout the document.
Added comment to clarify hash type options.
Updated metadata document to improve clarity and consistency in terminology, including changes to key definitions and type representations.
Corrected a typo in the metadata documentation regarding chunk summary.
Add comment regarding key range sharding and chunk handling.
@dzhelezov
Contributor

One of the key architectural decisions to be made is whether we put the dataset properties (that is, the properties of the data itself) into the metadata, or leave it purely schema-oriented. Introducing the statistics already suggests that we want to be data-aware here. Then we should also think about where we store:

  • the data location
  • the data updates history (git-like?)

A previous attempt to design such a location-aware and update/edit-aware metadata file was made in this issue: https://github.com/subsquid/datas3ts/issues/5

There the design was built around the following properties:

  • the schema is immutable and the immutable part of the metadata should be self-certified
  • any dataset snapshot can be identified by a single hash (so that it can be published on-chain or elsewhere, and it will self-certify the full data in there, similar to how Merkle trees work)
  • the locations of the dataset files are part of the dataset description but can be updated at any time, so that the workers may download the necessary chunks from multiple locations (e.g. IPFS, S3, etc.)
  • data appends are efficient and don't require a full re-calculation of the dataset hashes, only incremental updates

Some extra thought and care should be taken not to trigger expensive list operations, similar to how we currently avoid them by placing the files in a tree-like directory structure in S3 buckets.
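
A minimal sketch of how such a split between an immutable, self-certifying part and a mutable, location-carrying part could look. All names and the flat hashing scheme below are illustrative assumptions, not the actual design from that issue:

```python
# Illustrative sketch only: one possible split between the immutable,
# self-certifying part of a dataset descriptor and its mutable part.
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ImmutablePart:
    schema: dict         # the immutable schema definition
    chunk_hashes: tuple  # per-chunk content hashes, grown by appends

    def dataset_hash(self) -> str:
        # A single hash identifying this snapshot. A Merkle layout would
        # make incremental appends cheaper than re-hashing this flat list.
        payload = json.dumps(
            {"schema": self.schema, "chunks": list(self.chunk_hashes)},
            sort_keys=True,
        ).encode()
        return hashlib.sha256(payload).hexdigest()


@dataclass
class MutablePart:
    # Locations can change at any time without affecting the dataset hash,
    # so workers may fetch chunks from multiple sources (IPFS, S3, ...).
    chunk_locations: dict  # chunk hash -> list of URLs
```

An append would then add one chunk hash to the immutable part and update the snapshot hash, while location changes touch only the mutable part.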

@define-null left a comment


I did a first pass on this draft and left a bunch of comments and questions.

- Hand-crafted ingestion pipelines

- Validation and Parsing of data for different components
(portals, workers, SDKs, the DuckDB extension).


One important question that is not clear to me from the draft: is the intent to implement a schema-on-read or a schema-on-write architecture? In the former case we are talking about traditional data lakes with raw/semi-structured data, rather limited correctness checks, and the schema applied when running the query (with fewer integrity constraints enforced, and only on read). In the latter we are aiming for stricter correctness (integrity constraints enforced on write) and consistency.

From that perspective I'm not sure whether https://github.com/subsquid/specs/pull/3/changes is assumed in this document or not.

Contributor Author


In my understanding this is precisely what distinguishes long term and short term: in the long run we want to have schema-on-write, but we won't have that by the end of February.
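
To make the distinction concrete, a minimal sketch (the schema and field names are made up for illustration): schema-on-write rejects non-conforming rows before they are stored, while schema-on-read stores raw records and applies the schema at query time.

```python
# Illustrative sketch only; SCHEMA and the field names are hypothetical.
SCHEMA = {"block_number": int, "hash": str}

def write_validated(row: dict, sink: list) -> None:
    # schema-on-write: integrity is enforced before the row is stored
    for name, typ in SCHEMA.items():
        if not isinstance(row.get(name), typ):
            raise ValueError(f"{name} must be {typ.__name__}")
    sink.append(row)

def read_with_schema(raw: dict) -> dict:
    # schema-on-read: the raw record is cast to the schema at query time
    return {name: typ(raw[name]) for name, typ in SCHEMA.items()}
```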

Comment on lines +101 to +102
In the future, we may add statistics to columns or row groups
to accelerate ingestion and, in particular, retrieval.


If we aim for analytical use cases, statistics would be essential, as in such systems the common pattern is to use pruning techniques to further reduce the subset of data involved in query execution. So in my view we should prioritize column and row-group statistics such as min/max, cardinality, null counts, etc. from the start.

Contributor Author


I agree. But first we have to set up the infrastructure to get stats in the first place. For append-only Parquet files, we can generate stats during ingestion. For hotblocks it is a bit more difficult. In general, I imagine something like an assignment for statistics.
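
For the append-only Parquet case, a minimal sketch of what collecting such stats at ingestion time could look like, using pyarrow to read the per-row-group statistics the writer already embeds; where these stats land in the metadata is still open:

```python
# Illustrative sketch only: collecting per-column min/max and null counts
# from Parquet row groups, e.g. as an ingestion post-step.
import pyarrow.parquet as pq

def collect_stats(path: str) -> list[dict]:
    stats = []
    md = pq.ParquetFile(path).metadata
    for rg in range(md.num_row_groups):
        group = md.row_group(rg)
        for col in range(group.num_columns):
            column = group.column(col)
            s = column.statistics
            if s is None:
                continue  # the writer did not emit statistics for this column
            stats.append({
                "row_group": rg,
                "column": column.path_in_schema,
                "min": s.min,
                "max": s.max,
                "null_count": s.null_count,
            })
    return stats
```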


Types are distinguished into **primitive types** and **complex types**.

Primitive types are defined in terms of


When it comes to types, in my view it's important to consider several factors:

  • what is the minimum subset of types that we need for a POC?
  • what is the compatibility story with other existing dbs and engines?
  • what are the conversion rules that we would like to have for those types?

Contributor Author


  • Minimum viable subset of types: integer, float, string, char, bool, blob, list/array, object. (The last three are missing or incomplete).
  • Concerning compatibility, I was focusing on the components we have to address now: blockchain data, Parquet, DuckDB, Rust, C++, TypeScript, Python.
  • Conversion rules: reading/writing from/to the named components.
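
A minimal sketch of what such a primitive-type set and its conversion table towards two of those components (Arrow/Parquet and DuckDB) could look like; the names and mappings below are illustrative assumptions, not part of the spec:

```python
# Illustrative sketch only: a minimal primitive-type enumeration and a
# hypothetical conversion table to Arrow and DuckDB type names.
# Complex types (list/array, object) and char/blob details are omitted.
from enum import Enum


class PrimitiveType(Enum):
    INT64 = "int64"
    FLOAT64 = "float64"
    STRING = "string"
    BOOL = "bool"
    BLOB = "blob"


# metadata type -> (Arrow type name, DuckDB type name)
TYPE_MAP = {
    PrimitiveType.INT64: ("int64", "BIGINT"),
    PrimitiveType.FLOAT64: ("double", "DOUBLE"),
    PrimitiveType.STRING: ("string", "VARCHAR"),
    PrimitiveType.BOOL: ("bool", "BOOLEAN"),
    PrimitiveType.BLOB: ("binary", "BLOB"),
}
```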

Comment on lines +145 to +148
Schemas shall also include elements for defining real-time data.
This may include an endpoint from which data is read,
and a stored procedure (or equivalent processing step)
that transforms data and passes it on to an internal API.


Just to confirm that I understand you correctly: are we talking about an ETL pipeline here, with the possibility to specify the transformation part?

Contributor Author


The wording is very abstract, and indeed hard to understand. I am trying to generalise how we handle hotblocks: we have a set of endpoints as data sources and a processing step that does something with this data (e.g. stores it in a temporary database).
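
A minimal sketch of how such a real-time element could be expressed in the metadata, generalising the hotblocks setup; all field names and values below are hypothetical:

```python
# Illustrative sketch only: a possible shape for a real-time data element
# in the metadata (endpoints + processing step + sink).
from dataclasses import dataclass


@dataclass
class RealtimeSource:
    endpoints: list[str]   # data sources, e.g. hotblock endpoints
    processing_step: str   # stored procedure or equivalent transformation
    sink: str              # internal API or temporary database target


example = RealtimeSource(
    endpoints=["https://example.invalid/hotblocks"],  # placeholder URL
    processing_step="normalize_blocks",               # hypothetical step
    sink="tmp_blocks",                                 # hypothetical target
)
```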

Added an overview section and updated the discussion on data types.
Updated metadata structure and elements sections, clarified indexing and routing strategies, and proposed SQL for schema management.
Updated section headings and added references to sections for better organization and clarity.
Clarified scope for the first iteration of metadata processing, detailing in-scope and out-of-scope items. Expanded sections on archive and real-time data, including stored procedures and statistics.
Added detailed explanation of IPLD, its structure, and use cases.
Updated markdown links and fixed formatting issues throughout the document.
