
Metadata Documentation

This document describes the various metadata files generated by a TableVault repository to track table states and transformations.

While these files should never be manually modified by users, they provide essential insights into the repository's internal state. Below is a detailed explanation of each metadata file type and its contents.


General Metadata

metadata/logs.txt and metadata/log_ids.txt

These files record information about completed processes. logs.txt holds the full details of each entry, whereas log_ids.txt lists only the process_id.

metadata/active_logs.json

This file stores details about all currently active processes. Users can directly query this data through the TableVault API.
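For read-only inspection outside the API, the file can be parsed as ordinary JSON. The sketch below is only an illustration: the helper name `load_active_logs` is hypothetical, the internal layout of the file is not described here, and the function simply returns whatever the JSON parser produces. The TableVault API remains the intended way to query this data.

```python
import json
from pathlib import Path


def load_active_logs(repo_dir: str):
    """Parse metadata/active_logs.json without modifying it.

    Hypothetical helper: the schema of the file is not assumed,
    so the parsed JSON value is returned as-is.
    """
    path = Path(repo_dir) / "metadata" / "active_logs.json"
    if not path.exists():
        # No file yet: treat it as "no active processes".
        return {}
    with path.open("r", encoding="utf-8") as f:
        return json.load(f)
```

The file is opened in read mode only, which is consistent with the rule that metadata files should never be edited by hand.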

metadata/tables_history.json, metadata/columns_history.json, and metadata/tables_temp.json

These files maintain a historical record of all fully stored table instances:

  • tables_history.json records:

      • When a dataframe was first created (potentially by a different table instance).
      • When an instance was initially materialized.
      • When an instance ceased to be active.

  • columns_history.json tracks creation at the column level.

  • tables_temp.json records temporary table instances.

These files are used for internal operations, optimization strategies, and historical tracking.

locks/*

TableVault implements custom file-based read-write locks to enable multiprocessing capabilities.

_temp/*

This directory temporarily stores previous file states during active operations. If an operation fails, it allows safe restoration of the repository's previous state.
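The general backup-and-restore pattern described above can be sketched as follows. This is not TableVault's implementation: the function name `restore_on_failure` and the single-file scope are assumptions made for illustration, and the real _temp directory may be organized differently.

```python
import shutil
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def restore_on_failure(target: str, temp_dir: str):
    """Copy `target` into `temp_dir`, then restore it if the body raises.

    Hypothetical sketch of a backup/rollback pattern for one file.
    """
    target_path = Path(target)
    backup_path = Path(temp_dir) / target_path.name
    Path(temp_dir).mkdir(parents=True, exist_ok=True)
    shutil.copy2(target_path, backup_path)  # save the previous state
    try:
        yield
    except BaseException:
        # The operation failed: put the previous state back, then re-raise.
        shutil.copy2(backup_path, target_path)
        raise
    finally:
        # The backup is only needed while the operation is active.
        backup_path.unlink(missing_ok=True)
```

Wrapping a write in `with restore_on_failure(path, temp_dir):` leaves the file untouched if the block raises and keeps the new contents otherwise.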

metadata/ARCHIVED_TRASH/*

Upon deletion of tables and instances, their dataframes and artifacts are removed, but the associated metadata is archived in this folder. This feature preserves historical context.

Note

Files within this folder can safely be deleted if storage space is limited. Typically, these files occupy minimal space.
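Before deleting anything, it may be worth checking how much space the folder actually uses. The helper below is a generic sketch with a hypothetical name, `directory_size_bytes`; it only reads file sizes and makes no assumption about what the archived files contain.

```python
from pathlib import Path


def directory_size_bytes(directory: str) -> int:
    """Return the total size in bytes of all regular files under `directory`."""
    root = Path(directory)
    if not root.is_dir():
        return 0
    # rglob("*") walks the tree recursively; only regular files are counted.
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())
```

It could be called with the path to metadata/ARCHIVED_TRASH inside a repository.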

lock.LOCK file

This lock file ensures exclusive write access to the repository metadata, preventing concurrent write operations.
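To illustrate how a lock file can enforce exclusive access in general, here is a minimal sketch based on atomic file creation. It is not how TableVault manages lock.LOCK, whose mechanism is not described here; the function name, the timeout parameters, and the create-then-delete strategy are all assumptions chosen for the example.

```python
import os
import time
from contextlib import contextmanager


@contextmanager
def exclusive_file_lock(lock_path: str, timeout: float = 5.0, poll: float = 0.05):
    """Hold an exclusive lock by atomically creating `lock_path`.

    Hypothetical sketch: O_CREAT | O_EXCL fails if the file already
    exists, so only one holder can create it at a time.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            # Someone else holds the lock: wait and retry until the deadline.
            if time.monotonic() >= deadline:
                raise TimeoutError(f"could not acquire {lock_path}")
            time.sleep(poll)
    try:
        yield
    finally:
        # Release the lock by closing and removing the file.
        os.close(fd)
        os.remove(lock_path)
```

A second caller that tries to take the same lock while it is held waits until the timeout expires and then receives a `TimeoutError`.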

.tablevault file

This file identifies the directory explicitly as a TableVault repository.
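A script that needs to confirm it is pointed at a repository could test for this marker. The sketch below assumes the .tablevault file sits at the top level of the repository directory; the helper name `is_tablevault_repo` is hypothetical and not part of the TableVault API.

```python
from pathlib import Path


def is_tablevault_repo(directory: str) -> bool:
    """Return True if `directory` contains a .tablevault marker file.

    Assumption: the marker is a regular file at the repository root.
    """
    return (Path(directory) / ".tablevault").is_file()
```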


Table and Instance Metadata

descriptions.yaml

Each TableVault repository, table, and instance has a dedicated YAML file containing specific metadata. While some metadata is automatically generated, users can include optional free-form descriptions during creation. This capability allows arbitrary contextual details to be preserved.

dtypes.json

For each materialized instance, a dtypes.json file specifies the data types of all columns. This is particularly useful for managing custom data types and tracking artifact_string columns.
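As an illustration of how such a file might be used, the sketch below lists the columns that carry a given data type. It assumes, without confirmation from this documentation, that dtypes.json is a flat JSON object mapping column names to type names; the helper name `columns_with_dtype` is hypothetical.

```python
import json


def columns_with_dtype(dtypes_path: str, dtype: str) -> list[str]:
    """Return the column names whose recorded type equals `dtype`.

    Assumption: the file is a flat {"column_name": "type_name"} mapping.
    """
    with open(dtypes_path, "r", encoding="utf-8") as f:
        dtypes = json.load(f)
    return [column for column, name in dtypes.items() if name == dtype]
```

Under that assumption, calling it with `"artifact_string"` would return the columns that reference artifacts.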

EXECUTION_ARCHIVE/* folder

Each executed instance contains an EXECUTION_ARCHIVE folder, explicitly documenting the Python functions executed during the instance's lifecycle.