Metadata Documentation
This document describes the various metadata files generated by a TableVault repository to track table states and transformations.
While these files should never be manually modified by users, they provide essential insights into the repository's internal state. Below is a detailed explanation of each metadata file type and its contents.
General Metadata
metadata/logs.txt and metadata/log_ids.txt
These files record information about completed processes. logs.txt includes comprehensive details for each one, whereas log_ids.txt lists only the process_id.
metadata/active_logs.json
This file stores details about all currently active processes. Users can directly query this data through the TableVault API.
metadata/tables_history.json, metadata/columns_history.json, and metadata/tables_temp.json
These files maintain a historical record of all fully stored table instances:
- tables_history.json records:
  - When a dataframe was first created (potentially by a different table instance).
  - When an instance was initially materialized.
  - When an instance ceased to be active.
- columns_history.json tracks creation at the column level.
- tables_temp.json records temporary table instances.
These files are used for internal operations, optimization strategies, and historical tracking.
locks/*
TableVault implements custom file-based read-write locks to enable multiprocessing capabilities.
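To illustrate the general idea behind file-based locking, the following is a minimal sketch of an exclusive lock built on atomic file creation. It is not TableVault's implementation: it covers only the exclusive (writer) case rather than a full read-write lock, and the function name, timeout, and polling interval are assumptions made for illustration.

```python
import os
import time
from contextlib import contextmanager


@contextmanager
def exclusive_file_lock(lock_path, timeout=5.0, poll_interval=0.05):
    """Hold an exclusive lock represented by the existence of lock_path.

    Hypothetical helper: illustrates the technique only.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_CREAT | O_EXCL fails if the file already exists, so only one
            # process can succeed in creating the lock file.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"could not acquire lock: {lock_path}")
            time.sleep(poll_interval)
    try:
        # Record the owner's process id to help with debugging stale locks.
        os.write(fd, str(os.getpid()).encode())
        yield
    finally:
        os.close(fd)
        os.remove(lock_path)
```

A caller would wrap a critical section in `with exclusive_file_lock(path):`. A second process attempting the same lock waits until the file disappears or the timeout expires.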
_temp/*
This directory temporarily stores previous file states during active operations. If an operation fails, it allows safe restoration of the repository's previous state.
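The backup-and-restore pattern described here can be sketched as follows. This is an illustrative example under stated assumptions, not TableVault's actual code: the helper name `restore_on_failure` and the choice to back up a single file are hypothetical.

```python
import shutil
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def restore_on_failure(target, temp_dir):
    """Copy target into temp_dir before an operation; restore it on error.

    Hypothetical helper: illustrates the pattern for a single file.
    """
    target = Path(target)
    temp_dir = Path(temp_dir)
    temp_dir.mkdir(parents=True, exist_ok=True)
    backup = temp_dir / target.name
    existed = target.exists()
    if existed:
        shutil.copy2(target, backup)  # save the previous state
    try:
        yield target
    except Exception:
        if existed:
            shutil.copy2(backup, target)  # roll back to the saved state
        elif target.exists():
            target.unlink()  # the file was newly created; remove it
        raise
    finally:
        if backup.exists():
            backup.unlink()  # the backup is only needed during the operation
```

If the body of the `with` block raises, the file is returned to its earlier contents before the exception propagates. On success, the backup is simply discarded.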
metadata/ARCHIVED_TRASH/*
Upon deletion of tables and instances, their dataframes and artifacts are removed, but the associated metadata is archived in this folder. This feature preserves historical context.
Note
Files within this folder can safely be deleted if storage space is limited. Typically, these files occupy minimal space.
lock.LOCK file
This lock file ensures exclusive write access to the repository metadata, preventing concurrent write operations.
.tablevault file
This file identifies the directory explicitly as a TableVault repository.
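As a small illustration, a script could check for this marker before treating a directory as a repository. The sketch below assumes the .tablevault file sits at the top level of the repository directory. The function name is hypothetical and is not part of the TableVault API.

```python
from pathlib import Path


def is_tablevault_repo(directory):
    """Return True if directory contains a .tablevault marker file.

    Hypothetical helper; assumes the marker is at the repository root.
    """
    return (Path(directory) / ".tablevault").is_file()
```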
Table and Instance Metadata
descriptions.yaml
Each TableVault repository, table, and instance has a dedicated YAML file containing specific metadata. While some metadata is automatically generated, users can include optional free-form descriptions during creation. This capability allows arbitrary contextual details to be preserved.
dtypes.json
For each materialized instance, a dtypes.json file specifies the data types of all columns. This is particularly useful for managing custom data types and tracking artifact_string columns.
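The following read-only sketch shows how such a file might be inspected to find artifact columns. It rests on two assumptions that this document does not confirm: that dtypes.json is a flat JSON object mapping column names to dtype strings, and that artifact columns carry the literal dtype value "artifact_string". The function name is hypothetical.

```python
import json
from pathlib import Path


def artifact_columns(dtypes_path, artifact_dtype="artifact_string"):
    """List columns whose recorded dtype equals artifact_dtype.

    Assumes a flat {column_name: dtype_string} layout, which is a guess
    about the file format rather than a documented schema.
    """
    with Path(dtypes_path).open(encoding="utf-8") as f:
        dtypes = json.load(f)
    return [col for col, dtype in dtypes.items() if dtype == artifact_dtype]
```

The file is only read, never written, in keeping with the guidance that metadata files should not be modified manually.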
EXECUTION_ARCHIVE/* folder
Each executed instance contains an EXECUTION_ARCHIVE folder, explicitly documenting the Python functions executed during the instance's lifecycle.