Types and Type Queries

TableVault supports five list categories: file lists, document lists, embedding lists, record lists, and process lists.

You can query each type directly through the TableVault API. This page summarizes the characteristics of each type and the corresponding query patterns.

Data Types

File List

Each file item stores a location string that points to an external file. TableVault does not verify this location internally, so any string is accepted.

# Create a file list
vault.create_file_list("trained_models")

# Append file references
vault.append_file("trained_models", "models/classifier_v1.pkl")
vault.append_file("trained_models", "models/classifier_v2.pkl")

# Append with input dependencies (links to other items)
vault.append_file(
    "trained_models",
    "models/ensemble.pkl",
    input_items={"training_data": [0, 100]}  # depends on positions 0-100 of training_data
)
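
Because any string is accepted as a location, it can help to check paths yourself before appending them. The following is a minimal sketch, assuming the locations are local filesystem paths. `missing_locations` is a hypothetical helper, not part of the TableVault API.

```python
import os


def missing_locations(locations):
    """Return the locations that do not exist on the local filesystem.

    Hypothetical helper: TableVault itself performs no such check.
    """
    return [loc for loc in locations if not os.path.exists(loc)]
```

You might call this before `vault.append_file` and skip or log anything it returns. For remote locations (object stores, URLs) a different existence check would be needed.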

Document List

Each document item stores a text string. Documents are useful for storing text chunks, logs, or any other string-based data.

# Create a document list
vault.create_document_list("research_notes")

# Append documents
vault.append_document("research_notes", "Initial experiment showed 95% accuracy.")
vault.append_document("research_notes", "Second run with more data: 97% accuracy.")

# Append with specific position tracking
vault.append_document(
    "research_notes",
    "Final results summary.",
    index=2,
    start_position=100
)

Embedding List

Each embedding item is a one-dimensional array of floats. The dimension is fixed when the list is created, and every embedding in the list must match it.

# Create an embedding list (specify dimension)
vault.create_embedding_list("document_embeddings", ndim=1024)

# Append embeddings
embedding_vector = [0.1] * 1024  # placeholder 1024-dimensional vector
vault.append_embedding("document_embeddings", embedding_vector)

# Append with input dependencies
vault.append_embedding(
    "document_embeddings",
    another_embedding,
    input_items={"research_notes": [0, 1]},  # embedding derived from documents at positions 0-1
    build_idx=True  # rebuild vector index for similarity search
)
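
Since every embedding in a list must share the same dimensionality, a small check before appending can catch shape mistakes early. This is a sketch only: `validate_embedding` is a hypothetical helper, and how TableVault itself reports a dimension mismatch is not described here.

```python
import math


def validate_embedding(vector, ndim):
    """Return the vector as a list of floats, or raise ValueError if it does not fit.

    Hypothetical pre-check; not part of the TableVault API.
    """
    values = [float(x) for x in vector]
    if len(values) != ndim:
        raise ValueError(f"expected {ndim} dimensions, got {len(values)}")
    if not all(math.isfinite(x) for x in values):
        raise ValueError("embedding contains NaN or infinite values")
    return values


# Placeholder vector matching a list created with ndim=1024.
embedding_vector = validate_embedding([0.1] * 1024, ndim=1024)
```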

Record List

Each record list stores structured data as dictionaries with predefined column names. Records are useful for storing tabular or structured metadata.

# Create a record list with column names
vault.create_record_list("experiment_results", column_names=["model", "accuracy", "params"])

# Append records
vault.append_record("experiment_results", {
    "model": "random_forest",
    "accuracy": 0.95,
    "params": {"n_estimators": 100, "max_depth": 10}
})

vault.append_record("experiment_results", {
    "model": "xgboost",
    "accuracy": 0.97,
    "params": {"learning_rate": 0.1, "n_estimators": 200}
})
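
Records are tied to the column names given at creation, so it can be useful to compare a dictionary against those names before appending it. The sketch below uses a hypothetical helper, `check_record`. Whether TableVault rejects, ignores, or fills in mismatched columns is not stated here, so treat this purely as a client-side check.

```python
def check_record(record, column_names):
    """Return (missing, unexpected) column names for a record dictionary.

    Hypothetical pre-check; not part of the TableVault API.
    """
    expected = set(column_names)
    present = set(record)
    missing = sorted(expected - present)
    unexpected = sorted(present - expected)
    return missing, unexpected


columns = ["model", "accuracy", "params"]
print(check_record({"model": "xgboost", "accuracy": 0.97}, columns))
# prints (['params'], [])
```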

Process List

Process lists are special lists that are automatically generated by TableVault for each process instance. When you initialize a Vault object in a Python process, subsequently executed code is stored within a process list. In a Python script, the entire code file is stored as one data item in a process list. In a Python notebook, each executed cell is stored as a data item (in order).

The parent of a new process list is set explicitly when the Vault object is initialized, through parent_process_name and parent_process_index. Set these values whenever one process is launched by another running process.
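
To make the relationship concrete, here is a toy model of two linked process lists. It is illustrative only: `ToyProcessList` is a hypothetical stand-in, not a TableVault class, and it assumes that parent_process_index refers to the position of the spawning item inside the parent process list.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ToyProcessList:
    """Illustrative stand-in for a process list; not a TableVault class."""
    name: str
    parent_process_name: Optional[str] = None
    parent_process_index: Optional[int] = None  # assumed: item position in the parent
    items: List[str] = field(default_factory=list)

    def record(self, code: str) -> int:
        """Store one executed unit of code and return its position."""
        self.items.append(code)
        return len(self.items) - 1


# Notebook-style parent: each executed cell becomes one item, in order.
parent = ToyProcessList("notebook_session")
parent.record("import pandas as pd")
spawn_idx = parent.record("spawn_worker()")

# Script-style child: the whole file is one item, linked to the spawning cell.
child = ToyProcessList(
    "worker_script",
    parent_process_name=parent.name,
    parent_process_index=spawn_idx,
)
child.record("# entire contents of worker.py")
```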

Type Search

For each data type in a TableVault repository, you can search over all stored data items of that type. All query methods support common filtering parameters:

description_embedding: Embedding vector for similarity search on item descriptions
description_text: Text to search in item descriptions
code_text: Text to search in the process code that created/modified the items
filtered: List of list names to restrict the search to
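
Since the same four parameters recur across query methods, one way to reuse them is to collect only the filters that are actually set. This is a sketch: `build_common_filters` is a hypothetical helper, and it assumes the query methods take these parameters as keyword arguments, as the examples on this page do.

```python
def build_common_filters(description_embedding=None, description_text=None,
                         code_text=None, filtered=None):
    """Collect the common filter parameters that were actually set.

    Hypothetical helper; not part of the TableVault API.
    """
    candidates = {
        "description_embedding": description_embedding,
        "description_text": description_text,
        "code_text": code_text,
        "filtered": filtered,
    }
    # Drop unset filters so they are not passed to the query at all.
    return {name: value for name, value in candidates.items() if value is not None}


filters = build_common_filters(description_text="model", filtered=["trained_models"])
print(filters)
# prints {'description_text': 'model', 'filtered': ['trained_models']}
```

The resulting dictionary could then be unpacked into a query call, for example `vault.query_file_list(**filters)`.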

File List Queries

Query file items by their descriptions or the code that created them.

# Query all file items
results = vault.query_file_list()

# Query with description text filter
results = vault.query_file_list(description_text="classifier model")

# Query with code text filter (find files created by specific code)
results = vault.query_file_list(code_text="train_model")

# Query with description embedding similarity
results = vault.query_file_list(description_embedding=query_vector)

# Restrict search to specific file lists
results = vault.query_file_list(
    description_text="model",
    filtered=["trained_models", "checkpoints"]
)

Document List Queries

Query document items by their text content, descriptions, or creating code.

# Query by document text content
results = vault.query_document_list(document_text="accuracy")

# Query with multiple filters
results = vault.query_document_list(
    document_text="experiment",
    description_text="research",
    code_text="analyze"
)

# Restrict to specific document lists
results = vault.query_document_list(
    document_text="results",
    filtered=["research_notes", "logs"]
)

Embedding List Queries

Query embedding items by vector similarity. Both exact and approximate nearest-neighbor search are supported.

# Query by embedding similarity (exact search)
query_embedding = [0.1] * 1024  # placeholder; must match the list's dimensionality
results = vault.query_embedding_list(embedding=query_embedding)

# Use approximate search for faster results on large datasets
results = vault.query_embedding_list(
    embedding=query_embedding,
    use_approx=True
)

# Combine with description and code filters
results = vault.query_embedding_list(
    embedding=query_embedding,
    description_text="document embedding",
    code_text="encode_text"
)

# Restrict to specific embedding lists
results = vault.query_embedding_list(
    embedding=query_embedding,
    filtered=["document_embeddings", "image_embeddings"]
)
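
For intuition about what an exact search does, the sketch below scores a query against every stored vector and returns the best positions. It is not TableVault's implementation: the similarity metric TableVault uses is not stated here, so cosine similarity is an assumption, and `cosine_similarity` and `exact_search` are hypothetical names.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


def exact_search(query, stored, top_k=3):
    """Score every stored vector against the query and return the best positions."""
    scored = [(cosine_similarity(query, vec), pos) for pos, vec in enumerate(stored)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [pos for _, pos in scored[:top_k]]


stored = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
print(exact_search([1.0, 0.0], stored, top_k=2))
# prints [0, 2]
```

Exact search compares the query with every stored vector, so its cost grows linearly with the list size. Approximate search typically relies on a prebuilt index to avoid the full scan, trading some recall for speed.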

Record List Queries

Query record items by text content within the records.

# Query by record text content
results = vault.query_record_list(record_text="random_forest")

# Query with multiple filters
results = vault.query_record_list(
    record_text="accuracy",
    description_text="experiment",
    code_text="evaluate"
)

# Restrict to specific record lists
results = vault.query_record_list(
    record_text="xgboost",
    filtered=["experiment_results"]
)

Process List Queries

Query process items by code content, parent process code, or descriptions.

# Query by code text in process
results = vault.query_process_list(code_text="import pandas")

# Query by parent process code
results = vault.query_process_list(parent_code_text="spawn_worker")

# Query with description similarity
results = vault.query_process_list(
    description_embedding=query_vector,
    description_text="training pipeline"
)

# Combine multiple filters
results = vault.query_process_list(
    code_text="model.fit",
    parent_code_text="hyperparameter_search",
    filtered=["training_process_1", "training_process_2"]
)