TableReference
String Guide
TableReference
strings provide a powerful way to dynamically fetch and use data from other table instances, or the current table instance, directly within most string-based fields of your builder YAML files. This allows for highly dynamic and data-driven configurations.
Core Concept: The << ... >>
Wrapper
The fundamental syntax for a TableReference
string is to enclose the reference string within double angle brackets: << ... >>
.
The << ... >>
syntax can be applied to any singular string, boolean or numeric entry in your builder definition, such as values in the arguments
block, items in the changed_columns
list, or even the python_function
and code_module
names themselves.
The primary exception is the dependencies
field, which does not support this dynamic resolution (dependencies
inform the system which tables are loaded and must be resolved before TableReference
strings are parsed).
Example of broad usage in a builder YAML:
# In an IndexBuilder or ColumnBuilder
builder_type: IndexBuilder
python_function: "<<config_table.builder_function_name[use_case::'main_index']>>" # Dynamic function name
code_module: "<<config_table.builder_module_name>>" # Dynamic module name
changed_columns:
- "<<config_table.primary_output_column>>" # Dynamic column name
- "fixed_secondary_column"
- "<<self.another_dynamic_column_name_source[index]>>" # Column name derived from self
arguments:
source_file_path: "/data/raw/<<config_table.file_name[source_id::'source_A']>>.csv"
lookup_value: "<<lookup_table.value[key::<<self.current_key[index]>>]>>"
static_text_with_ref: "Report for ID: <<self.id[index]>>"
If a field value is entirely a TableReference
string (e.g., python_function: "<<config.func_name>>"
), the resolved value of the reference will be used directly. If the reference is part of a larger string (e.g., in arguments
), the resolved value will be converted to a string and substituted into place.
Anatomy of a TableReference
String
Inside the << ... >>
wrapper, a TableReference
string follows a specific structure to identify the table, an optional instance_id, specific columns, and optional filtering conditions:
All parts (instance_id, columns, conditions) are optional.
Components:
tableName
: The name of the table to query.(instance_id)
: (Optional) Specifies a particular instance of the table..{columns}
or.COLUMN
: (Optional) Selects specific columns.[conditions]
: (Optional) Filters the rows of the table.
1. Table Name (tableName
)
- Syntax: A string of alphanumeric characters, underscores (
_
), or hyphens (-
). -
Special Keyword
self
: The keywordself
refers to the current table instance being processed by the builder. -
Dynamic Table Name: The table name itself can be a nested
TableReference
string.- Example in a field:
code_module: "<< <<table_map.module_column[type::'etl']>> >>"
- Reference string example:
<< <<another_table.config_key[type::'source']>>.data_column >>
- Example in a field:
self
Keyword Restrictions
The self
keyword can only be used in the argument
field.
2. Version ((instance_id)
)
- Syntax: Instance in parentheses, e.g.,
(base_1748113624_d049944b-8548-46d2-a247-bbf3769fbadc)
. - Optional: If omitted, TableVault will typically use the latest available instance of the table based on its internal logic.
- Dynamic Version: The version string can be a nested
TableReference
.- Example reference string:
my_table(<<version_control_table.active_version[table_name::'my_table']>>)
- Example reference string:
3. Columns (.{columns}
or .COLUMN
)
- Syntax:
- Single Column: Preceded by a dot (
.
), e.g.,.user_id
. - Multiple Columns: Preceded by a dot (
.
) and enclosed in curly braces{}
, with column names separated by commas, e.g.,.{name,email,age}
.
- Single Column: Preceded by a dot (
- Optional:
- If omitted, and conditions are present, all columns are available for filtering, and the selected columns depend on the output simplification (see below).
- If omitted, and no conditions are present, the entire DataFrame (or its simplified form) is returned.
- Dynamic Column Names: Column names within the list (or the single column name) can be nested
TableReferences
. This is highly relevant for fields likechanged_columns
orprimary_key
.- Example in
changed_columns
:changed_columns: ["id", "<<config_table.main_data_field_name>>"]
- Example reference string for a column name:
my_table.<<config.target_column>>
- Example reference string with multiple dynamic columns:
my_table.{id,<<audit_table.tracked_field[user::'admin']>>,status}
- Example in
- Special Case:
self.index
: If you use<<self.index>>
this specifically resolves to the current row's physical index value during row-wise operations.
4. Conditions ([conditions]
)
- Syntax: Enclosed in square brackets
[...]
. Multiple conditions are separated by commas,
. - Optional: If omitted, all rows (of the selected version and columns) are considered.
- Each condition specifies a column to filter on and the criteria.
Condition Types:
-
Equality (
columnName::value
):- Filters rows where
columnName
equalsvalue
. - Example reference string:
orders.product_id[customer_id::'cust123',status::'shipped']
- The
value
is automatically quoted for string comparisons if not already quoted (e.g.,status::shipped
becomesstatus == 'shipped'
). Numerical values are used directly. value
can be a nested aTableReference
string:orders.items[user_id::<<user_table.id[username::'jdoe']>>]
- Filters rows where
-
Range (
columnName::start_value:end_value
):- Filters rows where
columnName
is greater than or equal tostart_value
AND less thanend_value
. - Example reference string:
events.timestamp[timestamp::'2023-01-01T00:00:00':'2023-01-01T23:59:59']
start_value
andend_value
can be literals or nestedTableReference
strings. Values are formatted appropriately for comparison based on the column's data type.
- Filters rows where
-
Implicit Index/Contextual Value (
columnName
):- Filters rows where
columnName
equals a contextually providedindex
value (the physical index of the row currently being processed by the builder). - Example: If processing row
101
ofself
, then the reference<<other_table.data_column[join_key_in_other_table]>>
(within an argument) would attempt to find rows inother_table
wherejoin_key_in_other_table == 101
. - This is particularly useful for lookups related to the current item in
self
. - If
self.some_column[key_column]
is used, andindex
is defined, it impliesself.some_column
wherekey_column == index
.
- Filters rows where
-
Dynamic Keys and Values: All parts of a condition (the column name, the value, start/end values) can be nested
TableReference
strings.- Example reference string:
my_table[<<config.filter_column>>::<<config.filter_value>>]
- Example reference string:
index
Condition
The index
keyword can only be used in the arguments
key of a row-wise function (when row-wise
is set to true
).
Nested References
As shown in examples above, any component of a TableReference
string —the table name, version string, column names, condition keys, or condition values—can itself be another TableReference
stringenclosed in << ... >>
. TableVault will resolve the innermost references first and use their results to construct the outer reference before resolving it.
Complex Example (from code, used in an argument):
<<stories.artifact_name[paper_name::<<self.paper_name[index]>>]>>
-
<<self.paper_name[index]>>
: Resolves first. It fetches thepaper_name
from the current row (index
) of theself
table. -
Let's say the above yields
'my_research_paper'
. -
The outer reference becomes:
<<stories.artifact_name[paper_name::'my_research_paper']>>
. -
This then fetches
artifact_name
from thestories
table wherepaper_name
is'my_research_paper'
.
Resolution and Data Retrieval
- When a builder is executed, TableVault parses these reference strings from the relevant YAML fields.
- It uses an internal cache of DataFrames (for already loaded tables and versions) to retrieve data efficiently.
- For references involving
self
or implicit index conditions, the context of the current row being processed (often an integerindex
) is crucial for resolving the correct data. - The recursive parsing handles references within lists, dictionaries, and other nested structures in the YAML, as long as they ultimately resolve to strings or collections of strings where references are found.
Examples of Reference Strings
These examples illustrate the reference string syntax itself. These strings would be placed inside << >>
within a suitable YAML field.
-
Fetch a single column from another table:
my_data_table.user_email
- Result: A dataframe with a single
user_email
column (might be converted to a list contextually).
- Result: A dataframe with a single
-
Get a specific value using a filter:
users_table.full_name[user_id::'user-007']
- Result: A single string representing
full_name
foruser_id
.
- Result: A single string representing
-
Get a value from
self
based on the current row's context (implicit index):self.status[id_column_of_self]
- Result: A single boolean representing
self.status
whereid_column_of_self == index
.
- Result: A single boolean representing
-
Reference with a specific version:
app_settings(base_1748275064_5782ef5b-4023-4618-a419-cf921c365c64).timeout_ms
- Result: A dataframe with a single
user_email
column that is from thebase_1748275064_5782ef5b-4023-4618-a419-cf921c365c64
instance.
- Result: A dataframe with a single
-
Using a range condition:
transactions.amount[transaction_date::'2024-01-01':'2024-01-31']
- Result: A dataframe with a single
amount
column with transactions between2024-01-01
and2024-01-31
(if properly sorted). - Note the range is calculated by the physical index.
- Result: A dataframe with a single
-
Nested reference for dynamic filtering:
preferences.setting_value[user_id::<<self.user_identifier[index]>>, setting_key::'theme']
- Result: A single string representing
setting_value
foruser_id
at current index with thetheme
key.
- Result: A single string representing
Error Handling
If a TableReference
string is malformed (e.g., unbalanced brackets, illegal characters) or if a reference cannot be resolved at runtime (e.g., table not found, column missing, nested reference fails), a TableReferenceError
will typically be raised, halting the builder process. Ensure your references are correct and the data they point to exists.