TableOptimizer
deltalake.table.TableOptimizer
API for various table optimization commands.
compact
compact(partition_filters: Optional[FilterConjunctionType] = None, target_size: Optional[int] = None, max_concurrent_tasks: Optional[int] = None, min_commit_interval: Optional[Union[int, timedelta]] = None, writer_properties: Optional[WriterProperties] = None, custom_metadata: Optional[Dict[str, str]] = None, post_commithook_properties: Optional[PostCommitHookProperties] = None, commit_properties: Optional[CommitProperties] = None) -> Dict[str, Any]
Compacts small files to reduce the total number of files in the table.
This operation is idempotent; if it is run twice on the same table (assuming the table has not been updated in between), the second run does nothing.
If this operation runs concurrently with any operation other than an append, it will fail.
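When the table is partitioned, compaction can be scoped to just the partitions that need it. A minimal sketch, assuming a hypothetical table partitioned by a `date` string column and the tuple-based `(column, operator, value)` partition filter syntax used elsewhere in deltalake; the path, column name, and target size are illustrative only:

```python
from deltalake import DeltaTable

# Hypothetical table, partitioned by a "date" string column.
dt = DeltaTable("tmp/partitioned_table")

# Compact only the files in a single partition, and bin-pack them toward
# ~256 MiB files instead of the table's configured target size.
metrics = dt.optimize.compact(
    partition_filters=[("date", "=", "2024-01-01")],  # assumed filter syntax
    target_size=256 * 1024 * 1024,
)
print(metrics["numFilesAdded"], metrics["numFilesRemoved"])
```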
Parameters:
Name | Type | Description | Default |
---|---|---|---|
partition_filters | Optional[FilterConjunctionType] | The partition filters used to select the files to compact. | None |
target_size | Optional[int] | Desired file size after bin-packing files, in bytes. If not provided, the table configuration value is used. | None |
max_concurrent_tasks | Optional[int] | Maximum number of concurrent tasks to use for file compaction. Defaults to the number of CPUs. More concurrent tasks can make compaction faster but also use more memory. | None |
min_commit_interval | Optional[Union[int, timedelta]] | Minimum interval, in seconds or as a timedelta, before a new commit is created. Useful for long-running executions. Set to 0 or timedelta(0) for a commit per partition. | None |
writer_properties | Optional[WriterProperties] | Writer properties passed to the Rust parquet writer. | None |
custom_metadata | Optional[Dict[str, str]] | Deprecated and will be removed in a future version; use commit_properties instead. | None |
post_commithook_properties | Optional[PostCommitHookProperties] | Properties for the post-commit hook. If None, default values are used. | None |
commit_properties | Optional[CommitProperties] | Properties of the transaction commit. If None, default values are used. | None |
Returns:
Type | Description |
---|---|
Dict[str, Any] | The metrics from the optimize operation. |
Example
Use a timedelta object to specify the seconds, minutes or hours of the interval.
>>> from deltalake import DeltaTable, write_deltalake
>>> from datetime import timedelta
>>> import pyarrow as pa
>>> write_deltalake("tmp", pa.table({"x": [1], "y": [4]}))
>>> write_deltalake("tmp", pa.table({"x": [2], "y": [5]}), mode="append")
>>> dt = DeltaTable("tmp")
>>> time_delta = timedelta(minutes=10)
>>> dt.optimize.compact(min_commit_interval=time_delta)
{'numFilesAdded': 1, 'numFilesRemoved': 2, 'filesAdded': ..., 'filesRemoved': ..., 'partitionsOptimized': 1, 'numBatches': 2, 'totalConsideredFiles': 2, 'totalFilesSkipped': 0, 'preserveInsertionOrder': True}
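Because custom_metadata is deprecated, extra commit metadata is supplied through commit_properties, and parquet encoding is tuned through writer_properties. A hedged sketch, assuming CommitProperties accepts a custom_metadata mapping, WriterProperties accepts a compression codec name, and both classes are importable from the top-level deltalake package; check your deltalake version for the exact constructor arguments:

```python
from deltalake import CommitProperties, DeltaTable, WriterProperties

dt = DeltaTable("tmp")

# Assumption: CommitProperties(custom_metadata=...) attaches key/value metadata
# to the optimize commit, and WriterProperties(compression=...) selects the
# parquet codec used when rewriting files.
metrics = dt.optimize.compact(
    writer_properties=WriterProperties(compression="ZSTD"),
    commit_properties=CommitProperties(custom_metadata={"job": "nightly-compaction"}),
)
```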
z_order
z_order(columns: Iterable[str], partition_filters: Optional[FilterConjunctionType] = None, target_size: Optional[int] = None, max_concurrent_tasks: Optional[int] = None, max_spill_size: int = 20 * 1024 * 1024 * 1024, min_commit_interval: Optional[Union[int, timedelta]] = None, writer_properties: Optional[WriterProperties] = None, custom_metadata: Optional[Dict[str, str]] = None, post_commithook_properties: Optional[PostCommitHookProperties] = None, commit_properties: Optional[CommitProperties] = None) -> Dict[str, Any]
Reorders the data using a Z-order curve to improve data skipping.
This also performs compaction, so the same parameters as compact() apply.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
columns | Iterable[str] | The columns to use for Z-ordering. There must be at least one column. | required |
partition_filters | Optional[FilterConjunctionType] | The partition filters used to select the files to optimize. | None |
target_size | Optional[int] | Desired file size after bin-packing files, in bytes. If not provided, the table configuration value is used. | None |
max_concurrent_tasks | Optional[int] | Maximum number of concurrent tasks to use for file compaction. Defaults to the number of CPUs. More concurrent tasks can make compaction faster but also use more memory. | None |
max_spill_size | int | Maximum number of bytes allowed in memory before spilling to disk. Defaults to 20 GiB. | 20 * 1024 * 1024 * 1024 |
min_commit_interval | Optional[Union[int, timedelta]] | Minimum interval, in seconds or as a timedelta, before a new commit is created. Useful for long-running executions. Set to 0 or timedelta(0) for a commit per partition. | None |
writer_properties | Optional[WriterProperties] | Writer properties passed to the Rust parquet writer. | None |
custom_metadata | Optional[Dict[str, str]] | Deprecated and will be removed in a future version; use commit_properties instead. | None |
post_commithook_properties | Optional[PostCommitHookProperties] | Properties for the post-commit hook. If None, default values are used. | None |
commit_properties | Optional[CommitProperties] | Properties of the transaction commit. If None, default values are used. | None |
Returns:
Type | Description |
---|---|
Dict[str, Any] | The metrics from the optimize operation. |
Example
Use a timedelta object to specify the seconds, minutes or hours of the interval.
>>> from deltalake import DeltaTable, write_deltalake
>>> from datetime import timedelta
>>> import pyarrow as pa
>>> write_deltalake("tmp", pa.table({"x": [1], "y": [4]}))
>>> write_deltalake("tmp", pa.table({"x": [2], "y": [5]}), mode="append")
>>> dt = DeltaTable("tmp")
>>> time_delta = timedelta(minutes=10)
>>> dt.optimize.z_order(["x"], min_commit_interval=time_delta)
{'numFilesAdded': 1, 'numFilesRemoved': 2, 'filesAdded': ..., 'filesRemoved': ..., 'partitionsOptimized': 0, 'numBatches': 1, 'totalConsideredFiles': 2, 'totalFilesSkipped': 0, 'preserveInsertionOrder': True}
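Z-ordering pays off most when queries filter on several columns at once; columns is the only required argument. A minimal sketch reusing the small "tmp" table from the example above; the target size is an arbitrary illustrative value:

```python
from deltalake import DeltaTable

dt = DeltaTable("tmp")

# Cluster the data by both columns so each rewritten file carries tight
# min/max statistics on x and y, improving data skipping for filters on either.
metrics = dt.optimize.z_order(
    ["x", "y"],
    target_size=128 * 1024 * 1024,  # optional: bin-pack toward ~128 MiB files
)
print(metrics["numFilesAdded"], metrics["numFilesRemoved"])
```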