TableOptimizer
deltalake.table.TableOptimizer
API for table optimization commands.
compact
```python
compact(
    partition_filters: FilterConjunctionType | None = None,
    target_size: int | None = None,
    max_concurrent_tasks: int | None = None,
    max_spill_size: int | None = None,
    max_temp_directory_size: int | None = None,
    min_commit_interval: int | timedelta | None = None,
    writer_properties: WriterProperties | None = None,
    *args: Any,
    commit_properties: CommitProperties | None = None,
    post_commithook_properties: PostCommitHookProperties | None = None,
) -> dict[str, Any]
```
Compacts small files to reduce read overhead.
This operation is idempotent; if run twice on the same table (assuming it has not been updated) it will do nothing the second time.
Compaction preserves file order within each partition, and the target file size is approximate. If this operation happens concurrently with any operation other than append, it will fail.
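Because compaction is idempotent and can be scoped by partition filters, it is safe to re-run on a schedule or to target a single partition. A minimal sketch, assuming a hypothetical table partitioned by a `date` column (path and data are illustrative):

```python
from deltalake import DeltaTable, write_deltalake
import pyarrow as pa

# Build a small partitioned table with two files in the 2024-01-01 partition.
write_deltalake(
    "tmp_partitioned",
    pa.table({"date": ["2024-01-01", "2024-01-02"], "y": [1, 2]}),
    partition_by=["date"],
)
write_deltalake(
    "tmp_partitioned",
    pa.table({"date": ["2024-01-01"], "y": [3]}),
    mode="append",
)

dt = DeltaTable("tmp_partitioned")
# Compact only files in the 2024-01-01 partition; other partitions are
# left untouched.
metrics = dt.optimize.compact(partition_filters=[("date", "=", "2024-01-01")])
print(metrics["numFilesAdded"], metrics["numFilesRemoved"])
```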
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `partition_filters` | `FilterConjunctionType \| None` | Partition filters used to match files to compact. | `None` |
| `target_size` | `int \| None` | Approximate target file size in bytes. If not provided, the table's configured target file size is used. | `None` |
| `max_concurrent_tasks` | `int \| None` | Maximum number of concurrent tasks to use for file compaction. Defaults to the number of CPUs. More concurrent tasks can make compaction faster but also use more memory. | `None` |
| `max_spill_size` | `int \| None` | Maximum number of bytes allowed in memory before spilling to disk. If not specified, uses DataFusion's default. | `None` |
| `max_temp_directory_size` | `int \| None` | Maximum disk space for temporary spill files. If not specified, uses DataFusion's default. | `None` |
| `min_commit_interval` | `int \| timedelta \| None` | Minimum interval, in seconds or as a `timedelta`, before a new commit is created. Useful for long-running executions. Set to `0` or `timedelta(0)` for a commit per partition. | `None` |
| `writer_properties` | `WriterProperties \| None` | Writer properties passed to the Rust Parquet writer. | `None` |
| `commit_properties` | `CommitProperties \| None` | Properties of the transaction commit. If `None`, default values are used. | `None` |
| `post_commithook_properties` | `PostCommitHookProperties \| None` | Properties for the post-commit hook. If `None`, default values are used. | `None` |
Returns:
| Type | Description |
|---|---|
| `dict[str, Any]` | Dictionary of optimize metrics (see example below). |
Example
Use a timedelta object to specify the commit interval in seconds, minutes, or hours.

```python
from deltalake import DeltaTable, write_deltalake
from datetime import timedelta
import pyarrow as pa

write_deltalake("tmp", pa.table({"x": [1], "y": [4]}))
write_deltalake("tmp", pa.table({"x": [2], "y": [5]}), mode="append")

dt = DeltaTable("tmp")
time_delta = timedelta(minutes=10)
dt.optimize.compact(min_commit_interval=time_delta)
# {'numFilesAdded': 1, 'numFilesRemoved': 2, 'filesAdded': ..., 'filesRemoved': ...,
#  'partitionsOptimized': 1, 'numBatches': 2, 'totalConsideredFiles': 2,
#  'totalFilesSkipped': 0, 'plannerStrategy': 'preserveLocality',
#  'preservedStableOrder': True, 'preserveInsertionOrder': True, 'maxBinSpanFiles': 2}
```
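The Parquet encoding of the rewritten files can be tuned via `writer_properties`. A sketch, where the compression choice and level are illustrative rather than recommended settings:

```python
from deltalake import DeltaTable, WriterProperties

dt = DeltaTable("tmp")
# Rewrite compacted files with ZSTD compression; the level is an
# illustrative choice.
dt.optimize.compact(
    writer_properties=WriterProperties(compression="ZSTD", compression_level=4)
)
```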
z_order
```python
z_order(
    columns: Iterable[str],
    partition_filters: FilterConjunctionType | None = None,
    target_size: int | None = None,
    max_concurrent_tasks: int | None = None,
    max_spill_size: int | None = None,
    max_temp_directory_size: int | None = None,
    min_commit_interval: int | timedelta | None = None,
    writer_properties: WriterProperties | None = None,
    *args: Any,
    commit_properties: CommitProperties | None = None,
    post_commithook_properties: PostCommitHookProperties | None = None,
) -> dict[str, Any]
```
Reorders the data using a Z-order curve to improve data skipping.
This also performs compaction, so the same parameters as compact() apply. Unlike compact(), Z-ordering does not preserve the existing file order; rows are reordered across files.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `columns` | `Iterable[str]` | Columns to use for Z-ordering. There must be at least one column. | required |
| `partition_filters` | `FilterConjunctionType \| None` | Partition filters used to match files to optimize. | `None` |
| `target_size` | `int \| None` | Approximate target file size in bytes. If not provided, the table's configured target file size is used. | `None` |
| `max_concurrent_tasks` | `int \| None` | Maximum number of concurrent tasks to use for file compaction. Defaults to the number of CPUs. More concurrent tasks can make compaction faster but also use more memory. | `None` |
| `max_spill_size` | `int \| None` | Maximum number of bytes allowed in memory before spilling to disk. If not specified, uses DataFusion's default. | `None` |
| `max_temp_directory_size` | `int \| None` | Maximum disk space for temporary spill files. If not specified, uses DataFusion's default. | `None` |
| `min_commit_interval` | `int \| timedelta \| None` | Minimum interval, in seconds or as a `timedelta`, before a new commit is created. Useful for long-running executions. Set to `0` or `timedelta(0)` for a commit per partition. | `None` |
| `writer_properties` | `WriterProperties \| None` | Writer properties passed to the Rust Parquet writer. | `None` |
| `commit_properties` | `CommitProperties \| None` | Properties of the transaction commit. If `None`, default values are used. | `None` |
| `post_commithook_properties` | `PostCommitHookProperties \| None` | Properties for the post-commit hook. If `None`, default values are used. | `None` |
Returns:
| Type | Description |
|---|---|
| `dict[str, Any]` | Dictionary of optimize metrics (see example below). |
Example
Use a timedelta object to specify the commit interval in seconds, minutes, or hours.

```python
from deltalake import DeltaTable, write_deltalake
from datetime import timedelta
import pyarrow as pa

write_deltalake("tmp", pa.table({"x": [1], "y": [4]}))
write_deltalake("tmp", pa.table({"x": [2], "y": [5]}), mode="append")

dt = DeltaTable("tmp")
time_delta = timedelta(minutes=10)
dt.optimize.z_order(["x"], min_commit_interval=time_delta)
# {'numFilesAdded': 1, 'numFilesRemoved': 2, 'filesAdded': ..., 'filesRemoved': ...,
#  'partitionsOptimized': 0, 'numBatches': 1, 'totalConsideredFiles': 2,
#  'totalFilesSkipped': 0, 'plannerStrategy': 'zOrder',
#  'preservedStableOrder': False, 'preserveInsertionOrder': False, 'maxBinSpanFiles': 2}
```
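Z-ordering pays off most when several columns are commonly filtered together. A short sketch (the table path and column names are illustrative):

```python
from deltalake import DeltaTable, write_deltalake
import pyarrow as pa

write_deltalake("tmp_zorder", pa.table({"x": [1, 2, 3, 4], "y": [4, 3, 2, 1]}))

dt = DeltaTable("tmp_zorder")
# Cluster rows along both columns so file-level min/max statistics stay
# selective for queries that filter on x, y, or both.
dt.optimize.z_order(["x", "y"])
```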