TableOptimizer
deltalake.table.TableOptimizer
API for table optimization commands.
compact
```python
compact(
    partition_filters: FilterConjunctionType | None = None,
    target_size: int | None = None,
    max_concurrent_tasks: int | None = None,
    max_spill_size: int | None = None,
    max_temp_directory_size: int | None = None,
    min_commit_interval: int | timedelta | None = None,
    writer_properties: WriterProperties | None = None,
    *args: Any,
    commit_properties: CommitProperties | None = None,
    post_commithook_properties: PostCommitHookProperties | None = None,
) -> dict[str, Any]
```
Compacts small files to reduce read overhead.
This operation is idempotent; if run twice on the same table (assuming it has not been updated) it will do nothing the second time.
Compaction preserves file order within each partition, and the target file size is approximate. If this operation happens concurrently with any operation other than append, it will fail.
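Because compaction is idempotent and can be scoped by partition filters, it is safe to re-run on a schedule or to target a single partition. A minimal sketch, assuming a hypothetical table partitioned by a `date` column (path and data are illustrative):

```python
from deltalake import DeltaTable, write_deltalake
import pyarrow as pa

# Build a small partitioned table with two files in the 2024-01-01 partition.
write_deltalake(
    "tmp_partitioned",
    pa.table({"date": ["2024-01-01", "2024-01-02"], "y": [1, 2]}),
    partition_by=["date"],
)
write_deltalake(
    "tmp_partitioned",
    pa.table({"date": ["2024-01-01"], "y": [3]}),
    mode="append",
)

dt = DeltaTable("tmp_partitioned")
# Compact only files in the 2024-01-01 partition; other partitions are
# left untouched.
metrics = dt.optimize.compact(partition_filters=[("date", "=", "2024-01-01")])
print(metrics["numFilesAdded"], metrics["numFilesRemoved"])
```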
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `partition_filters` | `FilterConjunctionType \| None` | Partition filters used to match files to compact. | `None` |
| `target_size` | `int \| None` | Approximate target file size in bytes. If not provided, the table's configured target file size is used. | `None` |
| `max_concurrent_tasks` | `int \| None` | Maximum number of concurrent tasks to use for file compaction. Defaults to the number of CPUs. More concurrent tasks can make compaction faster but also use more memory. | `None` |
| `max_spill_size` | `int \| None` | Maximum number of bytes allowed in memory before spilling to disk. If not specified, uses DataFusion's default. | `None` |
| `max_temp_directory_size` | `int \| None` | Maximum disk space for temporary spill files. If not specified, uses DataFusion's default. | `None` |
| `min_commit_interval` | `int \| timedelta \| None` | Minimum interval, in seconds or as a `timedelta`, before a new commit is created. Useful for long-running executions. Set to `0` or `timedelta(0)` for a commit per partition. | `None` |
| `writer_properties` | `WriterProperties \| None` | Writer properties passed to the Rust Parquet writer. | `None` |
| `commit_properties` | `CommitProperties \| None` | Properties of the transaction commit. If `None`, default values are used. | `None` |
| `post_commithook_properties` | `PostCommitHookProperties \| None` | Properties for the post-commit hook. If `None`, default values are used. | `None` |
Returns:
| Type | Description |
|---|---|
| `dict[str, Any]` | Dictionary of optimize metrics (see example below). |
Example
Use a timedelta object to specify the commit interval in seconds, minutes, or hours.

```python
from deltalake import DeltaTable, write_deltalake
from datetime import timedelta
import pyarrow as pa

write_deltalake("tmp", pa.table({"x": [1], "y": [4]}))
write_deltalake("tmp", pa.table({"x": [2], "y": [5]}), mode="append")

dt = DeltaTable("tmp")
time_delta = timedelta(minutes=10)
dt.optimize.compact(min_commit_interval=time_delta)
# {'numFilesAdded': 1, 'numFilesRemoved': 2, 'filesAdded': ..., 'filesRemoved': ...,
#  'partitionsOptimized': 1, 'numBatches': 2, 'totalConsideredFiles': 2,
#  'totalFilesSkipped': 0, 'plannerStrategy': 'preserveLocality',
#  'preservedStableOrder': True, 'preserveInsertionOrder': True, 'maxBinSpanFiles': 2}
```
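The Parquet encoding of the rewritten files can be tuned via `writer_properties`. A sketch, where the compression choice and level are illustrative rather than recommended settings:

```python
from deltalake import DeltaTable, WriterProperties

dt = DeltaTable("tmp")
# Rewrite compacted files with ZSTD compression; the level is an
# illustrative choice.
dt.optimize.compact(
    writer_properties=WriterProperties(compression="ZSTD", compression_level=4)
)
```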
z_order
```python
z_order(
    columns: Iterable[str],
    partition_filters: FilterConjunctionType | None = None,
    target_size: int | None = None,
    max_concurrent_tasks: int | None = None,
    max_spill_size: int | None = None,
    max_temp_directory_size: int | None = None,
    min_commit_interval: int | timedelta | None = None,
    writer_properties: WriterProperties | None = None,
    *args: Any,
    commit_properties: CommitProperties | None = None,
    post_commithook_properties: PostCommitHookProperties | None = None,
) -> dict[str, Any]
```
Reorders the data using a Z-order curve to improve data skipping.
This also performs compaction, so the same parameters as compact() apply. Unlike compact(), Z-ordering does not preserve the existing file order; rows are reordered across files.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `columns` | `Iterable[str]` | Columns to use for Z-ordering. There must be at least one column. | required |
| `partition_filters` | `FilterConjunctionType \| None` | Partition filters used to match files to optimize. | `None` |
| `target_size` | `int \| None` | Approximate target file size in bytes. If not provided, the table's configured target file size is used. | `None` |
| `max_concurrent_tasks` | `int \| None` | Maximum number of concurrent tasks to use for file compaction. Defaults to the number of CPUs. More concurrent tasks can make compaction faster but also use more memory. | `None` |
| `max_spill_size` | `int \| None` | Maximum number of bytes allowed in memory before spilling to disk. If not specified, uses DataFusion's default. | `None` |
| `max_temp_directory_size` | `int \| None` | Maximum disk space for temporary spill files. If not specified, uses DataFusion's default. | `None` |
| `min_commit_interval` | `int \| timedelta \| None` | Minimum interval, in seconds or as a `timedelta`, before a new commit is created. Useful for long-running executions. Set to `0` or `timedelta(0)` for a commit per partition. | `None` |
| `writer_properties` | `WriterProperties \| None` | Writer properties passed to the Rust Parquet writer. | `None` |
| `commit_properties` | `CommitProperties \| None` | Properties of the transaction commit. If `None`, default values are used. | `None` |
| `post_commithook_properties` | `PostCommitHookProperties \| None` | Properties for the post-commit hook. If `None`, default values are used. | `None` |
Returns:
| Type | Description |
|---|---|
| `dict[str, Any]` | Dictionary of optimize metrics (see example below). |
Example
Use a timedelta object to specify the commit interval in seconds, minutes, or hours.

```python
from deltalake import DeltaTable, write_deltalake
from datetime import timedelta
import pyarrow as pa

write_deltalake("tmp", pa.table({"x": [1], "y": [4]}))
write_deltalake("tmp", pa.table({"x": [2], "y": [5]}), mode="append")

dt = DeltaTable("tmp")
time_delta = timedelta(minutes=10)
dt.optimize.z_order(["x"], min_commit_interval=time_delta)
# {'numFilesAdded': 1, 'numFilesRemoved': 2, 'filesAdded': ..., 'filesRemoved': ...,
#  'partitionsOptimized': 0, 'numBatches': 1, 'totalConsideredFiles': 2,
#  'totalFilesSkipped': 0, 'plannerStrategy': 'zOrder',
#  'preservedStableOrder': False, 'preserveInsertionOrder': False, 'maxBinSpanFiles': 2}
```
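Z-ordering pays off most when several columns are commonly filtered together. A short sketch (the table path and column names are illustrative):

```python
from deltalake import DeltaTable, write_deltalake
import pyarrow as pa

write_deltalake("tmp_zorder", pa.table({"x": [1, 2, 3, 4], "y": [4, 3, 2, 1]}))

dt = DeltaTable("tmp_zorder")
# Cluster rows along both columns so file-level min/max statistics stay
# selective for queries that filter on x, y, or both.
dt.optimize.z_order(["x", "y"])
```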