TableOptimizer

deltalake.table.TableOptimizer

TableOptimizer(table: DeltaTable)

API for various table optimization commands.
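
A TableOptimizer is typically not constructed directly; it is reached through the optimize attribute of an existing DeltaTable (as the examples below do). A minimal sketch, where the table path is a placeholder:

from deltalake import DeltaTable

# Open an existing table; "path/to/table" is a placeholder.
dt = DeltaTable("path/to/table")

# dt.optimize is a TableOptimizer bound to this table; its methods
# (compact and z_order) rewrite data files and commit the result.
metrics = dt.optimize.compact()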

compact

compact(partition_filters: Optional[FilterType] = None, target_size: Optional[int] = None, max_concurrent_tasks: Optional[int] = None, min_commit_interval: Optional[Union[int, timedelta]] = None, writer_properties: Optional[WriterProperties] = None, custom_metadata: Optional[Dict[str, str]] = None) -> Dict[str, Any]

Compacts small files to reduce the total number of files in the table.

This operation is idempotent; if run twice on the same table (assuming it has not been updated), it will do nothing the second time.

If this operation happens concurrently with any operations other than append, it will fail.

Parameters:

partition_filters (Optional[FilterType], default: None)
    The partition filters that will be used for getting the matched files.

target_size (Optional[int], default: None)
    Desired file size after bin-packing files, in bytes. If not provided, the table configuration value delta.targetFileSize is read; if that is not set either, a default of 256 MB is used.

max_concurrent_tasks (Optional[int], default: None)
    The maximum number of concurrent tasks to use for file compaction. Defaults to the number of CPUs. More concurrent tasks can make compaction faster, but will also use more memory.

min_commit_interval (Optional[Union[int, timedelta]], default: None)
    Minimum interval, in seconds or as a timedelta, before a new commit is created. Useful for long-running executions. Set to 0 or timedelta(0) for a commit per partition.

writer_properties (Optional[WriterProperties], default: None)
    Writer properties passed to the Rust parquet writer.

custom_metadata (Optional[Dict[str, str]], default: None)
    Custom metadata that will be added to the transaction commit.

Returns:

Dict[str, Any]
    The metrics from optimize.

Example

Use a timedelta object to specify the interval in seconds, minutes, or hours.

from deltalake import DeltaTable, write_deltalake
from datetime import timedelta
import pyarrow as pa

write_deltalake("tmp", pa.table({"x": [1], "y": [4]}))
write_deltalake("tmp", pa.table({"x": [2], "y": [5]}), mode="append")

dt = DeltaTable("tmp")
time_delta = timedelta(minutes=10)
dt.optimize.compact(min_commit_interval=time_delta)
# Returns the optimize metrics, for example:
{'numFilesAdded': 1, 'numFilesRemoved': 2, 'filesAdded': ..., 'filesRemoved': ..., 'partitionsOptimized': 1, 'numBatches': 2, 'totalConsideredFiles': 2, 'totalFilesSkipped': 0, 'preserveInsertionOrder': True}
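
The partition_filters and target_size parameters can restrict compaction to part of a partitioned table. A minimal sketch, assuming a table partitioned by a "date" column and DNF-style (column, op, value) filter tuples; the path and filter value are placeholders:

from deltalake import DeltaTable

# Placeholder path to a table partitioned by "date".
dt = DeltaTable("tmp_partitioned")

# Compact only the files in a single partition, aiming for roughly 128 MB output files.
metrics = dt.optimize.compact(
    partition_filters=[("date", "=", "2021-01-01")],
    target_size=128 * 1024 * 1024,
)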

z_order

z_order(columns: Iterable[str], partition_filters: Optional[FilterType] = None, target_size: Optional[int] = None, max_concurrent_tasks: Optional[int] = None, max_spill_size: int = 20 * 1024 * 1024 * 1024, min_commit_interval: Optional[Union[int, timedelta]] = None, writer_properties: Optional[WriterProperties] = None, custom_metadata: Optional[Dict[str, str]] = None) -> Dict[str, Any]

Reorders the data using a Z-order curve to improve data skipping.

This also performs compaction, so the same parameters as compact() apply.

Parameters:

columns (Iterable[str], required)
    The columns to use for Z-ordering. There must be at least one column.

partition_filters (Optional[FilterType], default: None)
    The partition filters that will be used for getting the matched files.

target_size (Optional[int], default: None)
    Desired file size after bin-packing files, in bytes. If not provided, the table configuration value delta.targetFileSize is read; if that is not set either, a default of 256 MB is used.

max_concurrent_tasks (Optional[int], default: None)
    The maximum number of concurrent tasks to use for file compaction. Defaults to the number of CPUs. More concurrent tasks can make compaction faster, but will also use more memory.

max_spill_size (int, default: 20 * 1024 * 1024 * 1024)
    The maximum number of bytes to spill to disk. Defaults to 20 GB.

min_commit_interval (Optional[Union[int, timedelta]], default: None)
    Minimum interval, in seconds or as a timedelta, before a new commit is created. Useful for long-running executions. Set to 0 or timedelta(0) for a commit per partition.

writer_properties (Optional[WriterProperties], default: None)
    Writer properties passed to the Rust parquet writer.

custom_metadata (Optional[Dict[str, str]], default: None)
    Custom metadata that will be added to the transaction commit.

Returns:

Dict[str, Any]
    The metrics from optimize.

Example

Use a timedelta object to specify the interval in seconds, minutes, or hours.

from deltalake import DeltaTable, write_deltalake
from datetime import timedelta
import pyarrow as pa

write_deltalake("tmp", pa.table({"x": [1], "y": [4]}))
write_deltalake("tmp", pa.table({"x": [2], "y": [5]}), mode="append")

dt = DeltaTable("tmp")
time_delta = timedelta(minutes=10)
dt.optimize.z_order(["x"], min_commit_interval=time_delta)
# Returns the optimize metrics, for example:
{'numFilesAdded': 1, 'numFilesRemoved': 2, 'filesAdded': ..., 'filesRemoved': ..., 'partitionsOptimized': 0, 'numBatches': 1, 'totalConsideredFiles': 2, 'totalFilesSkipped': 0, 'preserveInsertionOrder': True}
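
Z-ordering on several columns is most useful when queries filter on those columns together. A minimal sketch, reusing the small table created above:

dt = DeltaTable("tmp")

# Cluster the data by both columns so files can be skipped when queries
# filter on x and y together.
metrics = dt.optimize.z_order(["x", "y"])
print(metrics["numFilesAdded"], metrics["numFilesRemoved"])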