
TableOptimizer

deltalake.table.TableOptimizer

TableOptimizer(table: DeltaTable)

API for table optimization commands.

compact

compact(partition_filters: FilterConjunctionType | None = None, target_size: int | None = None, max_concurrent_tasks: int | None = None, max_spill_size: int | None = None, max_temp_directory_size: int | None = None, min_commit_interval: int | timedelta | None = None, writer_properties: WriterProperties | None = None, *args: Any, commit_properties: CommitProperties | None = None, post_commithook_properties: PostCommitHookProperties | None = None) -> dict[str, Any]

Compacts small files to reduce read overhead.

This operation is idempotent: if run twice on the same table (assuming it has not been updated in between), the second run does nothing.

Compaction preserves file order within each partition, and the target file size is approximate rather than exact.

If this operation runs concurrently with any operation other than an append, it will fail.
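To give some intuition for how `target_size` guides the rewrite, here is a minimal, hypothetical sketch of greedy bin packing over file sizes. It is not the library's actual planner, just an illustration of grouping small files into bins of roughly the target size:

```python
def plan_compaction_bins(file_sizes: list[int], target_size: int) -> list[list[int]]:
    """Greedily group small files into bins of roughly target_size bytes.

    Files at or above the target are left alone; the rest are packed
    into bins so each rewritten file lands near the target size.
    """
    bins: list[list[int]] = []
    current: list[int] = []
    current_total = 0
    for size in sorted(file_sizes):
        if size >= target_size:
            continue  # already large enough; no need to rewrite
        if current and current_total + size > target_size:
            bins.append(current)
            current, current_total = [], 0
        current.append(size)
        current_total += size
    if current:
        bins.append(current)
    return bins

# Six 30 MB files with a 100 MB target pack into two bins of three files each.
bins = plan_compaction_bins([30_000_000] * 6, 100_000_000)
print(len(bins))  # 2
```

This is also why the resulting files are only approximately `target_size`: the planner works with whole input files and cannot split them to hit the target exactly.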

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| partition_filters | FilterConjunctionType \| None | Partition filters used to match files. | None |
| target_size | int \| None | Approximate target file size in bytes. If not provided, uses the delta.targetFileSize table property or 100 MB. | None |
| max_concurrent_tasks | int \| None | Maximum number of concurrent tasks to use for file compaction. Defaults to the number of CPUs. More concurrent tasks can make compaction faster but also use more memory. | None |
| max_spill_size | int \| None | Maximum number of bytes allowed in memory before spilling to disk. If not specified, uses DataFusion's default. | None |
| max_temp_directory_size | int \| None | Maximum disk space for temporary spill files. If not specified, uses DataFusion's default. | None |
| min_commit_interval | int \| timedelta \| None | Minimum interval, in seconds or as a timedelta, before a new commit is created. Useful for long-running executions. Set to 0 or timedelta(0) for one commit per partition. | None |
| writer_properties | WriterProperties \| None | Writer properties passed to the Rust parquet writer. | None |
| commit_properties | CommitProperties \| None | Properties of the transaction commit. If None, default values are used. | None |
| post_commithook_properties | PostCommitHookProperties \| None | Properties for the post-commit hook. If None, default values are used. | None |

Returns:

| Type | Description |
| --- | --- |
| dict[str, Any] | Optimize metrics. |

Example

Use a timedelta object to specify the seconds, minutes, or hours of the interval.

```python
from deltalake import DeltaTable, write_deltalake
from datetime import timedelta
import pyarrow as pa

write_deltalake("tmp", pa.table({"x": [1], "y": [4]}))
write_deltalake("tmp", pa.table({"x": [2], "y": [5]}), mode="append")

dt = DeltaTable("tmp")
time_delta = timedelta(minutes=10)
dt.optimize.compact(min_commit_interval=time_delta)
# {'numFilesAdded': 1, 'numFilesRemoved': 2, 'filesAdded': ..., 'filesRemoved': ...,
#  'partitionsOptimized': 1, 'numBatches': 2, 'totalConsideredFiles': 2,
#  'totalFilesSkipped': 0, 'plannerStrategy': 'preserveLocality',
#  'preservedStableOrder': True, 'preserveInsertionOrder': True, 'maxBinSpanFiles': 2}
```
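The `partition_filters` argument takes a list of `(column, operator, value)` tuples that are ANDed together; supported operators include `=`, `!=`, `in`, and `not in`. A small sketch of the filter shape, using hypothetical partition columns `date` and `country`:

```python
# Each tuple is (column, operator, value); the list is ANDed together.
# "date" and "country" are hypothetical partition columns for illustration.
filters = [
    ("date", "=", "2024-01-01"),
    ("country", "in", ["US", "CA"]),
]

# Passed to compact, this would restrict the optimize pass to the
# matching partitions, e.g.:
#   dt.optimize.compact(partition_filters=filters)
for column, op, value in filters:
    print(column, op, value)
```

Limiting compaction to recently written partitions like this avoids rewriting cold data that is already well compacted.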

z_order

z_order(columns: Iterable[str], partition_filters: FilterConjunctionType | None = None, target_size: int | None = None, max_concurrent_tasks: int | None = None, max_spill_size: int | None = None, max_temp_directory_size: int | None = None, min_commit_interval: int | timedelta | None = None, writer_properties: WriterProperties | None = None, *args: Any, commit_properties: CommitProperties | None = None, post_commithook_properties: PostCommitHookProperties | None = None) -> dict[str, Any]

Reorders the data using a Z-order curve to improve data skipping.

This also performs compaction, so the same parameters as compact() apply. Unlike compact(), Z-ordering rewrites the file order.
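As a rough illustration of why Z-ordering improves data skipping: a Z-order (Morton) curve interleaves the bits of each column's value, so rows that are close in every ordered column end up close in the sort order, which keeps file-level min/max statistics tight. A minimal two-column sketch, not the library's implementation:

```python
def z_value(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of x and y into a single Morton (Z-order) key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)      # x bits at even positions
        key |= ((y >> i) & 1) << (2 * i + 1)  # y bits at odd positions
    return key

# Sorting points by their Z-value clusters neighbors in BOTH dimensions,
# so files hold tight value ranges in each column and prune better.
points = [(7, 7), (0, 0), (1, 0), (0, 1), (1, 1)]
ordered = sorted(points, key=lambda p: z_value(*p))
print(ordered)  # [(0, 0), (1, 0), (0, 1), (1, 1), (7, 7)]
```

A plain lexicographic sort by `x` then `y` would cluster only the first column; the interleaving is what lets queries filtering on either column skip files.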

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| columns | Iterable[str] | The columns to use for Z-ordering. There must be at least one column. | required |
| partition_filters | FilterConjunctionType \| None | Partition filters used to match files. | None |
| target_size | int \| None | Approximate target file size in bytes. If not provided, uses the delta.targetFileSize table property or 100 MB. | None |
| max_concurrent_tasks | int \| None | Maximum number of concurrent tasks to use for file compaction. Defaults to the number of CPUs. More concurrent tasks can make compaction faster but also use more memory. | None |
| max_spill_size | int \| None | Maximum number of bytes allowed in memory before spilling to disk. If not specified, uses DataFusion's default. | None |
| max_temp_directory_size | int \| None | Maximum disk space for temporary spill files. If not specified, uses DataFusion's default. | None |
| min_commit_interval | int \| timedelta \| None | Minimum interval, in seconds or as a timedelta, before a new commit is created. Useful for long-running executions. Set to 0 or timedelta(0) for one commit per partition. | None |
| writer_properties | WriterProperties \| None | Writer properties passed to the Rust parquet writer. | None |
| commit_properties | CommitProperties \| None | Properties of the transaction commit. If None, default values are used. | None |
| post_commithook_properties | PostCommitHookProperties \| None | Properties for the post-commit hook. If None, default values are used. | None |

Returns:

| Type | Description |
| --- | --- |
| dict[str, Any] | Optimize metrics. |

Example

Use a timedelta object to specify the seconds, minutes, or hours of the interval.

```python
from deltalake import DeltaTable, write_deltalake
from datetime import timedelta
import pyarrow as pa

write_deltalake("tmp", pa.table({"x": [1], "y": [4]}))
write_deltalake("tmp", pa.table({"x": [2], "y": [5]}), mode="append")

dt = DeltaTable("tmp")
time_delta = timedelta(minutes=10)
dt.optimize.z_order(["x"], min_commit_interval=time_delta)
# {'numFilesAdded': 1, 'numFilesRemoved': 2, 'filesAdded': ..., 'filesRemoved': ...,
#  'partitionsOptimized': 0, 'numBatches': 1, 'totalConsideredFiles': 2,
#  'totalFilesSkipped': 0, 'plannerStrategy': 'zOrder',
#  'preservedStableOrder': False, 'preserveInsertionOrder': False, 'maxBinSpanFiles': 2}
```