API Reference
DeltaTable
- class deltalake.table.DeltaTable(table_uri, version=None, storage_options=None, without_files=False, log_buffer_size=None)
Create a DeltaTable instance.
- Parameters
table_uri (Union[str, pathlib.Path]) –
version (Optional[int]) –
storage_options (Optional[Dict[str, str]]) –
without_files (bool) –
log_buffer_size (Optional[int]) –
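A minimal construction sketch; the local path "tmp" and the pinned version are placeholders, and the storage_options keys shown are illustrative assumptions that depend on the storage backend:
>>> from deltalake import DeltaTable
>>> dt = DeltaTable("tmp")                # load the latest version from a local path
>>> dt_v1 = DeltaTable("tmp", version=1)  # pin a specific historical version
>>> # For object stores, credentials are typically supplied via storage_options
>>> # (illustrative keys only):
>>> # DeltaTable("s3://bucket/table", storage_options={"AWS_REGION": "us-east-1"})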
- delete(predicate=None)
Delete records from a Delta Table that satisfy a predicate.
When a predicate is not provided, all records are deleted from the Delta Table. Otherwise, a scan of the Delta table is performed to mark any files that contain records satisfying the predicate. Once those files are identified, they are rewritten without the matching records.
- Parameters
predicate (Optional[str]) – a SQL where clause. If not passed, will delete all rows.
- Returns
the metrics from delete.
- Return type
Dict[str, Any]
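As a hedged sketch, assuming a local table at "tmp" with a string column id:
>>> from deltalake import DeltaTable
>>> dt = DeltaTable("tmp")
>>> metrics = dt.delete(predicate="id = '5'")  # delete only the rows matching the predicate
>>> dt.delete()                                # no predicate: delete all rows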
- file_uris(partition_filters=None)
Get the list of files as absolute URIs, including the scheme (e.g. "s3://").
Local files will be just plain absolute paths, without a scheme. (That is, no 'file://' prefix.)
Use the partition_filters parameter to retrieve a subset of files that match the given filters.
- Parameters
partition_filters (Optional[List[Tuple[str, str, Any]]]) – the partition filters that will be used for getting the matched files
- Returns
list of the .parquet files with an absolute URI referenced for the current version of the DeltaTable
- Return type
List[str]
Predicates are expressed in disjunctive normal form (DNF), like [("x", "=", "a"), ...]. DNF allows arbitrary boolean logical combinations of single partition predicates. The innermost tuples each describe a single partition predicate. The list of inner predicates is interpreted as a conjunction (AND), forming a more selective predicate over multiple partition columns. Each tuple has format (key, op, value) and compares the key with the value. The supported ops are: =, !=, in, and not in. If the op is in or not in, the value must be a collection such as a list, a set or a tuple. The supported type for value is str. Use the empty string '' for a null partition value.
Examples: ("x", "=", "a") ("x", "!=", "a") ("y", "in", ["a", "b", "c"]) ("z", "not in", ["a", "b"])
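A short usage sketch, assuming a table partitioned by a string column y:
>>> from deltalake import DeltaTable
>>> dt = DeltaTable("tmp")
>>> dt.file_uris()                                     # all files for the current version
>>> dt.file_uris(partition_filters=[("y", "=", "a")])  # only files in the partition y=a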
- files(partition_filters=None)
Get the .parquet files of the DeltaTable.
The paths are as they are saved in the delta log, which may either be relative to the table root or absolute URIs.
- Parameters
partition_filters (Optional[List[Tuple[str, str, Any]]]) – the partition filters that will be used for getting the matched files
- Returns
list of the .parquet files referenced for the current version of the DeltaTable
- Return type
List[str]
Predicates are expressed in disjunctive normal form (DNF), like [("x", "=", "a"), ...]. DNF allows arbitrary boolean logical combinations of single partition predicates. The innermost tuples each describe a single partition predicate. The list of inner predicates is interpreted as a conjunction (AND), forming a more selective predicate over multiple partition columns. Each tuple has format (key, op, value) and compares the key with the value. The supported ops are: =, !=, in, and not in. If the op is in or not in, the value must be a collection such as a list, a set or a tuple. The supported type for value is str. Use the empty string '' for a null partition value.
Examples: ("x", "=", "a") ("x", "!=", "a") ("y", "in", ["a", "b", "c"]) ("z", "not in", ["a", "b"])
- classmethod from_data_catalog(data_catalog, database_name, table_name, data_catalog_id=None, version=None, log_buffer_size=None)
Create the Delta Table from a Data Catalog.
- Parameters
data_catalog (deltalake.data_catalog.DataCatalog) – the Catalog to use for getting the storage location of the Delta Table
database_name (str) – the database name inside the Data Catalog
table_name (str) – the table name inside the Data Catalog
data_catalog_id (Optional[str]) – the identifier of the Data Catalog
version (Optional[int]) – version of the DeltaTable
log_buffer_size (Optional[int]) – Number of files to buffer when reading the commit log. A positive integer. Setting a value greater than 1 results in concurrent calls to the storage api. This can decrease latency if there are many files in the log since the last checkpoint, but will also increase memory usage. Possible rate limits of the storage backend should also be considered for optimal performance. Defaults to 4 * number of cpus.
- Return type
DeltaTable
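A hedged sketch using the AWS Glue catalog; the database and table names are placeholders:
>>> from deltalake import DeltaTable
>>> from deltalake.data_catalog import DataCatalog
>>> dt = DeltaTable.from_data_catalog(
...     data_catalog=DataCatalog.AWS,
...     database_name="my_database",
...     table_name="my_table",
... )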
- get_add_actions(flatten=False)
Return a dataframe with all current add actions.
Add actions represent the files that currently make up the table. This data is a low-level representation parsed from the transaction log.
- Parameters
flatten (bool) – whether to flatten the schema. Partition value columns are given the prefix "partition.", statistics (null_count, min, and max) are given the prefixes "null_count.", "min.", and "max.", and tags are given the prefix "tags.". Nested field names are concatenated with ".".
- Returns
a PyArrow RecordBatch containing the add action data.
- Return type
pyarrow.RecordBatch
Examples:
>>> from deltalake import DeltaTable, write_deltalake
>>> import pyarrow as pa
>>> data = pa.table({"x": [1, 2, 3], "y": [4, 5, 6]})
>>> write_deltalake("tmp", data, partition_by=["x"])
>>> dt = DeltaTable("tmp")
>>> dt.get_add_actions().to_pandas()
                                                path  size_bytes       modification_time  data_change partition_values  num_records null_count       min       max
0  x=2/0-91820cbf-f698-45fb-886d-5d5f5669530b-0.p...         565 1970-01-20 08:40:08.071         True         {'x': 2}            1   {'y': 0}  {'y': 5}  {'y': 5}
1  x=3/0-91820cbf-f698-45fb-886d-5d5f5669530b-0.p...         565 1970-01-20 08:40:08.071         True         {'x': 3}            1   {'y': 0}  {'y': 6}  {'y': 6}
2  x=1/0-91820cbf-f698-45fb-886d-5d5f5669530b-0.p...         565 1970-01-20 08:40:08.071         True         {'x': 1}            1   {'y': 0}  {'y': 4}  {'y': 4}
>>> dt.get_add_actions(flatten=True).to_pandas()
                                                path  size_bytes       modification_time  data_change  partition.x  num_records  null_count.y  min.y  max.y
0  x=2/0-91820cbf-f698-45fb-886d-5d5f5669530b-0.p...         565 1970-01-20 08:40:08.071         True            2            1             0      5      5
1  x=3/0-91820cbf-f698-45fb-886d-5d5f5669530b-0.p...         565 1970-01-20 08:40:08.071         True            3            1             0      6      6
2  x=1/0-91820cbf-f698-45fb-886d-5d5f5669530b-0.p...         565 1970-01-20 08:40:08.071         True            1            1             0      4      4
- history(limit=None)
Run the history command on the DeltaTable. The operations are returned in reverse chronological order.
- Parameters
limit (Optional[int]) – the commit info limit to return
- Returns
list of the commit infos registered in the transaction log
- Return type
List[Dict[str, Any]]
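A short sketch; the exact keys present in each commit info dict depend on the operation that produced the commit:
>>> from deltalake import DeltaTable
>>> dt = DeltaTable("tmp")
>>> dt.history(limit=1)  # only the most recent commit
>>> dt.history()         # full history, newest first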
- load_version(version)
Load a DeltaTable with a specified version.
- Parameters
version (int) – the identifier of the version of the DeltaTable to load
- Return type
None
- load_with_datetime(datetime_string)
Time travel the Delta table to the latest version that was created at or before the provided datetime_string argument. The datetime_string argument should be an RFC 3339 and ISO 8601 date and time string.
Examples: 2018-01-26T18:30:09Z 2018-12-19T16:39:57-08:00 2018-01-26T18:30:09.453+00:00
- Parameters
datetime_string (str) – the identifier of the datetime point of the DeltaTable to load
- Return type
None
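A time-travel sketch combining load_version and load_with_datetime; the version number and timestamp are placeholders:
>>> from deltalake import DeltaTable
>>> dt = DeltaTable("tmp")
>>> dt.load_version(1)                             # jump to an explicit version
>>> dt.load_with_datetime("2022-01-01T00:00:00Z")  # or to the latest version created at or before a timestamp
>>> dt.version()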
- merge(source, predicate, source_alias=None, target_alias=None, error_on_type_mismatch=True)
Pass the source data which you want to merge into the target Delta table, providing a predicate in SQL-like format. You can also specify what to do when the underlying data types do not match those of the table.
- Args:
source (pyarrow.Table | pyarrow.RecordBatch | pyarrow.RecordBatchReader): source data
predicate (str): SQL like predicate on how to merge
source_alias (str): Alias for the source table
target_alias (str): Alias for the target table
error_on_type_mismatch (bool): specify if merge will return an error if data types are mismatching. Defaults to True.
- Returns:
TableMerger: TableMerger Object
- Parameters
source (Union[pyarrow.lib.Table, pyarrow.lib.RecordBatch, pyarrow.lib.RecordBatchReader]) –
predicate (str) –
source_alias (Optional[str]) –
target_alias (Optional[str]) –
error_on_type_mismatch (bool) –
- Return type
TableMerger
- metadata()
Get the current metadata of the DeltaTable.
- Returns
the current Metadata registered in the transaction log
- Return type
Metadata
- protocol()
Get the reader and writer protocol versions of the DeltaTable.
- Returns
the current ProtocolVersions registered in the transaction log
- Return type
ProtocolVersions
- restore(target, *, ignore_missing_files=False, protocol_downgrade_allowed=False)
Run the Restore command on the Delta Table: restore table to a given version or datetime.
- Parameters
target (Union[int, datetime.datetime, str]) – the version to restore to, expressed as an int, a date string, or a datetime.
ignore_missing_files (bool) – whether the operation should carry on when some data files are missing.
protocol_downgrade_allowed (bool) – whether to allow the operation when it requires downgrading the protocol version.
- Returns
the metrics from restore.
- Return type
Dict[str, Any]
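A hedged sketch; the target may be a version number, a datetime, or a date string:
>>> from deltalake import DeltaTable
>>> from datetime import datetime
>>> dt = DeltaTable("tmp")
>>> dt.restore(1)                                                # restore to version 1
>>> dt.restore(datetime(2023, 1, 1), ignore_missing_files=True)  # or to a point in time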
- schema()
Get the current schema of the DeltaTable.
- Returns
the current Schema registered in the transaction log
- Return type
Schema
- to_pandas(partitions=None, columns=None, filesystem=None, filters=None)
Build a pandas dataframe using data from the DeltaTable.
- Parameters
partitions (Optional[List[Tuple[str, str, Any]]]) – A list of partition filters, see help(DeltaTable.files_by_partitions) for filter syntax
columns (Optional[List[str]]) – The columns to project. This can be a list of column names to include (order and duplicates will be preserved)
filesystem (Optional[Union[str, pyarrow._fs.FileSystem]]) – A concrete implementation of the Pyarrow FileSystem or a fsspec-compatible interface. If None, the first file path will be used to determine the right FileSystem
filters (Optional[Union[List[Tuple[str, str, Any]], List[List[Tuple[str, str, Any]]]]]) – A disjunctive normal form (DNF) predicate for filtering rows. If you pass a filter, you do not need to pass partitions.
- Return type
pandas.DataFrame
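A usage sketch, assuming a table partitioned by x with an integer column y (both placeholders):
>>> from deltalake import DeltaTable
>>> dt = DeltaTable("tmp")
>>> dt.to_pandas(columns=["x", "y"])            # project columns
>>> dt.to_pandas(filters=[("y", ">", 4)])       # row-level DNF filter
>>> dt.to_pandas(partitions=[("x", "=", "2")])  # partition pruning only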
- to_pyarrow_dataset(partitions=None, filesystem=None, parquet_read_options=None)
Build a PyArrow Dataset using data from the DeltaTable.
- Parameters
partitions (Optional[List[Tuple[str, str, Any]]]) – A list of partition filters, see help(DeltaTable.files_by_partitions) for filter syntax
filesystem (Optional[Union[str, pyarrow._fs.FileSystem]]) – A concrete implementation of the Pyarrow FileSystem or a fsspec-compatible interface. If None, the first file path will be used to determine the right FileSystem
parquet_read_options (Optional[pyarrow._dataset_parquet.ParquetReadOptions]) – Optional read options for Parquet. Use this to handle INT96 to timestamp conversion for edge cases like 0001-01-01 or 9999-12-31 More info: https://arrow.apache.org/docs/python/generated/pyarrow.dataset.ParquetReadOptions.html
- Returns
the PyArrow dataset in PyArrow
- Return type
pyarrow.dataset.Dataset
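A sketch showing how the returned dataset composes with PyArrow dataset expressions; the column names are placeholders:
>>> import pyarrow.dataset as ds
>>> from deltalake import DeltaTable
>>> dt = DeltaTable("tmp")
>>> dataset = dt.to_pyarrow_dataset()
>>> table = dataset.to_table(filter=ds.field("y") > 4, columns=["x", "y"])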
- to_pyarrow_table(partitions=None, columns=None, filesystem=None, filters=None)
Build a PyArrow Table using data from the DeltaTable.
- Parameters
partitions (Optional[List[Tuple[str, str, Any]]]) – A list of partition filters, see help(DeltaTable.files_by_partitions) for filter syntax
columns (Optional[List[str]]) – The columns to project. This can be a list of column names to include (order and duplicates will be preserved)
filesystem (Optional[Union[str, pyarrow._fs.FileSystem]]) – A concrete implementation of the Pyarrow FileSystem or a fsspec-compatible interface. If None, the first file path will be used to determine the right FileSystem
filters (Optional[Union[List[Tuple[str, str, Any]], List[List[Tuple[str, str, Any]]]]]) – A disjunctive normal form (DNF) predicate for filtering rows. If you pass a filter, you do not need to pass partitions.
- Return type
pyarrow.Table
- update(updates, predicate=None, writer_properties=None, error_on_type_mismatch=True)
UPDATE records in the Delta Table that match an optional predicate.
- Parameters
updates (Dict[str, str]) – a mapping of column name to update SQL expression.
predicate (Optional[str]) – a logical expression, defaults to None
writer_properties (Optional[Dict[str, int]]) – Pass writer properties to the Rust parquet writer; see the options at https://arrow.apache.org/rust/parquet/file/properties/struct.WriterProperties.html. Only the fields data_page_size_limit, dictionary_page_size_limit, data_page_row_count_limit, write_batch_size, and max_row_group_size are supported.
error_on_type_mismatch (bool) – specify if update will return an error if data types are mismatching. Defaults to True.
- Returns
the metrics from update
- Return type
Dict[str, Any]
Examples:
Update some row values with SQL predicate. This is equivalent to
UPDATE table SET deleted = true WHERE id = '5'
>>> from deltalake import DeltaTable
>>> dt = DeltaTable("tmp")
>>> dt.update(
...     predicate="id = '5'",
...     updates={"deleted": "True"},
... )
Update all row values. This is equivalent to
UPDATE table SET deleted = true, id = concat(id, '_old')
>>> from deltalake import DeltaTable
>>> dt = DeltaTable("tmp")
>>> dt.update(
...     updates={
...         "deleted": "True",
...         "id": "concat(id, '_old')",
...     }
... )
- update_incremental()
Updates the DeltaTable to the latest version by incrementally applying newer versions.
- Return type
None
- vacuum(retention_hours=None, dry_run=True, enforce_retention_duration=True)
Run the Vacuum command on the Delta Table: list and delete files that are no longer referenced by the Delta table and are older than the retention threshold.
- Parameters
retention_hours (Optional[int]) – the retention threshold in hours, if none then the value from configuration.deletedFileRetentionDuration is used or default of 1 week otherwise.
dry_run (bool) – when True (the default), only list the files that would be deleted; when False, delete them.
enforce_retention_duration (bool) – when disabled, accepts retention hours smaller than the value from configuration.deletedFileRetentionDuration.
- Returns
the list of files that are no longer referenced by the Delta Table and are older than the retention threshold.
- Return type
List[str]
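A hedged sketch; keeping the default dry_run=True first lets you inspect which files would be removed:
>>> from deltalake import DeltaTable
>>> dt = DeltaTable("tmp")
>>> dt.vacuum(retention_hours=168)                 # dry run: only lists candidate files
>>> dt.vacuum(retention_hours=168, dry_run=False)  # actually delete the files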
- version()
Get the version of the DeltaTable.
- Returns
The current version of the DeltaTable
- Return type
int
- class deltalake.table.Metadata(table)
Create a Metadata instance.
- Parameters
table (deltalake._internal.RawDeltaTable) –
- property configuration: Dict[str, str]
Return the DeltaTable properties.
- property created_time: int
Return the time when this metadata action was created, in milliseconds since the Unix epoch.
- property description: str
Return the user-provided description of the DeltaTable.
- property id: int
Return the unique identifier of the DeltaTable.
- property name: str
Return the user-provided identifier of the DeltaTable.
- property partition_columns: List[str]
Return an array containing the names of the partitioned columns of the DeltaTable.
- class deltalake.table.ProtocolVersions(min_reader_version, min_writer_version)
- Parameters
min_reader_version (int) –
min_writer_version (int) –
- min_reader_version: int
Alias for field number 0
- min_writer_version: int
Alias for field number 1
- class deltalake.table.TableMerger(table, source, predicate, source_alias=None, target_alias=None, safe_cast=True)
API for various table MERGE commands.
- Parameters
table (deltalake.table.DeltaTable) –
source (pyarrow.lib.RecordBatchReader) –
predicate (str) –
source_alias (Optional[str]) –
target_alias (Optional[str]) –
safe_cast (bool) –
- execute()
Executes MERGE with the previously provided settings, in Rust, using the Apache DataFusion query engine.
- Returns:
Dict[str, any]: metrics
- Return type
Dict[str, Any]
- when_matched_delete(predicate=None)
Delete a matched row from the table only if the given predicate (if specified) is true for the matched row. If not specified it deletes all matches.
- Args:
predicate (str | None, optional): SQL like predicate on when to delete. Defaults to None.
- Returns:
TableMerger: TableMerger Object
Examples:
Delete on a predicate
>>> from deltalake import DeltaTable
>>> import pyarrow as pa
>>> data = pa.table({"x": [1, 2, 3], "y": [4, 5, 6]})
>>> dt = DeltaTable("tmp")
>>> (
...     dt.merge(source=data, predicate="target.x = source.x", source_alias="source", target_alias="target")
...     .when_matched_delete(predicate="source.deleted = true")
...     .execute()
... )
Delete all records that were matched
>>> from deltalake import DeltaTable
>>> import pyarrow as pa
>>> data = pa.table({"x": [1, 2, 3], "y": [4, 5, 6]})
>>> dt = DeltaTable("tmp")
>>> (
...     dt.merge(source=data, predicate="target.x = source.x", source_alias="source", target_alias="target")
...     .when_matched_delete()
...     .execute()
... )
- Parameters
predicate (Optional[str]) –
- Return type
TableMerger
- when_matched_update(updates, predicate=None)
Update a matched table row based on the rules defined by updates. If a predicate is specified, then it must evaluate to true for the row to be updated.
- Args:
updates (dict): a mapping of column name to update SQL expression.
predicate (str | None, optional): SQL like predicate on when to update. Defaults to None.
- Returns:
TableMerger: TableMerger Object
Examples:
>>> from deltalake import DeltaTable
>>> import pyarrow as pa
>>> data = pa.table({"x": [1, 2, 3], "y": [4, 5, 6]})
>>> dt = DeltaTable("tmp")
>>> (
...     dt.merge(source=data, predicate="target.x = source.x", source_alias="source", target_alias="target")
...     .when_matched_update(
...         updates={
...             "x": "source.x",
...             "y": "source.y",
...         }
...     )
...     .execute()
... )
- Parameters
updates (Dict[str, str]) –
predicate (Optional[str]) –
- Return type
TableMerger
- when_matched_update_all(predicate=None)
Updates all source fields to target fields; source and target are required to have the same field names. If a predicate is specified, then it must evaluate to true for the row to be updated.
- Args:
predicate (str | None, optional): SQL like predicate on when to update all columns. Defaults to None.
- Returns:
TableMerger: TableMerger Object
Examples:
>>> from deltalake import DeltaTable
>>> import pyarrow as pa
>>> data = pa.table({"x": [1, 2, 3], "y": [4, 5, 6]})
>>> dt = DeltaTable("tmp")
>>> (
...     dt.merge(source=data, predicate="target.x = source.x", source_alias="source", target_alias="target")
...     .when_matched_update_all()
...     .execute()
... )
- Parameters
predicate (Optional[str]) –
- Return type
TableMerger
- when_not_matched_by_source_delete(predicate=None)
Delete a target row that has no matches in the source from the table only if the given predicate (if specified) is true for the target row.
- Args:
predicate (str | None, optional): SQL like predicate on when to delete when not matched by source. Defaults to None.
- Returns:
TableMerger: TableMerger Object
- Parameters
predicate (Optional[str]) –
- Return type
TableMerger
- when_not_matched_by_source_update(updates, predicate=None)
Update a target row that has no matches in the source based on the rules defined by updates. If a predicate is specified, then it must evaluate to true for the row to be updated.
- Args:
updates (dict): a mapping of column name to update SQL expression.
predicate (str | None, optional): SQL like predicate on when to update. Defaults to None.
- Returns:
TableMerger: TableMerger Object
Examples:
>>> from deltalake import DeltaTable
>>> import pyarrow as pa
>>> data = pa.table({"x": [1, 2, 3], "y": [4, 5, 6]})
>>> dt = DeltaTable("tmp")
>>> (
...     dt.merge(source=data, predicate="target.x = source.x", source_alias="source", target_alias="target")
...     .when_not_matched_by_source_update(
...         predicate="y > 3",
...         updates={"y": "0"},
...     )
...     .execute()
... )
- Parameters
updates (Dict[str, str]) –
predicate (Optional[str]) –
- Return type
TableMerger
- when_not_matched_insert(updates, predicate=None)
Insert a new row to the target table based on the rules defined by updates. If a predicate is specified, then it must evaluate to true for the new row to be inserted.
- Args:
updates (dict): a mapping of column name to insert SQL expression.
predicate (str | None, optional): SQL like predicate on when to insert. Defaults to None.
- Returns:
TableMerger: TableMerger Object
Examples:
>>> from deltalake import DeltaTable
>>> import pyarrow as pa
>>> data = pa.table({"x": [1, 2, 3], "y": [4, 5, 6]})
>>> dt = DeltaTable("tmp")
>>> (
...     dt.merge(source=data, predicate="target.x = source.x", source_alias="source", target_alias="target")
...     .when_not_matched_insert(
...         updates={
...             "x": "source.x",
...             "y": "source.y",
...         }
...     )
...     .execute()
... )
- Parameters
updates (Dict[str, str]) –
predicate (Optional[str]) –
- Return type
TableMerger
- when_not_matched_insert_all(predicate=None)
Insert a new row to the target table, updating all source fields to target fields. Source and target are required to have the same field names. If a predicate is specified, then it must evaluate to true for the new row to be inserted.
- Args:
predicate (str | None, optional): SQL like predicate on when to insert. Defaults to None.
- Returns:
TableMerger: TableMerger Object
Examples:
>>> from deltalake import DeltaTable
>>> import pyarrow as pa
>>> data = pa.table({"x": [1, 2, 3], "y": [4, 5, 6]})
>>> dt = DeltaTable("tmp")
>>> (
...     dt.merge(source=data, predicate="target.x = source.x", source_alias="source", target_alias="target")
...     .when_not_matched_insert_all()
...     .execute()
... )
- Parameters
predicate (Optional[str]) –
- Return type
TableMerger
- with_writer_properties(data_page_size_limit=None, dictionary_page_size_limit=None, data_page_row_count_limit=None, write_batch_size=None, max_row_group_size=None)
Pass writer properties to the Rust parquet writer, see options https://arrow.apache.org/rust/parquet/file/properties/struct.WriterProperties.html:
- Args:
data_page_size_limit (int | None, optional): Limit DataPage size to this in bytes. Defaults to None.
dictionary_page_size_limit (int | None, optional): Limit the size of each DataPage to store dicts to this amount in bytes. Defaults to None.
data_page_row_count_limit (int | None, optional): Limit the number of rows in each DataPage. Defaults to None.
write_batch_size (int | None, optional): Splits internally to smaller batch size. Defaults to None.
max_row_group_size (int | None, optional): Max number of rows in row group. Defaults to None.
- Returns:
TableMerger: TableMerger Object
- Parameters
data_page_size_limit (Optional[int]) –
dictionary_page_size_limit (Optional[int]) –
data_page_row_count_limit (Optional[int]) –
write_batch_size (Optional[int]) –
max_row_group_size (Optional[int]) –
- Return type
TableMerger
- class deltalake.table.TableOptimizer(table)
API for various table optimization commands.
- Parameters
table (deltalake.table.DeltaTable) –
- compact(partition_filters=None, target_size=None, max_concurrent_tasks=None, min_commit_interval=None)
Compacts small files to reduce the total number of files in the table.
This operation is idempotent; if run twice on the same table (assuming it has not been updated) it will do nothing the second time.
If this operation happens concurrently with any operations other than append, it will fail.
- Parameters
partition_filters (Optional[Union[List[Tuple[str, str, Any]], List[List[Tuple[str, str, Any]]]]]) – the partition filters that will be used for getting the matched files
target_size (Optional[int]) – desired file size after bin-packing files, in bytes. If not provided, will attempt to read the table configuration value delta.targetFileSize. If that value isn't set, will use a default value of 256MB.
max_concurrent_tasks (Optional[int]) – the maximum number of concurrent tasks to use for file compaction. Defaults to number of CPUs. More concurrent tasks can make compaction faster, but will also use more memory.
min_commit_interval (Optional[Union[int, datetime.timedelta]]) – minimum interval in seconds or as timedeltas before a new commit is created. Interval is useful for long running executions. Set to 0 or timedelta(0), if you want a commit per partition.
- Returns
the metrics from optimize
- Return type
Dict[str, Any]
Examples:
Use a timedelta object to specify the seconds, minutes or hours of the interval.
>>> from deltalake import DeltaTable
>>> from datetime import timedelta
>>> dt = DeltaTable("tmp")
>>> time_delta = timedelta(minutes=10)
>>> dt.optimize.compact(min_commit_interval=time_delta)
- z_order(columns, partition_filters=None, target_size=None, max_concurrent_tasks=None, max_spill_size=21474836480, min_commit_interval=None)
Reorders the data using a Z-order curve to improve data skipping.
This also performs compaction, so the same parameters as compact() apply.
- Parameters
columns (Iterable[str]) – the columns to use for Z-ordering. There must be at least one column.
partition_filters (Optional[Union[List[Tuple[str, str, Any]], List[List[Tuple[str, str, Any]]]]]) – the partition filters that will be used for getting the matched files
target_size (Optional[int]) – desired file size after bin-packing files, in bytes. If not provided, will attempt to read the table configuration value delta.targetFileSize. If that value isn't set, will use a default value of 256MB.
max_concurrent_tasks (Optional[int]) – the maximum number of concurrent tasks to use for file compaction. Defaults to number of CPUs. More concurrent tasks can make compaction faster, but will also use more memory.
max_spill_size (int) – the maximum number of bytes to spill to disk. Defaults to 20GB.
min_commit_interval (Optional[Union[int, datetime.timedelta]]) – minimum interval in seconds or as timedeltas before a new commit is created. Interval is useful for long running executions. Set to 0 or timedelta(0), if you want a commit per partition.
- Returns
the metrics from optimize
- Return type
Dict[str, Any]
Examples:
Use a timedelta object to specify the seconds, minutes or hours of the interval.
>>> from deltalake import DeltaTable
>>> from datetime import timedelta
>>> dt = DeltaTable("tmp")
>>> time_delta = timedelta(minutes=10)
>>> dt.optimize.z_order(["timestamp"], min_commit_interval=time_delta)
Writing DeltaTables
- deltalake.write_deltalake(table_or_uri, data, *, schema=None, partition_by=None, filesystem=None, mode='error', file_options=None, max_partitions=None, max_open_files=1024, max_rows_per_file=10485760, min_rows_per_group=65536, max_rows_per_group=131072, name=None, description=None, configuration=None, overwrite_schema=False, storage_options=None, partition_filters=None, large_dtypes=False)
Write to a Delta Lake table
If the table does not already exist, it will be created.
This function only supports writer protocol version 2 currently. When attempting to write to an existing table with a higher min_writer_version, this function will throw DeltaProtocolError.
Note that this function does NOT register this table in a data catalog.
- Parameters
table_or_uri (Union[str, pathlib.Path, deltalake.table.DeltaTable]) – URI of a table or a DeltaTable object.
data (Union[pandas.core.frame.DataFrame, pyarrow.lib.Table, pyarrow.lib.RecordBatch, Iterable[pyarrow.lib.RecordBatch], pyarrow.lib.RecordBatchReader]) – Data to write. If passing iterable, the schema must also be given.
schema (Optional[pyarrow.lib.Schema]) – Optional schema to write.
partition_by (Optional[Union[List[str], str]]) – List of columns to partition the table by. Only required when creating a new table.
filesystem (Optional[pyarrow._fs.FileSystem]) – Optional filesystem to pass to PyArrow. If not provided, will be inferred from the URI. The file system has to be rooted at the table root; use pyarrow.fs.SubTreeFileSystem to re-root an existing PyArrow file system.
mode (Literal['error', 'append', 'overwrite', 'ignore']) – How to handle existing data. Default is to error if table already exists. If ‘append’, will add new data. If ‘overwrite’, will replace table with new data. If ‘ignore’, will not write anything if table already exists.
file_options (Optional[pyarrow._dataset_parquet.ParquetFileWriteOptions]) – Optional write options for Parquet (ParquetFileWriteOptions). Can be provided with defaults using ParquetFileWriteOptions().make_write_options(). Please refer to https://github.com/apache/arrow/blob/master/python/pyarrow/_dataset_parquet.pyx#L492-L533 for the list of available options
max_partitions (Optional[int]) – the maximum number of partitions that will be used.
max_open_files (int) – Limits the maximum number of files that can be left open while writing. If an attempt is made to open too many files then the least recently used file will be closed. If this setting is set too low you may end up fragmenting your data into many small files.
max_rows_per_file (int) – Maximum number of rows per file. If greater than 0 then this will limit how many rows are placed in any single file. Otherwise there will be no limit and one file will be created in each output directory unless files need to be closed to respect max_open_files
min_rows_per_group (int) – Minimum number of rows per group. When the value is set, the dataset writer will batch incoming data and only write the row groups to the disk when sufficient rows have accumulated.
max_rows_per_group (int) – Maximum number of rows per group. If the value is set, then the dataset writer may split up large incoming batches into multiple row groups. If this value is set, then min_rows_per_group should also be set.
name (Optional[str]) – User-provided identifier for this table.
description (Optional[str]) – User-provided description for this table.
configuration (Optional[Mapping[str, Optional[str]]]) – A map containing configuration options for the metadata action.
overwrite_schema (bool) – If True, allows updating the schema of the table.
storage_options (Optional[Dict[str, str]]) – options passed to the native delta filesystem. Unused if ‘filesystem’ is defined.
partition_filters (Optional[List[Tuple[str, str, Any]]]) – the partition filters that will be used for partition overwrite.
large_dtypes (bool) – If True, the table schema is checked against large_dtypes
- Return type
None
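A minimal write/append sketch; the path, data, and partition column are placeholders:
>>> import pyarrow as pa
>>> from deltalake import write_deltalake
>>> data = pa.table({"x": [1, 2, 3], "y": [4, 5, 6]})
>>> write_deltalake("tmp", data, partition_by=["x"])  # create a new partitioned table
>>> write_deltalake("tmp", data, mode="append")       # append more rows
>>> write_deltalake("tmp", data, mode="overwrite")    # replace the table contents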
Delta Lake Schemas
Schemas, fields, and data types are provided in the deltalake.schema submodule.
- class deltalake.schema.Schema(fields)
A Delta Lake schema
Create using a list of Field:
>>> Schema([Field("x", "integer"), Field("y", "string")])
Schema([Field(x, PrimitiveType("integer"), nullable=True), Field(y, PrimitiveType("string"), nullable=True)])
Or create from a PyArrow schema:
>>> import pyarrow as pa
>>> Schema.from_pyarrow(pa.schema({"x": pa.int32(), "y": pa.string()}))
Schema([Field(x, PrimitiveType("integer"), nullable=True), Field(y, PrimitiveType("string"), nullable=True)])
- static from_json(schema_json)
Create a new Schema from a JSON string.
A schema has the same JSON format as a StructType.
>>> Schema.from_json("""{
...     "type": "struct",
...     "fields": [{"name": "x", "type": "integer", "nullable": true, "metadata": {}}]
... }""")
Schema([Field(x, PrimitiveType("integer"), nullable=True)])
- Parameters
schema_json (str) – a JSON string
- Return type
Schema
- static from_pyarrow(data_type)
Create from a PyArrow schema
- Parameters
data_type (pyarrow.Schema) – a PyArrow schema
- Return type
Schema
- invariants
The list of invariants on the table.
- Return type
List[Tuple[str, str]]
- Returns
a tuple of strings for each invariant. The first string is the field path and the second is the SQL of the invariant.
- json()
DEPRECATED: Convert to JSON dictionary representation
- to_json()
Get the JSON representation of the schema.
A schema has the same JSON format as a StructType.
>>> Schema([Field("x", "integer")]).to_json()
'{"type":"struct","fields":[{"name":"x","type":"integer","nullable":true,"metadata":{}}]}'
- Return type
str
- to_pyarrow(as_large_types=False)
Return equivalent PyArrow schema
- Parameters
as_large_types – get schema with all variable size types (list, binary, string) as large variants (with int64 indices). This is for compatibility with systems like Polars that only support the large versions of Arrow types.
- Return type
pyarrow.Schema
- class deltalake.schema.PrimitiveType(data_type)
A primitive datatype, such as a string or number.
Can be initialized with a string value:
>>> PrimitiveType("integer")
PrimitiveType("integer")
Valid primitive data types include:
"string",
"long",
"integer",
"short",
"byte",
"float",
"double",
"boolean",
"binary",
"date",
"timestamp",
"decimal(<precision>, <scale>)"
- Parameters
data_type – string representation of the data type
- static from_json(type_json)
Create a PrimitiveType from a JSON string
The JSON representation for a primitive type is just a quoted string:
>>> PrimitiveType.from_json('"integer"')
PrimitiveType("integer")
- Parameters
type_json (str) – A JSON string
- Return type
PrimitiveType
- static from_pyarrow(data_type)
Create a PrimitiveType from a PyArrow type
Will raise TypeError if the PyArrow type is not a primitive type.
- Parameters
data_type (pyarrow.DataType) – A PyArrow DataType
- Return type
PrimitiveType
- to_json()
Get the JSON string representation of the type.
- Return type
str
- to_pyarrow()
Get the equivalent PyArrow type.
- Return type
- type
The inner type
- Return type
str
- class deltalake.schema.ArrayType(element_type, contains_null=True)
An Array (List) DataType
Can either pass the element type explicitly or can pass a string if it is a primitive type:
>>> ArrayType(PrimitiveType("integer"))
ArrayType(PrimitiveType("integer"), contains_null=True)
>>> ArrayType("integer", contains_null=False)
ArrayType(PrimitiveType("integer"), contains_null=False)
- contains_null
Whether the arrays may contain null values
- Return type
bool
- element_type
The type of the element
- Return type
Union[PrimitiveType, ArrayType, MapType, StructType]
- static from_json(type_json)
Create an ArrayType from a JSON string
The JSON representation for an array type is an object with type (set to "array"), elementType, and containsNull:
>>> ArrayType.from_json("""{
...     "type": "array",
...     "elementType": "integer",
...     "containsNull": false
... }""")
ArrayType(PrimitiveType("integer"), contains_null=False)
- Parameters
type_json (str) – A JSON string
- Return type
ArrayType
- static from_pyarrow(data_type)
Create an ArrayType from a pyarrow.ListType.
Will raise TypeError if a different PyArrow DataType is provided.
- Parameters
data_type (pyarrow.ListType) – The PyArrow datatype
- Return type
ArrayType
- to_json()
Get the JSON string representation of the type.
- Return type
str
- to_pyarrow()
Get the equivalent PyArrow type.
- Return type
- type
The string “array”
- Return type
str
- class deltalake.schema.MapType(key_type, value_type, value_contains_null=True)
A map data type
key_type and value_type should be PrimitiveType, ArrayType, MapType, or StructType. A string can also be passed, which will be parsed as a primitive type:
>>> MapType(PrimitiveType("integer"), PrimitiveType("string"))
MapType(PrimitiveType("integer"), PrimitiveType("string"), value_contains_null=True)
>>> MapType("integer", "string", value_contains_null=False)
MapType(PrimitiveType("integer"), PrimitiveType("string"), value_contains_null=False)
- static from_json(type_json)
Create a MapType from a JSON string
The JSON representation for a map type is an object with type (set to "map"), keyType, valueType, and valueContainsNull:
>>> MapType.from_json("""{
...     "type": "map",
...     "keyType": "integer",
...     "valueType": "string",
...     "valueContainsNull": true
... }""")
MapType(PrimitiveType("integer"), PrimitiveType("string"), value_contains_null=True)
- Parameters
type_json (str) – A JSON string
- Return type
MapType
- static from_pyarrow(data_type)
Create a MapType from a PyArrow MapType.
Will raise TypeError if passed a different type.
- Parameters
data_type (pyarrow.MapType) – the PyArrow MapType
- Return type
MapType
- key_type
The type of the keys
- Return type
Union[PrimitiveType, ArrayType, MapType, StructType]
- to_json()
Get JSON string representation of map type.
- Return type
str
- to_pyarrow()
Get the equivalent PyArrow data type.
- Return type
- type
The string “map”
- Return type
str
- value_contains_null
Whether the values in a map may be null
- Return type
bool
- value_type
The type of the values
- Return type
Union[PrimitiveType, ArrayType, MapType, StructType]
- class deltalake.schema.Field(name, type, nullable=True, metadata=None)
A field in a Delta StructType or Schema
Can create with just a name and a type:
>>> Field("my_int_col", "integer") Field("my_int_col", PrimitiveType("integer"), nullable=True, metadata=None)
Can also attach metadata to the field. Metadata should be a dictionary with string keys and JSON-serializable values (str, list, int, float, dict):
>>> Field("my_col", "integer", metadata={"custom_metadata": {"test": 2}}) Field("my_col", PrimitiveType("integer"), nullable=True, metadata={"custom_metadata": {"test": 2}})
- static from_json(field_json)
Create a Field from a JSON string.
>>> Field.from_json("""{
...     "name": "col",
...     "type": "integer",
...     "nullable": true,
...     "metadata": {}
... }""")
Field(col, PrimitiveType("integer"), nullable=True)
- Parameters
field_json (str) – the JSON string.
- Return type
Field
- static from_pyarrow(field)
Create a Field from a PyArrow field
Note: This currently doesn’t preserve field metadata.
- Parameters
field (pyarrow.Field) – a PyArrow field
- Return type
Field
- metadata
The metadata of the field
- Return type
dict
- name
The name of the field
- Return type
str
- nullable
Whether there may be null values in the field
- Return type
bool
- to_json()
Get the field as JSON string.
>>> Field("col", "integer").to_json() '{"name":"col","type":"integer","nullable":true,"metadata":{}}'
- Return type
str
- to_pyarrow()
Convert to an equivalent PyArrow field
Note: This currently doesn’t preserve field metadata.
- Return type
pyarrow.Field
- type
The type of the field
- Return type
Union[PrimitiveType, ArrayType, MapType, StructType]
- class deltalake.schema.StructType(fields)
A struct datatype, containing one or more subfields
Create with a list of Field:
>>> StructType([Field("x", "integer"), Field("y", "string")])
StructType([Field(x, PrimitiveType("integer"), nullable=True), Field(y, PrimitiveType("string"), nullable=True)])
- static from_json(type_json)
Create a new StructType from a JSON string.
>>> StructType.from_json("""{
...     "type": "struct",
...     "fields": [{"name": "x", "type": "integer", "nullable": true, "metadata": {}}]
... }""")
StructType([Field(x, PrimitiveType("integer"), nullable=True)])
- Parameters
type_json (str) – a JSON string
- Return type
StructType
- static from_pyarrow(data_type)
Create a new StructType from a PyArrow struct type.
Will raise TypeError if a different data type is provided.
- Parameters
data_type (pyarrow.StructType) – a PyArrow struct type.
- Return type
StructType
- to_json()
Get the JSON representation of the type.
>>> StructType([Field("x", "integer")]).to_json()
'{"type":"struct","fields":[{"name":"x","type":"integer","nullable":true,"metadata":{}}]}'
- Return type
str
- to_pyarrow()
Get the equivalent PyArrow StructType
- Return type
pyarrow.StructType
- type
The string “struct”
DataCatalog
- class deltalake.data_catalog.DataCatalog(value)
List of the Data Catalogs
- AWS = 'glue'
Refers to the AWS Glue Data Catalog
- UNITY = 'unity'
Refers to the Databricks Unity Catalog
DeltaStorageHandler
- class deltalake.fs.DeltaStorageHandler(table_uri, options=None, known_sizes=None)
DeltaStorageHandler is a concrete implementation of a PyArrow FileSystemHandler.
- get_file_info_selector(selector)
Get info for the files defined by FileSelector.
- Parameters
selector (pyarrow._fs.FileSelector) – FileSelector object
- Returns
list of file info objects
- Return type
List[pyarrow._fs.FileInfo]
- open_input_file(path)
Open an input file for random access reading.
- Parameters
path (str) – The source to open for reading.
- Returns
NativeFile
- Return type
pyarrow.NativeFile
- open_input_stream(path)
Open an input stream for sequential reading.
- Parameters
path (str) – The source to open for reading.
- Returns
NativeFile
- Return type
pyarrow.NativeFile
- open_output_stream(path, metadata=None)
Open an output stream for sequential writing.
If the target already exists, existing data is truncated.
- Parameters
path (str) – The source to open for writing.
metadata (Optional[Dict[str, str]]) – If not None, a mapping of string keys to string values.
- Returns
NativeFile
- Return type
pyarrow.NativeFile
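A hedged sketch of how the handler is typically consumed: wrapped in a pyarrow.fs.PyFileSystem and used to read files relative to the table root (the local table path is a placeholder):
>>> import pyarrow.fs as pa_fs
>>> import pyarrow.parquet as pq
>>> from deltalake import DeltaTable
>>> from deltalake.fs import DeltaStorageHandler
>>> dt = DeltaTable("tmp")
>>> fs = pa_fs.PyFileSystem(DeltaStorageHandler(dt.table_uri))
>>> pq.read_table(dt.files()[0], filesystem=fs)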