API Reference

DeltaTable

class deltalake.table.DeltaTable(table_uri, version=None, storage_options=None, without_files=False)

Create a DeltaTable instance.

Parameters
  • table_uri (str) – the path of the DeltaTable

  • version (Optional[int]) – version of the DeltaTable

  • storage_options (Optional[Dict[str, str]]) – a dictionary of the options to use for the storage backend

  • without_files (bool) – If True, load the table without tracking files. Some append-only applications might have no need of tracking any files, and skipping them significantly reduces memory usage.

file_uris(partition_filters=None)

Get the list of files as absolute URIs, including the scheme (e.g. "s3://").

Local files will be plain absolute paths without a scheme (that is, no "file://" prefix).

Use the partition_filters parameter to retrieve a subset of files that match the given filters.

Parameters

partition_filters (Optional[List[Tuple[str, str, Any]]]) – the partition filters that will be used for getting the matched files

Returns

list of the .parquet files, as absolute URIs, referenced by the current version of the DeltaTable

Return type

List[str]

Predicates are expressed in disjunctive normal form (DNF), like [("x", "=", "a"), …]. DNF allows arbitrary boolean logical combinations of single partition predicates. The innermost tuples each describe a single partition predicate. The list of inner predicates is interpreted as a conjunction (AND), forming a more selective multi-partition predicate. Each tuple has the format (key, op, value) and compares the key against the value. The supported ops are =, !=, in, and not in. If the op is in or not in, the value must be a collection such as a list, a set, or a tuple. The supported type for value is str. Use the empty string '' for a null partition value.

Examples: ("x", "=", "a"), ("x", "!=", "a"), ("y", "in", ["a", "b", "c"]), ("z", "not in", ["a", "b"])
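
For example, with the partitioned table written in the get_add_actions() example below, a filter on the partition column "x" narrows the result to one file (a sketch; the absolute prefix depends on where the table lives):

>>> dt = DeltaTable("tmp")
>>> dt.file_uris(partition_filters=[("x", "=", "1")])
['/.../tmp/x=1/0-91820cbf-f698-45fb-886d-5d5f5669530b-0.parquet']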

files(partition_filters=None)

Get the .parquet files of the DeltaTable.

The paths are as they are saved in the delta log, which may either be relative to the table root or absolute URIs.

Parameters

partition_filters (Optional[List[Tuple[str, str, Any]]]) – the partition filters that will be used for getting the matched files

Returns

list of the .parquet files referenced for the current version of the DeltaTable

Return type

List[str]

Predicates are expressed in disjunctive normal form (DNF), like [("x", "=", "a"), …]. DNF allows arbitrary boolean logical combinations of single partition predicates. The innermost tuples each describe a single partition predicate. The list of inner predicates is interpreted as a conjunction (AND), forming a more selective multi-partition predicate. Each tuple has the format (key, op, value) and compares the key against the value. The supported ops are =, !=, in, and not in. If the op is in or not in, the value must be a collection such as a list, a set, or a tuple. The supported type for value is str. Use the empty string '' for a null partition value.

Examples: ("x", "=", "a"), ("x", "!=", "a"), ("y", "in", ["a", "b", "c"]), ("z", "not in", ["a", "b"])
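
Continuing the same sketch, files() returns the paths as stored in the delta log, here relative to the table root:

>>> dt = DeltaTable("tmp")
>>> dt.files(partition_filters=[("x", "=", "1")])
['x=1/0-91820cbf-f698-45fb-886d-5d5f5669530b-0.parquet']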

files_by_partitions(partition_filters)

Get the files that match a given list of partitions filters.

Deprecated since version 0.7.0: Use file_uris() instead.

Partitions which do not match the filter predicate will be removed from the scanned data. Predicates are expressed in disjunctive normal form (DNF), like [("x", "=", "a"), …]. DNF allows arbitrary boolean logical combinations of single partition predicates. The innermost tuples each describe a single partition predicate. The list of inner predicates is interpreted as a conjunction (AND), forming a more selective multi-partition predicate. Each tuple has the format (key, op, value) and compares the key against the value. The supported ops are =, !=, in, and not in. If the op is in or not in, the value must be a collection such as a list, a set, or a tuple. The supported type for value is str. Use the empty string '' for a null partition value.

Examples: ("x", "=", "a"), ("x", "!=", "a"), ("y", "in", ["a", "b", "c"]), ("z", "not in", ["a", "b"])

Parameters

partition_filters (List[Tuple[str, str, Any]]) – the partition filters that will be used for getting the matched files

Returns

list of the .parquet files referenced by the current version of the DeltaTable after applying the partition filters.

Return type

List[str]

classmethod from_data_catalog(data_catalog, database_name, table_name, data_catalog_id=None, version=None)

Create the Delta Table from a Data Catalog.

Parameters
  • data_catalog (deltalake.data_catalog.DataCatalog) – the Catalog to use for getting the storage location of the Delta Table

  • database_name (str) – the database name inside the Data Catalog

  • table_name (str) – the table name inside the Data Catalog

  • data_catalog_id (Optional[str]) – the identifier of the Data Catalog

  • version (Optional[int]) – version of the DeltaTable

Return type

deltalake.table.DeltaTable
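
A minimal sketch of resolving a table registered in AWS Glue (the database and table names are placeholders):

>>> from deltalake import DataCatalog, DeltaTable
>>> dt = DeltaTable.from_data_catalog(
...     data_catalog=DataCatalog.AWS,
...     database_name="my_database",
...     table_name="my_table",
... )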

get_add_actions(flatten=False)

Return a dataframe with all current add actions.

Add actions represent the files that currently make up the table. This data is a low-level representation parsed from the transaction log.

Parameters

flatten (bool) – whether to flatten the schema. Partition value columns are given the prefix "partition.", statistics (null_count, min, and max) are given the prefixes "null_count.", "min.", and "max.", and tags are given the prefix "tags.". Nested field names are concatenated with ".".

Returns

a PyArrow RecordBatch containing the add action data.

Return type

pyarrow.lib.RecordBatch

Examples:

>>> from deltalake import DeltaTable, write_deltalake
>>> import pyarrow as pa
>>> data = pa.table({"x": [1, 2, 3], "y": [4, 5, 6]})
>>> write_deltalake("tmp", data, partition_by=["x"])
>>> dt = DeltaTable("tmp")
>>> dt.get_add_actions().to_pandas()
                                                        path  size_bytes       modification_time  data_change partition_values  num_records null_count       min       max
0  x=2/0-91820cbf-f698-45fb-886d-5d5f5669530b-0.p...         565 1970-01-20 08:40:08.071         True         {'x': 2}            1   {'y': 0}  {'y': 5}  {'y': 5}
1  x=3/0-91820cbf-f698-45fb-886d-5d5f5669530b-0.p...         565 1970-01-20 08:40:08.071         True         {'x': 3}            1   {'y': 0}  {'y': 6}  {'y': 6}
2  x=1/0-91820cbf-f698-45fb-886d-5d5f5669530b-0.p...         565 1970-01-20 08:40:08.071         True         {'x': 1}            1   {'y': 0}  {'y': 4}  {'y': 4}
>>> dt.get_add_actions(flatten=True).to_pandas()
                                                        path  size_bytes       modification_time  data_change  partition.x  num_records  null_count.y  min.y  max.y
0  x=2/0-91820cbf-f698-45fb-886d-5d5f5669530b-0.p...         565 1970-01-20 08:40:08.071         True            2            1             0      5      5
1  x=3/0-91820cbf-f698-45fb-886d-5d5f5669530b-0.p...         565 1970-01-20 08:40:08.071         True            3            1             0      6      6
2  x=1/0-91820cbf-f698-45fb-886d-5d5f5669530b-0.p...         565 1970-01-20 08:40:08.071         True            1            1             0      4      4

history(limit=None)

Run the history command on the DeltaTable. The operations are returned in reverse chronological order.

Parameters

limit (Optional[int]) – the commit info limit to return

Returns

list of the commit infos registered in the transaction log

Return type

List[Dict[str, Any]]
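
For example, inspecting the most recent commit (a sketch; the keys present in each commit info depend on the operation that produced it):

>>> dt = DeltaTable("tmp")
>>> dt.history(limit=1)[0]["operation"]
'WRITE'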

load_version(version)

Load a DeltaTable with a specified version.

Parameters

version (int) – the identifier of the version of the DeltaTable to load

Return type

None

load_with_datetime(datetime_string)

Time travel the DeltaTable to the latest version created at or before the provided datetime_string. The datetime_string argument should be an RFC 3339 / ISO 8601 date-and-time string.

Examples: 2018-01-26T18:30:09Z 2018-12-19T16:39:57-08:00 2018-01-26T18:30:09.453+00:00

Parameters

datetime_string (str) – the identifier of the datetime point of the DeltaTable to load

Return type

None
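
A short sketch combining both forms of time travel (the timestamp is illustrative):

>>> dt = DeltaTable("tmp")
>>> dt.load_version(0)                             # pin to the first version
>>> dt.load_with_datetime("2021-11-04T00:05:23Z")  # or pin to a point in time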

metadata()

Get the current metadata of the DeltaTable.

Returns

the current Metadata registered in the transaction log

Return type

deltalake.table.Metadata

pyarrow_schema()

Get the current schema of the DeltaTable with the Parquet PyArrow format.

DEPRECATED: use DeltaTable.schema().to_pyarrow() instead.

Returns

the current Schema with the Parquet PyArrow format

Return type

pyarrow.lib.Schema

schema()

Get the current schema of the DeltaTable.

Returns

the current Schema registered in the transaction log

Return type

deltalake.schema.Schema

to_pandas(partitions=None, columns=None, filesystem=None)

Build a pandas dataframe using data from the DeltaTable.

Parameters
  • partitions (Optional[List[Tuple[str, str, Any]]]) – A list of partition filters, see help(DeltaTable.files_by_partitions) for filter syntax

  • columns (Optional[List[str]]) – The columns to project. This can be a list of column names to include (order and duplicates will be preserved)

  • filesystem (Optional[Union[str, pyarrow._fs.FileSystem]]) – A concrete implementation of the Pyarrow FileSystem or a fsspec-compatible interface. If None, the first file path will be used to determine the right FileSystem

Returns

a pandas dataframe

Return type

pandas.DataFrame
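
For example, with the partitioned table from the get_add_actions() example, reading one partition and projecting a single column:

>>> dt = DeltaTable("tmp")
>>> dt.to_pandas(partitions=[("x", "=", "1")], columns=["y"])
   y
0  4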

to_pyarrow_dataset(partitions=None, filesystem=None, parquet_read_options=None)

Build a PyArrow Dataset using data from the DeltaTable.

Parameters
  • partitions (Optional[List[Tuple[str, str, Any]]]) – A list of partition filters, see help(DeltaTable.files_by_partitions) for filter syntax

  • filesystem (Optional[Union[str, pyarrow._fs.FileSystem]]) – A concrete implementation of the Pyarrow FileSystem or a fsspec-compatible interface. If None, the first file path will be used to determine the right FileSystem

  • parquet_read_options (Optional[pyarrow._dataset_parquet.ParquetReadOptions]) – Optional read options for Parquet. Use this to handle INT96 to timestamp conversion for edge cases like 0001-01-01 or 9999-12-31

Returns

the PyArrow dataset

Return type

pyarrow._dataset.Dataset
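
The resulting dataset evaluates lazily, so PyArrow expressions can be used to filter it before materializing. A minimal sketch:

>>> import pyarrow.dataset as ds
>>> dataset = dt.to_pyarrow_dataset()
>>> dataset.to_table(filter=ds.field("y") > 4)  # materialize only rows where y > 4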

to_pyarrow_table(partitions=None, columns=None, filesystem=None)

Build a PyArrow Table using data from the DeltaTable.

Parameters
  • partitions (Optional[List[Tuple[str, str, Any]]]) – A list of partition filters, see help(DeltaTable.files_by_partitions) for filter syntax

  • columns (Optional[List[str]]) – The columns to project. This can be a list of column names to include (order and duplicates will be preserved)

  • filesystem (Optional[Union[str, pyarrow._fs.FileSystem]]) – A concrete implementation of the Pyarrow FileSystem or a fsspec-compatible interface. If None, the first file path will be used to determine the right FileSystem

Returns

the PyArrow table

Return type

pyarrow.lib.Table

update_incremental()

Updates the DeltaTable to the latest version by incrementally applying newer versions.

Return type

None

vacuum(retention_hours=None, dry_run=True, enforce_retention_duration=True)

Run the Vacuum command on the Delta Table: list and delete files that are no longer referenced by the Delta table and are older than the retention threshold.

Parameters
  • retention_hours (Optional[int]) – the retention threshold in hours. If None, the value from configuration.deletedFileRetentionDuration is used, or a default of 1 week otherwise.

  • dry_run (bool) – when True, only list the files that would be deleted; when False, delete them

  • enforce_retention_duration (bool) – when disabled, accepts retention hours smaller than the value from configuration.deletedFileRetentionDuration.

Returns

the list of files no longer referenced by the Delta Table that are older than the retention threshold.

Return type

List[str]
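
For example, previewing the files before actually deleting them (lowering the retention threshold can break time travel for readers of older versions):

>>> dt.vacuum(retention_hours=168)                 # dry run: only list the files
>>> dt.vacuum(retention_hours=168, dry_run=False)  # actually delete them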

version()

Get the version of the DeltaTable.

Returns

The current version of the DeltaTable

Return type

int

exception deltalake.table.DeltaTableProtocolError

Raised when a table's protocol version is higher than this client supports; for example, write_deltalake() raises it when the table's min_writer_version is too high.

class deltalake.table.Metadata(table)

Create a Metadata instance.

Parameters

table (RawDeltaTable) –

property configuration: Dict[str, str]

Return the DeltaTable properties.

property created_time: int

Return the time when this metadata action was created, in milliseconds since the Unix epoch.

property description: str

Return the user-provided description of the DeltaTable.

property id: int

Return the unique identifier of the DeltaTable.

property name: str

Return the user-provided identifier of the DeltaTable.

property partition_columns: List[str]

Return an array containing the names of the partitioned columns of the DeltaTable.

class deltalake.table.ProtocolVersions(min_reader_version, min_writer_version)
Parameters
  • min_reader_version (int) –

  • min_writer_version (int) –

min_reader_version: int

Alias for field number 0

min_writer_version: int

Alias for field number 1

Writing DeltaTables

deltalake.write_deltalake(table_or_uri, data, *, schema=None, partition_by=None, filesystem=None, mode='error', file_options=None, max_open_files=1024, max_rows_per_file=10485760, min_rows_per_group=65536, max_rows_per_group=131072, name=None, description=None, configuration=None, overwrite_schema=False, storage_options=None)

Write to a Delta Lake table (Experimental)

If the table does not already exist, it will be created.

This function currently only supports protocol version 1. If attempting to write to an existing table with a higher min_writer_version, this function will raise DeltaTableProtocolError.

Note that this function does NOT register this table in a data catalog.

Parameters
  • table_or_uri (Union[str, deltalake.table.DeltaTable]) – URI of a table or a DeltaTable object.

  • data (Union[pandas.core.frame.DataFrame, pyarrow.lib.Table, pyarrow.lib.RecordBatch, Iterable[pyarrow.lib.RecordBatch], pyarrow.lib.RecordBatchReader]) – Data to write. If passing iterable, the schema must also be given.

  • schema (Optional[pyarrow.lib.Schema]) – Optional schema to write.

  • partition_by (Optional[List[str]]) – List of columns to partition the table by. Only required when creating a new table.

  • filesystem (Optional[pyarrow._fs.FileSystem]) – Optional filesystem to pass to PyArrow. If not provided, it will be inferred from the URI. The file system has to be rooted at the table root; use pyarrow.fs.SubTreeFileSystem to re-root an existing PyArrow file system.

  • mode (Literal['error', 'append', 'overwrite', 'ignore']) – How to handle existing data. Default is to error if table already exists. If 'append', will add new data. If 'overwrite', will replace table with new data. If 'ignore', will not write anything if table already exists.

  • file_options (Optional[pyarrow._dataset_parquet.ParquetFileWriteOptions]) – Optional write options for Parquet (ParquetFileWriteOptions). Can be provided with defaults using ParquetFileWriteOptions().make_write_options(). Please refer to https://github.com/apache/arrow/blob/master/python/pyarrow/_dataset_parquet.pyx#L492-L533 for the list of available options

  • max_open_files (int) – Limits the maximum number of files that can be left open while writing. If an attempt is made to open too many files then the least recently used file will be closed. If this setting is set too low you may end up fragmenting your data into many small files.

  • max_rows_per_file (int) – Maximum number of rows per file. If greater than 0 then this will limit how many rows are placed in any single file. Otherwise there will be no limit and one file will be created in each output directory unless files need to be closed to respect max_open_files

  • min_rows_per_group (int) – Minimum number of rows per group. When the value is set, the dataset writer will batch incoming data and only write the row groups to the disk when sufficient rows have accumulated.

  • max_rows_per_group (int) – Maximum number of rows per group. If the value is set, then the dataset writer may split up large incoming batches into multiple row groups. If this value is set, then min_rows_per_group should also be set.

  • name (Optional[str]) – User-provided identifier for this table.

  • description (Optional[str]) – User-provided description for this table.

  • configuration (Optional[Mapping[str, Optional[str]]]) – A map containing configuration options for the metadata action.

  • overwrite_schema (bool) – If True, allows updating the schema of the table.

  • storage_options (Optional[Dict[str, str]]) – options passed to the native delta filesystem. Unused if 'filesystem' is defined.

Return type

None
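
A minimal sketch of creating a table and then appending to it (the path is illustrative):

>>> from deltalake import write_deltalake
>>> import pyarrow as pa
>>> data = pa.table({"x": [1, 2, 3], "y": [4, 5, 6]})
>>> write_deltalake("tmp", data)                 # creates the table
>>> write_deltalake("tmp", data, mode="append")  # appends to it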

Delta Lake Schemas

Schemas, fields, and data types are provided in the deltalake.schema submodule.

class deltalake.schema.Schema(fields)

A Delta Lake schema

Create using a list of Field:

>>> Schema([Field("x", "integer"), Field("y", "string")])
Schema([Field(x, PrimitiveType("integer"), nullable=True), Field(y, PrimitiveType("string"), nullable=True)])

Or create from a PyArrow schema:

>>> import pyarrow as pa
>>> Schema.from_pyarrow(pa.schema({"x": pa.int32(), "y": pa.string()}))
Schema([Field(x, PrimitiveType("integer"), nullable=True), Field(y, PrimitiveType("string"), nullable=True)])
static from_json(schema_json)

Create a new Schema from a JSON string.

A schema has the same JSON format as a StructType.

>>> Schema.from_json("""{
...  "type": "struct",
...  "fields": [{"name": "x", "type": "integer", "nullable": true, "metadata": {}}]
...  }""")
Schema([Field(x, PrimitiveType("integer"), nullable=True)])
Parameters

schema_json (str) – a JSON string

Return type

Schema

static from_pyarrow(data_type)

Create from a PyArrow schema

Parameters

data_type (pyarrow.Schema) – a PyArrow schema

Return type

Schema

invariants

The list of invariants on the table.

Returns

a tuple of strings for each invariant. The first string is the field path and the second is the SQL of the invariant.

Return type

List[Tuple[str, str]]

json()

DEPRECATED: Convert to JSON dictionary representation

to_json()

Get the JSON representation of the schema.

A schema has the same JSON format as a StructType.

>>> Schema([Field("x", "integer")]).to_json()
'{"type":"struct","fields":[{"name":"x","type":"integer","nullable":true,"metadata":{}}]}'
Return type

str

to_pyarrow()

Return equivalent PyArrow schema

Return type

pyarrow.Schema

class deltalake.schema.PrimitiveType(data_type)

A primitive datatype, such as a string or number.

Can be initialized with a string value:

>>> PrimitiveType("integer")
PrimitiveType("integer")

Valid primitive data types include:

  • "string",

  • "long",

  • "integer",

  • "short",

  • "byte",

  • "float",

  • "double",

  • "boolean",

  • "binary",

  • "date",

  • "timestamp",

  • "decimal(<precision>, <scale>)"

Parameters

data_type – string representation of the data type

static from_json(type_json)

Create a PrimitiveType from a JSON string

The JSON representation for a primitive type is just a quoted string:

>>> PrimitiveType.from_json('"integer"')
PrimitiveType("integer")
Parameters

type_json (str) – A JSON string

Return type

PrimitiveType

static from_pyarrow(data_type)

Create a PrimitiveType from a PyArrow type

Will raise TypeError if the PyArrow type is not a primitive type.

Parameters

data_type (pyarrow.DataType) – A PyArrow DataType

Return type

PrimitiveType

to_json()

Get the JSON string representation of the type.

Return type

str

to_pyarrow()

Get the equivalent PyArrow type.

Return type

pyarrow.DataType

type

The inner type

Return type

str

class deltalake.schema.ArrayType(element_type, contains_null=True)

An Array (List) DataType

Can either pass the element type explicitly or can pass a string if it is a primitive type:

>>> ArrayType(PrimitiveType("integer"))
ArrayType(PrimitiveType("integer"), contains_null=True)
>>> ArrayType("integer", contains_null=False)
ArrayType(PrimitiveType("integer"), contains_null=False)
contains_null

Whether the arrays may contain null values

Return type

bool

element_type

The type of the element

Return type

Union[PrimitiveType, ArrayType, MapType, StructType]

static from_json(type_json)

Create an ArrayType from a JSON string

The JSON representation for an array type is an object with type (set to "array"), elementType, and containsNull:

>>> ArrayType.from_json("""{
...   "type": "array",
...   "elementType": "integer",
...   "containsNull": false
... }""")
ArrayType(PrimitiveType("integer"), contains_null=False)
Parameters

type_json (str) – A JSON string

Return type

ArrayType

static from_pyarrow(data_type)

Create an ArrayType from a pyarrow.ListType.

Will raise TypeError if a different PyArrow DataType is provided.

Parameters

data_type (pyarrow.ListType) – The PyArrow datatype

Return type

ArrayType

to_json()

Get the JSON string representation of the type.

Return type

str

to_pyarrow()

Get the equivalent PyArrow type.

Return type

pyarrow.DataType

type

The string "array"

Return type

str

class deltalake.schema.MapType(key_type, value_type, value_contains_null=True)

A map data type

key_type and value_type should be a PrimitiveType, ArrayType, MapType, or StructType. A string can also be passed, which will be parsed as a primitive type:

>>> MapType(PrimitiveType("integer"), PrimitiveType("string"))
MapType(PrimitiveType("integer"), PrimitiveType("string"), value_contains_null=True)
>>> MapType("integer", "string", value_contains_null=False)
MapType(PrimitiveType("integer"), PrimitiveType("string"), value_contains_null=False)
static from_json(type_json)

Create a MapType from a JSON string

The JSON representation for a map type is an object with type (set to "map"), keyType, valueType, and valueContainsNull:

>>> MapType.from_json("""{
...   "type": "map",
...   "keyType": "integer",
...   "valueType": "string",
...   "valueContainsNull": true
... }""")
MapType(PrimitiveType("integer"), PrimitiveType("string"), value_contains_null=True)
Parameters

type_json (str) – A JSON string

Return type

MapType

static from_pyarrow(data_type)

Create a MapType from a PyArrow MapType.

Will raise TypeError if passed a different type.

Parameters

data_type (pyarrow.MapType) – the PyArrow MapType

Return type

MapType

key_type

The type of the keys

Return type

Union[PrimitiveType, ArrayType, MapType, StructType]

to_json()

Get JSON string representation of map type.

Return type

str

to_pyarrow()

Get the equivalent PyArrow data type.

Return type

pyarrow.MapType

type

The string "map"

Return type

str

value_contains_null

Whether the values in a map may be null

Return type

bool

value_type

The type of the values

Return type

Union[PrimitiveType, ArrayType, MapType, StructType]

class deltalake.schema.Field(name, ty, nullable=True, metadata=None)

A field in a Delta StructType or Schema

Can create with just a name and a type:

>>> Field("my_int_col", "integer")
Field("my_int_col", PrimitiveType("integer"), nullable=True, metadata=None)

Can also attach metadata to the field. Metadata should be a dictionary with string keys and JSON-serializable values (str, list, int, float, dict):

>>> Field("my_col", "integer", metadata={"custom_metadata": {"test": 2}})
Field("my_col", PrimitiveType("integer"), nullable=True, metadata={"custom_metadata": {"test": 2}})
static from_json(field_json)

Create a Field from a JSON string.

>>> Field.from_json("""{
...  "name": "col",
...  "type": "integer",
...  "nullable": true,
...  "metadata": {}
... }""")
Field(col, PrimitiveType("integer"), nullable=True)
Parameters

field_json (str) – the JSON string.

Return type

Field

static from_pyarrow(field)

Create a Field from a PyArrow field

Note: This currently doesn’t preserve field metadata.

Parameters

field – a field

Type

pyarrow.Field

Return type

Field

metadata

The metadata of the field

Return type

dict

name

The name of the field

Return type

str

nullable

Whether there may be null values in the field

Return type

bool

to_json()

Get the field as JSON string.

>>> Field("col", "integer").to_json()
'{"name":"col","type":"integer","nullable":true,"metadata":{}}'
Return type

str

to_pyarrow()

Convert to an equivalent PyArrow field

Note: This currently doesn’t preserve field metadata.

Return type

pyarrow.Field

type

The type of the field

Return type

Union[PrimitiveType, ArrayType, MapType, StructType]

class deltalake.schema.StructType(fields)

A struct datatype, containing one or more subfields

Create with a list of Field:

>>> StructType([Field("x", "integer"), Field("y", "string")])
StructType([Field(x, PrimitiveType("integer"), nullable=True), Field(y, PrimitiveType("string"), nullable=True)])
fields

The fields within the struct

Return type

List[Field]

static from_json(type_json)

Create a new StructType from a JSON string.

>>> StructType.from_json("""{
...  "type": "struct",
...  "fields": [{"name": "x", "type": "integer", "nullable": true, "metadata": {}}]
...  }""")
StructType([Field(x, PrimitiveType("integer"), nullable=True)])
Parameters

type_json (str) – a JSON string

Return type

StructType

static from_pyarrow(data_type)

Create a new StructType from a PyArrow struct type.

Will raise TypeError if a different data type is provided.

Parameters

data_type (pyarrow.StructType) – a PyArrow struct type.

Return type

StructType

to_json()

Get the JSON representation of the type.

>>> StructType([Field("x", "integer")]).to_json()
'{"type":"struct","fields":[{"name":"x","type":"integer","nullable":true,"metadata":{}}]}'
Return type

str

to_pyarrow()

Get the equivalent PyArrow StructType

Return type

pyarrow.StructType

type

The string "struct"

DataCatalog

class deltalake.data_catalog.DataCatalog(value)

Enumeration of the supported data catalogs, used with DeltaTable.from_data_catalog() to resolve a table's storage location.

DeltaStorageHandler

class deltalake.fs.DeltaStorageHandler

DeltaStorageHandler is a concrete implementation of a PyArrow FileSystemHandler.
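
A minimal sketch of wrapping the handler in a pyarrow.fs.PyFileSystem so that PyArrow readers can use it (the table URI is illustrative; the constructor may accept additional storage options depending on the version):

>>> import pyarrow.fs as pa_fs
>>> from deltalake.fs import DeltaStorageHandler
>>> filesystem = pa_fs.PyFileSystem(DeltaStorageHandler("tmp"))
>>> # pass `filesystem` wherever a PyArrow FileSystem is expected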

get_file_info_selector(selector)

Get info for the files defined by FileSelector.

Parameters

selector (pyarrow._fs.FileSelector) – FileSelector object

Returns

list of file info objects

Return type

List[pyarrow._fs.FileInfo]

open_input_file(path)

Open an input file for random access reading.

Parameters

path (str) – The source to open for reading.

Returns

NativeFile

Return type

pyarrow.lib.PythonFile

open_input_stream(path)

Open an input stream for sequential reading.

Parameters

path (str) – The source to open for reading.

Returns

NativeFile

Return type

pyarrow.lib.PythonFile

open_output_stream(path, metadata=None)

Open an output stream for sequential writing.

If the target already exists, existing data is truncated.

Parameters
  • path (str) – The source to open for writing.

  • metadata (Optional[Dict[str, str]]) – If not None, a mapping of string keys to string values.

Returns

NativeFile

Return type

pyarrow.lib.PythonFile