API Reference
DeltaTable
- class deltalake.table.DeltaTable(table_uri, version=None, storage_options=None)
Create a DeltaTable instance.
- Parameters
table_uri (str) –
version (Optional[int]) –
storage_options (Optional[Dict[str, str]]) –
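A minimal usage sketch (the paths, the version number, and the storage option shown are illustrative):

    from deltalake import DeltaTable

    # Open the latest version of a local table.
    dt = DeltaTable("./data/delta-table")

    # Optionally pin a historical version and pass backend credentials;
    # the option keys follow the conventions of the storage backend.
    dt_v0 = DeltaTable(
        "s3://my-bucket/delta-table",
        version=0,
        storage_options={"AWS_REGION": "us-east-1"},
    )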
- file_paths()
Get the list of files with an absolute path. Deprecated in favor of file_uris(), which returns the same result.
- Returns
list of the .parquet files, as absolute URIs, referenced by the current version of the DeltaTable
- Return type
List[str]
- file_uris()
Get the list of files with an absolute path.
- Returns
list of the .parquet files, as absolute URIs, referenced by the current version of the DeltaTable
- Return type
List[str]
- files()
Get the .parquet files of the DeltaTable.
- Returns
list of the .parquet files referenced by the current version of the DeltaTable
- Return type
List[str]
- files_by_partitions(partition_filters)
Get the files that match a given list of partition filters. Partitions that do not match the filter predicates are removed from the scanned data. Predicates are expressed in disjunctive normal form (DNF), like [("x", "=", "a"), ...]. DNF allows arbitrary boolean logical combinations of single partition predicates. Each innermost tuple describes a single partition predicate. The list of inner predicates is interpreted as a conjunction (AND), forming a more selective predicate out of multiple partition predicates. Each tuple has the format (key, op, value) and compares the key against the value. The supported ops are =, !=, in, and not in. If op is in or not in, the value must be a collection such as a list, a set, or a tuple. The supported type for value is str. Use the empty string '' for a null partition value.
Examples: ("x", "=", "a"), ("x", "!=", "a"), ("y", "in", ["a", "b", "c"]), ("z", "not in", ["a", "b"])
- Parameters
partition_filters (List[Tuple[str, str, Any]]) – the partition filters that will be used for getting the matched files
- Returns
list of the .parquet files referenced by the current version of the DeltaTable after applying the partition filters
- Return type
List[str]
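A usage sketch (the table path and partition column names are illustrative):

    dt = DeltaTable("./data/events")

    # Inner tuples are ANDed together: year == "2021" AND month in {"4", "5"}.
    matched = dt.files_by_partitions(
        [("year", "=", "2021"), ("month", "in", ["4", "5"])]
    )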
- classmethod from_data_catalog(data_catalog, database_name, table_name, data_catalog_id=None, version=None)
Create the Delta Table from a Data Catalog.
- Parameters
data_catalog (deltalake.data_catalog.DataCatalog) – the Catalog to use for getting the storage location of the Delta Table
database_name (str) – the database name inside the Data Catalog
table_name (str) – the table name inside the Data Catalog
data_catalog_id (Optional[str]) – the identifier of the Data Catalog
version (Optional[int]) – version of the DeltaTable
- Return type
deltalake.table.DeltaTable
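A sketch resolving a table location through a catalog; it assumes DataCatalog.AWS is the Glue member of the enum, and the database and table names are illustrative:

    from deltalake import DataCatalog, DeltaTable

    dt = DeltaTable.from_data_catalog(
        data_catalog=DataCatalog.AWS,
        database_name="analytics",
        table_name="events",
    )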
- history(limit=None)
Run the history command on the DeltaTable. The operations are returned in reverse chronological order.
- Parameters
limit (Optional[int]) – the commit info limit to return
- Returns
list of the commit infos registered in the transaction log
- Return type
List[Dict[str, Any]]
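A sketch of inspecting recent commits; "operation" and "timestamp" are typical commit-info keys, but the exact fields depend on the writer:

    # Most recent commit first.
    for commit in dt.history(limit=5):
        print(commit.get("operation"), commit.get("timestamp"))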
- load_version(version)
Load a DeltaTable with a specified version.
- Parameters
version (int) – the identifier of the version of the DeltaTable to load
- Return type
None
- load_with_datetime(datetime_string)
Time travel the DeltaTable to the latest version created at or before the provided datetime_string argument. The datetime_string argument should be an RFC 3339 / ISO 8601 date-and-time string.
Examples: 2018-01-26T18:30:09Z, 2018-12-19T16:39:57-08:00, 2018-01-26T18:30:09.453+00:00
- Parameters
datetime_string (str) – the identifier of the datetime point of the DeltaTable to load
- Return type
None
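A time-travel sketch combining both methods (the table path is illustrative):

    dt = DeltaTable("./data/delta-table")

    # Load a specific version by number...
    dt.load_version(1)

    # ...or the latest version created at or before a point in time.
    dt.load_with_datetime("2018-12-19T16:39:57-08:00")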
- metadata()
Get the current metadata of the DeltaTable.
- Returns
the current Metadata registered in the transaction log
- Return type
deltalake.table.Metadata
- pyarrow_schema()
Get the current schema of the DeltaTable in PyArrow format.
- Returns
the current Schema in PyArrow format
- Return type
pyarrow.lib.Schema
- schema()
Get the current schema of the DeltaTable.
- Returns
the current Schema registered in the transaction log
- Return type
deltalake.schema.Schema
- to_pandas(partitions=None, columns=None, filesystem=None)
Build a pandas dataframe using data from the DeltaTable.
- Parameters
partitions (Optional[List[Tuple[str, str, Any]]]) – A list of partition filters, see help(DeltaTable.files_by_partitions) for filter syntax
columns (Optional[List[str]]) – The columns to project. This can be a list of column names to include (order and duplicates will be preserved)
filesystem (Optional[Union[str, pyarrow._fs.FileSystem]]) – A concrete implementation of the PyArrow FileSystem or an fsspec-compatible interface. If None, the first file path will be used to determine the right FileSystem
- Returns
a pandas DataFrame
- Return type
pandas.core.frame.DataFrame
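A sketch (the partition and column names are illustrative):

    df = dt.to_pandas(
        partitions=[("year", "=", "2021")],
        columns=["id", "value"],
    )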
- to_pyarrow_dataset(partitions=None, filesystem=None, parquet_read_options=None)
Build a PyArrow Dataset using data from the DeltaTable.
- Parameters
partitions (Optional[List[Tuple[str, str, Any]]]) – A list of partition filters, see help(DeltaTable.files_by_partitions) for filter syntax
filesystem (Optional[Union[str, pyarrow._fs.FileSystem]]) – A concrete implementation of the PyArrow FileSystem or an fsspec-compatible interface. If None, the first file path will be used to determine the right FileSystem
parquet_read_options (Optional[pyarrow._dataset_parquet.ParquetReadOptions]) – Optional read options for Parquet. Use this to handle INT96 to timestamp conversion for edge cases like 0001-01-01 or 9999-12-31. More info: https://arrow.apache.org/docs/python/generated/pyarrow.dataset.ParquetReadOptions.html
- Returns
the PyArrow Dataset built from the files of the DeltaTable
- Return type
pyarrow._dataset.Dataset
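A sketch of a lazy scan with a read-time filter (the column name is illustrative):

    import pyarrow.dataset as ds

    dataset = dt.to_pyarrow_dataset()

    # Filters on a dataset scan are pushed down to the file fragments.
    table = dataset.to_table(filter=ds.field("value") > 10)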
- to_pyarrow_table(partitions=None, columns=None, filesystem=None)
Build a PyArrow Table using data from the DeltaTable.
- Parameters
partitions (Optional[List[Tuple[str, str, Any]]]) – A list of partition filters, see help(DeltaTable.files_by_partitions) for filter syntax
columns (Optional[List[str]]) – The columns to project. This can be a list of column names to include (order and duplicates will be preserved)
filesystem (Optional[Union[str, pyarrow._fs.FileSystem]]) – A concrete implementation of the PyArrow FileSystem or an fsspec-compatible interface. If None, the first file path will be used to determine the right FileSystem
- Returns
the PyArrow table
- Return type
pyarrow.lib.Table
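A sketch (the column names are illustrative):

    table = dt.to_pyarrow_table(columns=["id", "value"])
    df = table.to_pandas()  # same result as dt.to_pandas(columns=["id", "value"])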
- update_incremental()
Updates the DeltaTable to the latest version by incrementally applying newer versions.
- Return type
None
- vacuum(retention_hours=None, dry_run=True, enforce_retention_duration=True)
Run the Vacuum command on the DeltaTable: list and delete files that are no longer referenced by the Delta table and are older than the retention threshold.
- Parameters
retention_hours (Optional[int]) – the retention threshold in hours; if None, the value from configuration.deletedFileRetentionDuration is used, or a default of 1 week otherwise.
dry_run (bool) – when enabled, only list the files to be deleted; otherwise, delete them
enforce_retention_duration (bool) – when disabled, accepts retention hours smaller than the value from configuration.deletedFileRetentionDuration.
- Returns
the list of files that are no longer referenced by the Delta table and are older than the retention threshold.
- Return type
List[str]
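A sketch of a cautious vacuum workflow:

    # dry_run=True is the default: list the candidates without deleting anything.
    candidates = dt.vacuum(retention_hours=168)

    # Once the list looks right, perform the actual deletion.
    dt.vacuum(retention_hours=168, dry_run=False)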
- version()
Get the version of the DeltaTable.
- Returns
The current version of the DeltaTable
- Return type
int
- class deltalake.table.Metadata(table)
Create a Metadata instance.
- Parameters
table (RawDeltaTable) –
- property configuration: Dict[str, str]
Return the DeltaTable properties.
- property created_time: int
Return the time when this metadata action was created, in milliseconds since the Unix epoch.
- property description: str
Return the user-provided description of the DeltaTable.
- property id: int
Return the unique identifier of the DeltaTable.
- property name: str
Return the user-provided identifier of the DeltaTable.
- property partition_columns: List[str]
Return an array containing the names of the partitioned columns of the DeltaTable.
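A sketch of reading the metadata properties:

    meta = dt.metadata()
    print(meta.id, meta.name, meta.description)
    print(meta.partition_columns)  # e.g. ["year"]
    print(meta.configuration)      # table properties as Dict[str, str]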
Writing DeltaTables
- deltalake.write_deltalake(table_or_uri, data, *, schema=None, partition_by=None, filesystem=None, mode='error', file_options=None, max_open_files=1024, max_rows_per_file=0, min_rows_per_group=0, max_rows_per_group=1048576, name=None, description=None, configuration=None, overwrite_schema=False)
Write to a Delta Lake table (Experimental)
If the table does not already exist, it will be created.
This function only supports protocol version 1 currently. If attempting to write to an existing table with a higher min_writer_version, this function will raise DeltaTableProtocolError.
Note that this function does NOT register this table in a data catalog.
- Parameters
table_or_uri (Union[str, deltalake.table.DeltaTable]) – URI of a table or a DeltaTable object.
data (Union[pandas.core.frame.DataFrame, pyarrow.lib.Table, pyarrow.lib.RecordBatch, Iterable[pyarrow.lib.RecordBatch], pyarrow.lib.RecordBatchReader]) – Data to write. If passing iterable, the schema must also be given.
schema (Optional[pyarrow.lib.Schema]) – Optional schema to write.
partition_by (Optional[List[str]]) – List of columns to partition the table by. Only required when creating a new table.
filesystem (Optional[pyarrow._fs.FileSystem]) – Optional filesystem to pass to PyArrow. If not provided will be inferred from uri.
mode (Literal['error', 'append', 'overwrite', 'ignore']) – How to handle existing data. Default is to error if table already exists. If ‘append’, will add new data. If ‘overwrite’, will replace table with new data. If ‘ignore’, will not write anything if table already exists.
file_options (Optional[pyarrow._dataset_parquet.ParquetFileWriteOptions]) – Optional write options for Parquet (ParquetFileWriteOptions). Can be provided with defaults using ParquetFileWriteOptions().make_write_options(). Please refer to https://github.com/apache/arrow/blob/master/python/pyarrow/_dataset_parquet.pyx#L492-L533 for the list of available options
max_open_files (int) – Limits the maximum number of files that can be left open while writing. If an attempt is made to open too many files then the least recently used file will be closed. If this setting is set too low you may end up fragmenting your data into many small files.
max_rows_per_file (int) – Maximum number of rows per file. If greater than 0 then this will limit how many rows are placed in any single file. Otherwise there will be no limit and one file will be created in each output directory unless files need to be closed to respect max_open_files
min_rows_per_group (int) – Minimum number of rows per group. When the value is set, the dataset writer will batch incoming data and only write the row groups to the disk when sufficient rows have accumulated.
max_rows_per_group (int) – Maximum number of rows per group. If the value is set, then the dataset writer may split up large incoming batches into multiple row groups. If this value is set, then min_rows_per_group should also be set.
name (Optional[str]) – User-provided identifier for this table.
description (Optional[str]) – User-provided description for this table.
configuration (Optional[Mapping[str, Optional[str]]]) – A map containing configuration options for the metadata action.
overwrite_schema (bool) – If True, allows updating the schema of the table.
- Return type
None
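A sketch of creating and then appending to a table (the path, column names, and partitioning are illustrative):

    import pandas as pd
    from deltalake import write_deltalake

    df = pd.DataFrame({"id": [1, 2, 3], "year": ["2021", "2021", "2022"]})

    # Create a new table partitioned by "year"; mode defaults to "error".
    write_deltalake("./data/events", df, partition_by=["year"])

    # Append a second batch to the existing table.
    write_deltalake("./data/events", df, mode="append")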
DeltaSchema
- class deltalake.schema.ArrayType(element_type, contains_null)
Concrete class for array data types.
- Parameters
element_type (deltalake.schema.DataType) –
contains_null (bool) –
- class deltalake.schema.DataType(type)
Base class of all Delta data types.
- Parameters
type (str) –
- Return type
None
- classmethod from_dict(json_dict)
Generate a DataType from a DataType in json format.
- Parameters
json_dict (Dict[str, Any]) – the data type in json format
- Returns
the Delta DataType
- Return type
deltalake.schema.DataType
- class deltalake.schema.Field(name, type, nullable, metadata=None)
Create a DeltaTable Field instance.
- Parameters
name (str) –
type (deltalake.schema.DataType) –
nullable (bool) –
metadata (Optional[Dict[str, str]]) –
- Return type
None
- class deltalake.schema.MapType(key_type, value_type, value_contains_null)
Concrete class for map data types.
- Parameters
key_type (deltalake.schema.DataType) –
value_type (deltalake.schema.DataType) –
value_contains_null (bool) –
- class deltalake.schema.Schema(fields, json_value)
Create a DeltaTable Schema instance.
- Parameters
fields (List[deltalake.schema.Field]) –
json_value (Dict[str, Any]) –
- Return type
None
- classmethod from_json(json_data)
Generate a DeltaTable Schema from a json format.
- Parameters
json_data (str) – the schema in json format
- Returns
the DeltaTable schema
- Return type
deltalake.schema.Schema
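A sketch with a minimal single-column Delta schema document:

    from deltalake.schema import Schema

    json_schema = (
        '{"type":"struct","fields":['
        '{"name":"id","type":"long","nullable":true,"metadata":{}}]}'
    )
    schema = Schema.from_json(json_schema)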
- class deltalake.schema.StructType(fields)
Concrete class for struct data types.
- Parameters
fields (List[deltalake.schema.Field]) –
- deltalake.schema.pyarrow_datatype_from_dict(json_dict)
Create a DataType in PyArrow format from a DataType in json format.
- Parameters
json_dict (Dict[str, Any]) – the DataType in json format
- Returns
the DataType in PyArrow format
- Return type
pyarrow.lib.DataType
- deltalake.schema.pyarrow_field_from_dict(field)
Create a Field in PyArrow format from a Field in json format.
- Parameters
field (Dict[str, Any]) – the field in json format
- Returns
the Field in PyArrow format
- Return type
pyarrow.lib.Field
- deltalake.schema.pyarrow_schema_from_json(json_data)
Create a Schema in PyArrow format from a Schema in json format.
- Parameters
json_data (str) – the schema in json format
- Returns
the Schema in PyArrow format
- Return type
pyarrow.lib.Schema
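A sketch reusing the minimal schema document from the Schema.from_json example above:

    from deltalake.schema import pyarrow_schema_from_json

    pa_schema = pyarrow_schema_from_json(json_schema)
    print(pa_schema)  # a pyarrow.Schema with a single int64 field "id"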
DataCatalog
- class deltalake.data_catalog.DataCatalog(value)
List of the Data Catalogs
DeltaStorageHandler
- class deltalake.fs.DeltaStorageHandler(table_uri)
DeltaStorageHandler is a concrete implementation of a PyArrow FileSystemHandler.
- Parameters
table_uri (str) –
- Return type
None
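A sketch of adapting the handler to the standard PyArrow filesystem interface (the table path is illustrative):

    import pyarrow.fs as pa_fs
    from deltalake.fs import DeltaStorageHandler

    # PyFileSystem wraps any FileSystemHandler as a regular FileSystem.
    filesystem = pa_fs.PyFileSystem(DeltaStorageHandler("./data/delta-table"))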
- copy_file(src, dest)
Copy a file.
If the destination exists and is a directory, an error is returned. Otherwise, it is replaced.
- Parameters
src (str) – The path of the file to be copied from.
dest (str) – The destination path where the file is copied to.
- Return type
None
- create_dir(path, *, recursive=True)
Create a directory and subdirectories.
This function succeeds if the directory already exists.
- Parameters
path (str) – The path of the new directory.
recursive (bool) – Create nested directories as well.
- Return type
None
- delete_dir(path)
Delete a directory and its contents, recursively.
- Parameters
path (str) – The path of the directory to be deleted.
- Return type
None
- delete_dir_contents(path)
Delete a directory’s contents, recursively.
Like delete_dir, but doesn’t delete the directory itself.
- Parameters
path (str) – The path of the directory to be deleted.
- Return type
None
- delete_file(path)
Delete a file.
- Parameters
path (str) – The path of the file to be deleted.
- Return type
None
- delete_root_dir_contents()
Delete a directory’s contents, recursively.
Like delete_dir_contents, but for the root directory (path is empty or "/")
- Return type
None
- get_file_info(paths)
Get info for the given files.
- Parameters
paths (List[str]) – List of file paths
- Returns
list of file info objects
- Return type
List[pyarrow._fs.FileInfo]
- get_file_info_selector(selector)
Get info for the files defined by FileSelector.
- Parameters
selector (pyarrow._fs.FileSelector) – FileSelector object
- Returns
list of file info objects
- Return type
List[pyarrow._fs.FileInfo]
- get_type_name()
The filesystem’s type name.
- Returns
The filesystem’s type name.
- Return type
str
- move(src, dest)
Move / rename a file or directory.
If the destination exists:
- if it is a non-empty directory, an error is returned
- otherwise, if it has the same type as the source, it is replaced
- otherwise, behavior is unspecified (implementation-dependent)
- Parameters
src (str) – The path of the file or the directory to be moved.
dest (str) – The destination path where the file or directory is moved to.
- Return type
None
- normalize_path(path)
Normalize filesystem path.
- Parameters
path (str) – the path to normalize
- Returns
the normalized path
- Return type
str
- open_append_stream(path, metadata=None)
DEPRECATED: Open an output stream for appending.
If the target doesn’t exist, a new empty file is created.
- Parameters
path (str) – The source to open for writing.
metadata (Optional[Dict[str, Any]]) – If not None, a mapping of string keys to string values.
- Returns
NativeFile
- Return type
pyarrow.lib.NativeFile
- open_input_file(path)
Open an input file for random access reading.
- Parameters
path (str) – The source to open for reading.
- Returns
NativeFile
- Return type
pyarrow.lib.NativeFile
- open_input_stream(path)
Open an input stream for sequential reading.
- Parameters
path (str) – The source to open for reading.
- Returns
NativeFile
- Return type
pyarrow.lib.NativeFile
- open_output_stream(path, metadata=None)
Open an output stream for sequential writing.
If the target already exists, existing data is truncated.
- Parameters
path (str) – The source to open for writing.
metadata (Optional[Dict[str, Any]]) – If not None, a mapping of string keys to string values.
- Returns
NativeFile
- Return type
pyarrow.lib.NativeFile