Writer
Write to Delta Tables
deltalake.write_deltalake
write_deltalake(table_or_uri: Union[str, Path, DeltaTable], data: Union[pd.DataFrame, ds.Dataset, pa.Table, pa.RecordBatch, Iterable[pa.RecordBatch], RecordBatchReader, ArrowStreamExportable], *, schema: Optional[Union[pa.Schema, DeltaSchema]] = None, partition_by: Optional[Union[List[str], str]] = None, mode: Literal['error', 'append', 'overwrite', 'ignore'] = 'error', file_options: Optional[ds.ParquetFileWriteOptions] = None, max_partitions: Optional[int] = None, max_open_files: int = 1024, max_rows_per_file: int = 10 * 1024 * 1024, min_rows_per_group: int = 64 * 1024, max_rows_per_group: int = 128 * 1024, name: Optional[str] = None, description: Optional[str] = None, configuration: Optional[Mapping[str, Optional[str]]] = None, schema_mode: Optional[Literal['merge', 'overwrite']] = None, storage_options: Optional[Dict[str, str]] = None, partition_filters: Optional[List[Tuple[str, str, Any]]] = None, predicate: Optional[str] = None, target_file_size: Optional[int] = None, large_dtypes: bool = False, engine: Literal['pyarrow', 'rust'] = 'rust', writer_properties: Optional[WriterProperties] = None, custom_metadata: Optional[Dict[str, str]] = None, post_commithook_properties: Optional[PostCommitHookProperties] = None, commit_properties: Optional[CommitProperties] = None) -> None
Write to a Delta Lake table
If the table does not already exist, it will be created.
The pyarrow writer supports protocol version 2 only and will not be updated. For higher protocol support, use engine='rust', which is now the default.
To enable safe concurrent writes when writing to S3, an additional locking mechanism must be supplied. For more information on enabling concurrent writing to S3, follow this guide.
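A minimal usage sketch (the local path and column names below are illustrative, not taken from this page):

```python
import pandas as pd
from deltalake import write_deltalake

df = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})

# Creates the table if it does not exist; with the default mode='error',
# a second call against the existing table would raise.
write_deltalake("tmp/my_table", df)

# Append additional rows to the existing table.
write_deltalake("tmp/my_table", pd.DataFrame({"id": [4], "value": ["d"]}), mode="append")
```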
Parameters:
Name | Type | Description | Default |
---|---|---|---|
table_or_uri | Union[str, Path, DeltaTable] | URI of a table or a DeltaTable object. | required |
data | Union[DataFrame, Dataset, Table, RecordBatch, Iterable[RecordBatch], RecordBatchReader, ArrowStreamExportable] | Data to write. If passing an iterable, the schema must also be given. | required |
schema | Optional[Union[pa.Schema, DeltaSchema]] | Optional schema to write. | None |
partition_by | Optional[Union[List[str], str]] | List of columns to partition the table by. Only required when creating a new table. | None |
mode | Literal['error', 'append', 'overwrite', 'ignore'] | How to handle existing data. Default is to error if the table already exists. If 'append', will add new data. If 'overwrite', will replace the table with new data. If 'ignore', will not write anything if the table already exists. | 'error' |
file_options | Optional[ParquetFileWriteOptions] | Optional write options for Parquet (ParquetFileWriteOptions). Can be provided with defaults using ParquetFileWriteOptions().make_write_options(). Please refer to https://github.com/apache/arrow/blob/master/python/pyarrow/_dataset_parquet.pyx#L492-L533 for the list of available options. Only used in the pyarrow engine. | None |
max_partitions | Optional[int] | The maximum number of partitions that will be used. Only used in the pyarrow engine. | None |
max_open_files | int | Limits the maximum number of files that can be left open while writing. If an attempt is made to open too many files, the least recently used file will be closed. If this setting is set too low you may end up fragmenting your data into many small files. Only used in the pyarrow engine. | 1024 |
max_rows_per_file | int | Maximum number of rows per file. If greater than 0, this limits how many rows are placed in any single file. Otherwise there is no limit and one file is created in each output directory unless files need to be closed to respect max_open_files. Only used in the pyarrow engine. | 10 * 1024 * 1024 |
min_rows_per_group | int | Minimum number of rows per group. When the value is set, the dataset writer will batch incoming data and only write the row groups to disk when sufficient rows have accumulated. Only used in the pyarrow engine. | 64 * 1024 |
max_rows_per_group | int | Maximum number of rows per group. If the value is set, then the dataset writer may split up large incoming batches into multiple row groups. If this value is set, then min_rows_per_group should also be set. | 128 * 1024 |
name | Optional[str] | User-provided identifier for this table. | None |
description | Optional[str] | User-provided description for this table. | None |
configuration | Optional[Mapping[str, Optional[str]]] | A map containing configuration options for the metadata action. | None |
schema_mode | Optional[Literal['merge', 'overwrite']] | If set to "overwrite", allows replacing the schema of the table. Set to "merge" to merge with the existing schema. | None |
storage_options | Optional[Dict[str, str]] | Options passed to the native delta filesystem. | None |
predicate | Optional[str] | When using `overwrite` mode, replaces only the data that matches the predicate. Only used in the rust engine. | None |
target_file_size | Optional[int] | Override for the target file size for data files written to the delta table. If not passed, it's taken from the `delta.targetFileSize` table configuration. | None |
partition_filters | Optional[List[Tuple[str, str, Any]]] | The partition filters that will be used for partition overwrite. Only used in the pyarrow engine. | None |
large_dtypes | bool | Only used in the pyarrow engine. | False |
engine | Literal['pyarrow', 'rust'] | Writer engine to write the delta table. The pyarrow engine is deprecated and will be removed in v1.0. | 'rust' |
writer_properties | Optional[WriterProperties] | Pass writer properties to the Rust parquet writer. | None |
custom_metadata | Optional[Dict[str, str]] | Deprecated and will be removed in future versions. Use commit_properties instead. | None |
post_commithook_properties | Optional[PostCommitHookProperties] | Properties for the post commit hook. If None, default values are used. | None |
commit_properties | Optional[CommitProperties] | Properties of the transaction commit. If None, default values are used. | None |
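As a sketch of how several of these parameters combine (the table path, column names, and predicate below are illustrative), a partitioned table can be created and then selectively overwritten with the rust engine:

```python
import pandas as pd
from deltalake import write_deltalake

df = pd.DataFrame({"country": ["US", "DE"], "value": [1, 2]})

# Create (or replace) a table partitioned by 'country'.
write_deltalake("tmp/partitioned_table", df, partition_by=["country"], mode="overwrite")

# Overwrite only the rows matching the predicate; the new data is assumed
# to satisfy the predicate itself.
write_deltalake(
    "tmp/partitioned_table",
    pd.DataFrame({"country": ["US"], "value": [3]}),
    mode="overwrite",
    predicate="country = 'US'",
)
```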
deltalake.BloomFilterProperties
dataclass
BloomFilterProperties(set_bloom_filter_enabled: Optional[bool], fpp: Optional[float] = None, ndv: Optional[int] = None)
The Bloom Filter Properties instance for the Rust parquet writer.
Create a Bloom Filter Properties instance for the Rust parquet writer:
Parameters:
Name | Type | Description | Default |
---|---|---|---|
set_bloom_filter_enabled | Optional[bool] | If True and no fpp or ndv are provided, the default values will be used. | required |
fpp | Optional[float] | The false positive probability for the bloom filter. Must be between 0 and 1 exclusive. | None |
ndv | Optional[int] | The number of distinct values for the bloom filter. | None |
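For example (a sketch assuming BloomFilterProperties is importable from the top-level deltalake package, as its qualified name above suggests):

```python
from deltalake import BloomFilterProperties

# Enable the bloom filter and fall back to the default fpp and ndv.
default_bloom = BloomFilterProperties(set_bloom_filter_enabled=True)

# Tune the false positive probability and expected number of distinct values.
tuned_bloom = BloomFilterProperties(set_bloom_filter_enabled=True, fpp=0.01, ndv=100_000)
```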
deltalake.ColumnProperties
dataclass
ColumnProperties(dictionary_enabled: Optional[bool] = None, max_statistics_size: Optional[int] = None, bloom_filter_properties: Optional[BloomFilterProperties] = None)
The Column Properties instance for the Rust parquet writer.
Create a Column Properties instance for the Rust parquet writer:
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dictionary_enabled | Optional[bool] | Enable dictionary encoding for the column. | None |
max_statistics_size | Optional[int] | Maximum size of statistics for the column. | None |
bloom_filter_properties | Optional[BloomFilterProperties] | Bloom Filter Properties for the column. | None |
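A short sketch combining the two dataclasses above (import paths assumed to be the top-level deltalake package):

```python
from deltalake import BloomFilterProperties, ColumnProperties

# Dictionary-encode the column and attach a bloom filter to it.
col_props = ColumnProperties(
    dictionary_enabled=True,
    bloom_filter_properties=BloomFilterProperties(
        set_bloom_filter_enabled=True, fpp=0.01, ndv=50_000
    ),
)
```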
deltalake.WriterProperties
dataclass
WriterProperties(data_page_size_limit: Optional[int] = None, dictionary_page_size_limit: Optional[int] = None, data_page_row_count_limit: Optional[int] = None, write_batch_size: Optional[int] = None, max_row_group_size: Optional[int] = None, compression: Optional[Literal['UNCOMPRESSED', 'SNAPPY', 'GZIP', 'BROTLI', 'LZ4', 'ZSTD', 'LZ4_RAW']] = None, compression_level: Optional[int] = None, statistics_truncate_length: Optional[int] = None, default_column_properties: Optional[ColumnProperties] = None, column_properties: Optional[Dict[str, ColumnProperties]] = None)
A Writer Properties instance for the Rust parquet writer.
Create a Writer Properties instance for the Rust parquet writer:
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data_page_size_limit | Optional[int] | Limit DataPage size to this in bytes. | None |
dictionary_page_size_limit | Optional[int] | Limit the size of each DataPage used to store dictionaries to this amount in bytes. | None |
data_page_row_count_limit | Optional[int] | Limit the number of rows in each DataPage. | None |
write_batch_size | Optional[int] | Splits internally to smaller batch size. | None |
max_row_group_size | Optional[int] | Max number of rows in a row group. | None |
compression | Optional[Literal['UNCOMPRESSED', 'SNAPPY', 'GZIP', 'BROTLI', 'LZ4', 'ZSTD', 'LZ4_RAW']] | Compression type. | None |
compression_level | Optional[int] | If None and the compression type supports a level, the default level will be used. Only relevant for GZIP (levels 1-9), BROTLI (levels 1-11), and ZSTD (levels 1-22). | None |
statistics_truncate_length | Optional[int] | Maximum length of truncated min/max values in statistics. | None |
default_column_properties | Optional[ColumnProperties] | Default Column Properties for the Rust parquet writer. | None |
column_properties | Optional[Dict[str, ColumnProperties]] | Column Properties for the Rust parquet writer. | None |
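Putting it together, a sketch of passing writer properties to write_deltalake (the path and column names are illustrative):

```python
import pandas as pd
from deltalake import ColumnProperties, WriterProperties, write_deltalake

writer_props = WriterProperties(
    compression="ZSTD",
    compression_level=3,
    max_row_group_size=128 * 1024,
    # Dictionary-encode all columns by default...
    default_column_properties=ColumnProperties(dictionary_enabled=True),
    # ...but override the behaviour for a single column.
    column_properties={"id": ColumnProperties(dictionary_enabled=False)},
)

df = pd.DataFrame({"id": range(10), "value": ["x"] * 10})
write_deltalake("tmp/zstd_table", df, mode="overwrite", writer_properties=writer_props)
```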
Convert to Delta Tables
deltalake.convert_to_deltalake
convert_to_deltalake(uri: Union[str, Path], mode: Literal['error', 'ignore'] = 'error', partition_by: Optional[pa.Schema] = None, partition_strategy: Optional[Literal['hive']] = None, name: Optional[str] = None, description: Optional[str] = None, configuration: Optional[Mapping[str, Optional[str]]] = None, storage_options: Optional[Dict[str, str]] = None, custom_metadata: Optional[Dict[str, str]] = None) -> None
Convert Parquet tables to Delta tables.
Currently only HIVE-partitioned tables are supported. Converting to Delta creates a transaction log commit with add actions, along with any additional properties provided, such as configuration, name, and description.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
uri | Union[str, Path] | URI of a table. | required |
partition_by | Optional[pa.Schema] | Optional partitioning schema if the table is partitioned. | None |
partition_strategy | Optional[Literal['hive']] | Optional partition strategy to read and convert. | None |
mode | Literal['error', 'ignore'] | How to handle existing data. Default is to error if the table already exists. If 'ignore', will not convert anything if the table already exists. | 'error' |
name | Optional[str] | User-provided identifier for this table. | None |
description | Optional[str] | User-provided description for this table. | None |
configuration | Optional[Mapping[str, Optional[str]]] | A map containing configuration options for the metadata action. | None |
storage_options | Optional[Dict[str, str]] | Options passed to the native delta filesystem. Unused if 'filesystem' is defined. | None |
custom_metadata | Optional[Dict[str, str]] | Custom metadata that will be added to the transaction commit. | None |
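A minimal conversion sketch (the directory layout and partition column are illustrative; per the signature above, partition_by takes a pyarrow schema describing the partition columns):

```python
import pyarrow as pa
from deltalake import convert_to_deltalake

# Convert an existing HIVE-partitioned Parquet directory in place.
convert_to_deltalake(
    "path/to/parquet_table",
    partition_by=pa.schema([pa.field("country", pa.string())]),
    partition_strategy="hive",
    name="converted_table",
)
```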