@Evolving public interface ParquetHandler
Modifier and Type | Method and Description |
---|---|
CloseableIterator<ColumnarBatch> | readParquetFiles(CloseableIterator<FileStatus> fileIter, StructType physicalSchema, java.util.Optional<Predicate> predicate)<br>Read the Parquet format files at the given locations and return the data as a ColumnarBatch with the columns requested by physicalSchema. |
void | writeParquetFileAtomically(String filePath, CloseableIterator<FilteredColumnarBatch> data)<br>Write the given data as a Parquet file. |
CloseableIterator<DataFileStatus> | writeParquetFiles(String directoryPath, CloseableIterator<FilteredColumnarBatch> dataIter, java.util.List<Column> statsColumns)<br>Write the given data batches to Parquet files. |
CloseableIterator<ColumnarBatch> readParquetFiles(CloseableIterator<FileStatus> fileIter, StructType physicalSchema, java.util.Optional<Predicate> predicate) throws java.io.IOException

Read the Parquet format files at the given locations and return the data as a ColumnarBatch with the columns requested by physicalSchema.

If physicalSchema has a StructField with the column name StructField.METADATA_ROW_INDEX_COLUMN_NAME and the field is a metadata column (StructField.isMetadataColumn()), that column must be populated with the file row index.

How does a column in physicalSchema match a column in the Parquet file? If the StructField has a field id in its metadata under the key `parquet.field.id`, the column is first matched by id. If no column is found by id, the column is matched by name. When matching by name, a case-sensitive match is attempted first; if none is found, a case-insensitive match is attempted.
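The resolution order described above can be sketched in plain Java. This is an illustrative stand-in rather than kernel code: the map from Parquet column name to field id stands in for the file's real schema metadata, and the class and method names are invented for the example.

```java
import java.util.Map;
import java.util.Optional;

// Sketch of the documented column-resolution order:
// 1. by field id (from the `parquet.field.id` metadata key), if present;
// 2. by case-sensitive name;
// 3. by case-insensitive name.
public class ColumnResolution {
    static Optional<String> resolve(
            String requestedName,
            Optional<Integer> requestedFieldId,
            Map<String, Integer> fileFieldIds) {
        // 1. If the StructField carries a `parquet.field.id`, try to match by id first.
        if (requestedFieldId.isPresent()) {
            for (Map.Entry<String, Integer> e : fileFieldIds.entrySet()) {
                if (e.getValue().equals(requestedFieldId.get())) {
                    return Optional.of(e.getKey());
                }
            }
        }
        // 2. Fall back to a case-sensitive name match.
        if (fileFieldIds.containsKey(requestedName)) {
            return Optional.of(requestedName);
        }
        // 3. Finally, try a case-insensitive name match.
        for (String name : fileFieldIds.keySet()) {
            if (name.equalsIgnoreCase(requestedName)) {
                return Optional.of(name);
            }
        }
        return Optional.empty();
    }
}
```

Matching by id first is what makes column renames safe: a column renamed in the table can still be located in older Parquet files by its stable field id.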
Parameters:
fileIter - Iterator of files to read data from.
physicalSchema - Select list of columns to read from the Parquet file.
predicate - Optional predicate which the Parquet reader can use to prune rows that don't satisfy it. Because pruning is optional and may be incomplete, the caller is still responsible for applying the predicate to the data returned by this method.

Returns:
An iterator of ColumnarBatches containing the data in columnar format. It is the responsibility of the caller to close the iterator. The data is returned in the same order as the files given in fileIter.

Throws:
java.io.IOException - if an I/O error occurs during the read.

CloseableIterator<DataFileStatus> writeParquetFiles(String directoryPath, CloseableIterator<FilteredColumnarBatch> dataIter, java.util.List<Column> statsColumns) throws java.io.IOException
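A hypothetical caller sketch, not taken from this Javadoc: it assumes the io.delta.kernel.* package layout of the Delta Kernel Java API, an already-obtained ParquetHandler, and an illustrative statistics column named "id". It shows the caller responsibilities stated in the parameter documentation: closing both iterators and treating statistics collection as optional.

```java
import java.io.IOException;
import java.util.Collections;
import java.util.List;

import io.delta.kernel.data.FilteredColumnarBatch;
import io.delta.kernel.engine.ParquetHandler;
import io.delta.kernel.expressions.Column;
import io.delta.kernel.utils.CloseableIterator;
import io.delta.kernel.utils.DataFileStatus;

public class WriteBatchesExample {
    // `handler` and `dataIter` are assumed to be provided by an Engine
    // implementation and by the caller, respectively.
    static void writeBatches(
            ParquetHandler handler,
            CloseableIterator<FilteredColumnarBatch> dataIter) throws IOException {
        // Statistics collection is optional; "id" is an assumed column name.
        List<Column> statsColumns = Collections.singletonList(new Column("id"));
        // The caller owns both iterators and must close them, even on failure.
        try (CloseableIterator<DataFileStatus> results =
                handler.writeParquetFiles("/tmp/delta-table/data", dataIter, statsColumns)) {
            while (results.hasNext()) {
                // Each status carries the written file's path and, optionally,
                // the collected statistics.
                System.out.println(results.next().getPath());
            }
        } finally {
            dataIter.close();
        }
    }
}
```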
Write the given data batches to Parquet files in the given directory.

Parameters:
directoryPath - Location where the data files should be written.
dataIter - Iterator of data batches to write. It is the responsibility of the caller to close the iterator.
statsColumns - List of columns to collect statistics for. Statistics collection is optional; if the implementation does not support it, it is ok to return no statistics.

Returns:
An iterator of DataFileStatus containing the status of the written files. Each status contains the file path and the optionally collected statistics for the file. It is the responsibility of the caller to close the iterator.

Throws:
java.io.IOException - if an I/O error occurs during the file writing. This may leave some files already written in the directory; it is the responsibility of the caller to clean them up.

void writeParquetFileAtomically(String filePath, CloseableIterator<FilteredColumnarBatch> data) throws java.io.IOException
Write the given data as a Parquet file at filePath.

Parameters:
filePath - Fully qualified destination file path.
data - Iterator of FilteredColumnarBatch to write.

Throws:
java.nio.file.FileAlreadyExistsException - if the file already exists and overwrite is false.
java.io.IOException - if any other I/O error occurs.
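An atomic-write contract like this is commonly met by writing to a temporary file and renaming it into place. The sketch below is not the kernel's implementation: it uses java.nio directly, a raw byte payload stands in for encoded Parquet data, and the explicit overwrite flag mirrors the overwrite behavior referenced in the throws clause above.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Illustrative sketch of one way to satisfy an atomic-write contract:
// write to a temporary file, then rename it into place in one step.
public class AtomicWriteSketch {
    static void writeAtomically(Path target, byte[] payload, boolean overwrite)
            throws IOException {
        if (!overwrite && Files.exists(target)) {
            // Matches the documented contract: fail if the destination exists.
            throw new FileAlreadyExistsException(target.toString());
        }
        Path tmp = Files.createTempFile(target.getParent(), "tmp-", ".parquet");
        try {
            Files.write(tmp, payload);
            // ATOMIC_MOVE makes the file visible at the target path all at once;
            // readers never observe a partially written file.
            Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
        } finally {
            // If the move succeeded, the temp file is already gone.
            Files.deleteIfExists(tmp);
        }
    }
}
```

Note the rename-based approach assumes the temporary file and the destination live on the same file system; a check-then-write race is still possible between the existence test and the move.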