pg_parquet: An Extension to Connect Postgres and Parquet

2024-10-18 banq

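Today, we're excited to release pg_parquet, an open source Postgres extension for working with Parquet files. The extension reads and writes Parquet files on local disk or in S3, natively from Postgres. With pg_parquet you can: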

Export tables or query results from Postgres to Parquet files
Ingest data from Parquet files into Postgres
Inspect the schema and metadata of existing Parquet files

The code is available at https://github.com/CrunchyData/pg_parquet/.

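Why pg_parquet?
Parquet is a great columnar file format that provides efficient data compression. Working with data in Parquet makes a lot of sense when you are sharing data between systems. You might be archiving older data, or you might want it in a format suited to analytics rather than transactional workloads. While there are plenty of tools for working with Parquet, Postgres users have been left to figure things out on their own. Now, with pg_parquet, Postgres and Parquet work together easily and natively. Better yet, you can work with Parquet without having to maintain yet another data pipeline.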

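Wait, what is Parquet? Apache Parquet is an open source, standard, column-oriented file format that grew out of the Hadoop era of big data. Parquet stores data in files laid out in a way that is optimized for SQL queries. In the world of data lakes, Parquet is ubiquitous.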

Using pg_parquet
By extending the Postgres COPY command, we can efficiently copy data to and from Parquet, either on the local server or in S3.

-- Copy a query result into a Parquet file on the postgres server
-- (my_table is a placeholder name; the bare word "table" is a reserved keyword)
COPY (SELECT * FROM my_table) TO '/tmp/data.parquet' WITH (format 'parquet');

-- Copy a query result into Parquet in S3
COPY (SELECT * FROM my_table) TO 's3://mybucket/data.parquet' WITH (format 'parquet');

-- Load data from Parquet in S3
COPY my_table FROM 's3://mybucket/data.parquet' WITH (format 'parquet');
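
The COPY statements above assume the extension is already enabled in the current database. A minimal setup sketch follows. The `shared_preload_libraries` requirement mentioned in the comment is an assumption based on the project README as understood here, so check it against the version you install.

```sql
-- Assumption: pg_parquet is installed on the server and listed in
-- shared_preload_libraries in postgresql.conf (changing that setting
-- requires a server restart).
CREATE EXTENSION IF NOT EXISTS pg_parquet;

-- Confirm the extension is registered in this database.
SELECT extname, extversion
FROM pg_extension
WHERE extname = 'pg_parquet';
```

The second query uses only the standard `pg_extension` catalog, so it works as a quick check before running any of the COPY examples.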

Let's take a products table as an example, and not just a basic one: this version uses composite Postgres types and arrays:

-- create composite types
CREATE TYPE product_item AS (id INT, name TEXT, price float4);
CREATE TYPE product AS (id INT, name TEXT, items product_item[]);

-- create a table with complex types
CREATE TABLE product_example (
    id int,
    product product,
    products product[],
    created_at TIMESTAMP,
    updated_at TIMESTAMPTZ
);

-- insert some rows into the table
INSERT INTO product_example values (
    1,
    ROW(1, 'product 1', ARRAY[ROW(1, 'item 1', 1.0), ROW(2, 'item 2', 2.0), NULL]::product_item[])::product,
    ARRAY[ROW(1, NULL, NULL)::product, NULL],
    now(),
    '2022-05-01 12:00:00-04'
);

-- copy the table to a parquet file
COPY product_example TO '/tmp/product_example.parquet' (format 'parquet', compression 'gzip');

-- copy the parquet file to the table
COPY product_example FROM '/tmp/product_example.parquet';

-- show table
SELECT * FROM product_example;
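
To check that the nested values survived the round trip, it can help to flatten the composite column rather than read the raw row output. The following is a sketch in plain Postgres SQL, with no pg_parquet-specific calls; the aliases `pe` and `item` are chosen here for illustration, and the comments assume the statements above ran once, in order, on a fresh table.

```sql
-- One output row per element of product.items. The NULL array element
-- should appear as a row whose item columns are all NULL.
SELECT pe.id,
       (pe.product).name AS product_name,
       item.name         AS item_name,
       item.price        AS item_price
FROM product_example AS pe
CROSS JOIN LATERAL unnest((pe.product).items) AS item;

-- The file was copied back into the same table, so the single
-- inserted row is now expected to appear twice.
SELECT count(*) FROM product_example;
```

Loading into a separate, empty table with the same definition would avoid the duplicate row if a clean comparison is needed.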

Inspecting Parquet files
Beyond copying data into and out of Parquet, you can explore existing Parquet files to start understanding their structure.

-- Describe a parquet schema
SELECT name, type_name, logical_type, field_id
FROM parquet.schema('s3://mybucket/data.parquet');
┌──────────────┬────────────┬──────────────┬──────────┐
│     name     │ type_name  │ logical_type │ field_id │
├──────────────┼────────────┼──────────────┼──────────┤
│ arrow_schema │            │              │          │
│ name         │ BYTE_ARRAY │ STRING       │        0 │
│ s            │ INT32      │              │        1 │
└──────────────┴────────────┴──────────────┴──────────┘
(3 rows)

-- Retrieve parquet detailed metadata including column statistics
SELECT row_group_id, column_id, row_group_num_rows, row_group_bytes
FROM parquet.metadata('s3://mybucket/data.parquet');
┌──────────────┬───────────┬────────────────────┬─────────────────┐
│ row_group_id │ column_id │ row_group_num_rows │ row_group_bytes │
├──────────────┼───────────┼────────────────────┼─────────────────┤
│            0 │         0 │                100 │             622 │
│            0 │         1 │                100 │             622 │
└──────────────┴───────────┴────────────────────┴─────────────────┘
(2 rows)

-- Retrieve parquet file metadata such as the total number of rows
SELECT created_by, num_rows, format_version
FROM parquet.file_metadata('s3://mybucket/data.parquet');
┌────────────┬──────────┬────────────────┐
│ created_by │ num_rows │ format_version │
├────────────┼──────────┼────────────────┤
│ pg_parquet │      100 │ 1              │
└────────────┴──────────┴────────────────┘
(1 row)
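
Because these inspection functions return ordinary result sets, they can be combined with regular SQL. As a sketch, the query below summarizes the file per row group, using only the columns shown in the `parquet.metadata` output above; the file path is the same placeholder used in those examples.

```sql
-- parquet.metadata returns one row per (row group, column), so the
-- row-group-level values repeat. max() collapses them to one row per group.
SELECT row_group_id,
       max(row_group_num_rows) AS num_rows,
       max(row_group_bytes)    AS bytes
FROM parquet.metadata('s3://mybucket/data.parquet')
GROUP BY row_group_id
ORDER BY row_group_id;
```

Summing `num_rows` across the groups should match the `num_rows` value reported by `parquet.file_metadata` for the same file.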

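Postgres has long been trusted for transactional workloads, but we believe that in the very near future it will be just as capable for analytics. We're excited to release pg_parquet as one more step toward making Postgres the only database you need.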