For me the issue is that DuckLake's feature of flushing inlined data to Parquet is still in alpha. One of the main problems with Parquet is that writing small batches leaves you with a lot of small Parquet files that are inefficient to work with in DuckDB. To solve this, DuckLake inlines these small writes into the catalog DBMS you choose (e.g. Postgres), but for a while it couldn't write them back out to Parquet. Last I checked that feature didn't exist yet; now it seems to be in alpha, which is nice to see, but I'd like more mature support before I consider switching some personal data projects over. https://ducklake.select/docs/stable/duckdb/advanced_features...
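For reference, this is roughly what the inlining-plus-flush flow looks like going by that docs page. I haven't battle-tested it; the option name (DATA_INLINING_ROW_LIMIT), the flush call (ducklake_flush_inlined_data), and the paths/table names are just what I recall from the docs or made up for illustration, so treat it as a sketch:

    import duckdb

    con = duckdb.connect()
    con.execute("INSTALL ducklake")
    con.execute("LOAD ducklake")

    # Attach a DuckLake catalog stored in a local DuckDB file. Small inserts at or
    # below the row limit get inlined into the catalog database instead of each
    # becoming a tiny Parquet file under DATA_PATH.
    con.execute("""
        ATTACH 'ducklake:my_lake.ducklake' AS my_lake
            (DATA_PATH 'data_files/', DATA_INLINING_ROW_LIMIT 10)
    """)

    # Many small writes land as inlined rows in the catalog, not as Parquet files.
    con.execute("CREATE TABLE my_lake.events (id INTEGER, payload VARCHAR)")
    con.execute("INSERT INTO my_lake.events VALUES (1, 'a'), (2, 'b')")

    # The alpha feature in question: flush the accumulated inlined rows out to Parquet.
    con.execute("CALL ducklake_flush_inlined_data('my_lake')")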
Data inlining is also currently limited to the DuckDB catalog (i.e. it doesn't work with Postgres catalogs)[0]. It's improving very quickly though, and I'm sure this will be expanded soon.
The DuckLake format has an unresolved, built-in chicken-and-egg conflict: it requires a SQL database to represent its catalog. But that is exactly what some people are running away from when they choose the Parquet format in the first place. Parquet = easy, SQL = hard; adding SQL to Parquet makes the resulting format hard. I would expect the catalog to be in Parquet format as well; then it becomes something self-bootstrapping and usable.
DuckLake is more comparable to Iceberg and Delta than to raw Parquet files. Iceberg requires a catalog layer too, a file-system-based one at its simplest. For DuckLake any RDBMS will do, including file-based ones like DuckDB and SQLite. The difference is that DuckLake uses that database, with all its ACID goodness, for all metadata operations, so there is no need to implement transactional semantics over a REST or object storage API.
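Concretely, picking the catalog backend is just the connection string you hand to ATTACH. A rough sketch (paths and DSNs are placeholders, and the exact strings are what I remember from the DuckLake docs):

    import duckdb

    con = duckdb.connect()
    con.execute("INSTALL ducklake")
    con.execute("LOAD ducklake")

    # Simplest case: the catalog is a local DuckDB file, Parquet data goes to DATA_PATH.
    con.execute("""
        ATTACH 'ducklake:metadata.ducklake' AS lake_local (DATA_PATH 'data_files/')
    """)

    # Same lakehouse layout, but the catalog lives in a SQLite file...
    con.execute("INSTALL sqlite")
    con.execute("LOAD sqlite")
    con.execute("""
        ATTACH 'ducklake:sqlite:metadata.sqlite' AS lake_sqlite (DATA_PATH 'data_files/')
    """)

    # ...or in Postgres (placeholder DSN). Either way, metadata operations ride on the
    # database's own ACID transactions; no REST catalog service is involved.
    con.execute("INSTALL postgres")
    con.execute("LOAD postgres")
    con.execute("""
        ATTACH 'ducklake:postgres:dbname=ducklake host=localhost' AS lake_pg
            (DATA_PATH 's3://my-bucket/lake/')
    """)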
It is not a chicken-and-egg problem; it is just a requirement to have an RDBMS available for systems like DuckLake and Hive to store their catalogs in. Metadata is relatively small and needs ACID reads/writes => a great RDBMS use case.