AWS also had customers with petabytes of data in Redshift for analysis. The conversation is missing a key point: DuckDB is optimizing for a different class of use cases, namely data science rather than traditional data warehousing. The real distinction is use case, but it gets framed as a question of size. Even at small sizes there are other considerations: access control, concurrency control, reliability, availability, and so on. The requirements differ across those use cases. Data science tends to be single user and local, with lower availability requirements than warehouses that serve production pipelines, data sharing, and so on. DuckDB can be used for those workloads too, but it isn't optimized for them.
> ..here is a small number of tables in Redshift with trillions of rows, while the majority is much more reasonably sized with only millions of rows. In fact, most tables have less than a million rows and the vast majority (98%) has less than a billion rows.
The argument can be made that the 98% of Redshift users in that bucket could potentially get by with DuckDB.
Data size is a red herring in the conversation.
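To make the "single user, local" point concrete, here's roughly what that class of workload looks like in DuckDB's Python API. This is just a sketch; the Parquet file name is made up:

    # A minimal single-user, local analysis: no server, no cluster to provision.
    # Assumes a local Parquet file named events.parquet (hypothetical).
    import duckdb

    con = duckdb.connect()  # in-process, in-memory by default
    result = con.sql("""
        SELECT user_id, count(*) AS n_events
        FROM read_parquet('events.parquet')
        GROUP BY user_id
        ORDER BY n_events DESC
        LIMIT 10
    """).df()  # hand the result straight to pandas for further analysis
    print(result)

Notice what's absent: users, grants, connection pools, failover. That's the point about requirements, not row counts.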