It is becoming clearer every day that Synthetic Data Generation will be a must-have skill and technology in the coming years!
I'm the lead developer of an Open Source project called SDV (The Synthetic Data Vault) [0], which offers an ecosystem of Python libraries and resources for Synthetic Data Generation across different data modalities. It can learn and sample synthetic clones of Single Table, Relational and Time Series datasets, and offers tools to evaluate the quality of the generated data and benchmark different models.
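For a sense of the workflow, here is a minimal sketch of fitting a single-table model and sampling synthetic rows. The class name (GaussianCopula under sdv.tabular) and exact call signatures are assumptions based on the 0.x releases and may differ in the version you install:

```python
# Minimal single-table example with SDV (API names assumed from the 0.x
# releases; newer versions expose equivalent "synthesizer" classes).
import pandas as pd
from sdv.tabular import GaussianCopula

# Any ordinary DataFrame works as training data.
real_data = pd.DataFrame({
    "age": [23, 45, 31, 52, 38],
    "income": [32000, 81000, 47000, 95000, 60000],
    "segment": ["a", "b", "a", "c", "b"],
})

model = GaussianCopula()
model.fit(real_data)                        # learn the joint distribution
synthetic_data = model.sample(num_rows=100) # draw 100 synthetic rows
print(synthetic_data.head())
```

The synthetic rows follow the learned distribution rather than copying the originals, which is what makes them shareable.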
Ex-NSA, like Google. They do everything to keep your data safe, and everything to keep the profiling they do based on your data (= their data) safe as well.
"Israel has their own version of the NSA called Unit 8200. Listen along to learn what someone gets into the group, what they do, and what these members do once they leave the unit." [1]
> Gretel wants to change that by making it faster and easier to anonymize data sets.
That's nothing like 'GitHub for data'. I was expecting a high-level collaboration tool built on the 'dat' protocol (which is how versioning is done for datasets).
Yep - versioning is definitely important, and not what Gretel focuses on. You could connect a Gretel project stream up to a Dat backend for versioning/lineage.
So you could use Gretel to anonymize or build a synthetic version of a dataset for sharing, and then use Dat for versioning.
CEO of DoltHub here. Thanks for the shout out. The key here is that we also built an open source SQL database you can branch and merge. It's the Git in the GitHub.
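To make the branch-and-merge idea concrete, here is a rough sketch of driving the Dolt CLI from Python. It assumes the `dolt` binary is on your PATH and is run inside an empty directory; the table and branch names are made up for illustration:

```python
# Rough sketch of Dolt's branch/merge workflow, driven via the `dolt` CLI
# (assumes dolt is installed; run inside an empty directory).
import subprocess

def dolt(*args):
    subprocess.run(["dolt", *args], check=True)

DEFAULT_BRANCH = "main"  # older Dolt versions default to "master"

dolt("init")
dolt("sql", "-q", "CREATE TABLE people (id INT PRIMARY KEY, name TEXT)")
dolt("add", ".")
dolt("commit", "-m", "create people table")

# Branch, change the data, and merge it back - the same moves as Git,
# but the diff/merge operates on table rows instead of text lines.
dolt("checkout", "-b", "add-rows")
dolt("sql", "-q", "INSERT INTO people VALUES (1, 'Ada'), (2, 'Grace')")
dolt("add", ".")
dolt("commit", "-m", "add two rows")
dolt("checkout", DEFAULT_BRANCH)
dolt("merge", "add-rows")
```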
We're building something similar at Splitgraph [0], although we actually try to avoid reductive descriptors like "Git for data" and "GitHub for data." Forcing the Git workflow "for data" is the wrong approach, and we don't believe it's what analytics engineers actually want to use.
Our solution is a "data delivery network" [1] that looks like a big Postgres database with 40,000+ datasets in it. So you can connect to it [2] with any SQL client that is compatible with the Postgres wire protocol. You can query individual datasets, or join across them, as if they were tables in a Postgres database. Those datasets can be "live" (in which case we forward queries to the upstream data source, e.g. a government data portal), or they can be versioned snapshots called "data images." Data images are a lot like Docker images, and you can build them with declarative recipes called Splitfiles, all using our open source tool. [3] Then you can push the images to Splitgraph and query them alongside all the other data on the platform.
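Since the DDN speaks the Postgres wire protocol, any standard client library can query it. A minimal sketch with psycopg2 is below; the hostname, database name, credentials and dataset path are placeholders/assumptions, not values taken from the platform:

```python
# Minimal sketch of querying a Postgres-compatible data delivery network
# with a standard client. Hostname, dbname, credentials and the dataset
# path are assumptions / placeholders - substitute your own values.
import psycopg2

conn = psycopg2.connect(
    host="data.splitgraph.com",   # assumed DDN endpoint
    port=5432,
    dbname="ddn",                 # placeholder database name
    user="<api-key>",
    password="<api-secret>",
)

with conn.cursor() as cur:
    # Datasets are addressed like schemas, e.g. "namespace/repository"."table"
    # (the names below are illustrative only).
    cur.execute('SELECT * FROM "some-namespace/some-dataset"."some_table" LIMIT 5')
    for row in cur.fetchall():
        print(row)
```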
The public platform is mostly community data right now. Soon you'll be able to connect arbitrary upstreams to it (e.g. connect directly to a Mongo/Postgres/MySQL database by providing its read-only credentials in the web UI). As an example, we have a repository that forwards queries to the OxCOVID19 database, which is a Postgres database maintained by a lab at Oxford. [4]
We're also building an enterprise program for hosting private deployments of Splitgraph. We have a pilot program where customers can use Splitgraph as a data cataloging and governance solution for their internal company data. We've also raised a round of funding but haven't announced it yet. If anyone is interested in joining the pilot program, get in touch. You can also read more details about how we dogfood Splitgraph in our own analytics stack. [5]
[0] https://github.com/sdv-dev/SDV