It is becoming clearer every day that Synthetic Data Generation will be a must-have skill and technology in the coming years!
I'm the lead developer of an Open Source project called SDV (The Synthetic Data Vault) [0], which offers an ecosystem of Python libraries and resources for Synthetic Data Generation across different data modalities. It can learn and sample synthetic clones of Single Table, Relational and Time Series datasets, and offers tools to evaluate the quality of the generated data and benchmark different models.
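For a sense of the workflow, here is a minimal sketch of fitting a single-table model and sampling synthetic rows. The class name (GaussianCopula under sdv.tabular) and exact call signatures are assumptions based on the 0.x releases and may differ in the version you install:

```python
# Minimal single-table example with SDV (API names assumed from the 0.x
# releases; newer versions expose equivalent "synthesizer" classes).
import pandas as pd
from sdv.tabular import GaussianCopula

# Any ordinary DataFrame works as training data.
real_data = pd.DataFrame({
    "age": [23, 45, 31, 52, 38],
    "income": [32000, 81000, 47000, 95000, 60000],
    "segment": ["a", "b", "a", "c", "b"],
})

model = GaussianCopula()
model.fit(real_data)                        # learn the joint distribution
synthetic_data = model.sample(num_rows=100) # draw 100 synthetic rows
print(synthetic_data.head())
```

The synthetic rows follow the learned distribution rather than copying the originals, which is what makes them shareable.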
Ex-NSA, like Google. They do everything to keep your data safe, and everything to keep the profiling they do based on your data (= their data) safe as well.
"Israel has their own version of the NSA called Unit 8200. Listen along to learn what someone gets into the group, what they do, and what these members do once they leave the unit." [1]
> Gretel wants to change that by making it faster and easier to anonymize data sets.
That's nothing like 'GitHub for data'. I was expecting a high-level collaboration tool built on the 'dat' protocol (which is how versioning is done for datasets).
Yep - versioning is definitely important, and not what Gretel focuses on. You could connect a Gretel project stream up to a Dat backend for versioning/lineage.
So you could use Gretel to anonymize or build a synthetic version of a dataset for sharing, and then use Dat for versioning.
CEO of DoltHub here. Thanks for the shout out. The key here is that we also built an open source SQL database you can branch and merge. It's the Git in the GitHub.
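To make the branch-and-merge idea concrete, here is a rough sketch of driving the Dolt CLI from Python. It assumes the `dolt` binary is on your PATH and is run inside an empty directory; the table and branch names are made up for illustration:

```python
# Rough sketch of Dolt's branch/merge workflow, driven via the `dolt` CLI
# (assumes dolt is installed; run inside an empty directory).
import subprocess

def dolt(*args):
    subprocess.run(["dolt", *args], check=True)

DEFAULT_BRANCH = "main"  # older Dolt versions default to "master"

dolt("init")
dolt("sql", "-q", "CREATE TABLE people (id INT PRIMARY KEY, name TEXT)")
dolt("add", ".")
dolt("commit", "-m", "create people table")

# Branch, change the data, and merge it back - the same moves as Git,
# but the diff/merge operates on table rows instead of text lines.
dolt("checkout", "-b", "add-rows")
dolt("sql", "-q", "INSERT INTO people VALUES (1, 'Ada'), (2, 'Grace')")
dolt("add", ".")
dolt("commit", "-m", "add two rows")
dolt("checkout", DEFAULT_BRANCH)
dolt("merge", "add-rows")
```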
We're building something similar at Splitgraph [0], although we actually try to avoid reductive descriptors like "Git for data" and "GitHub for data." Forcing the Git workflow "for data" is the wrong approach, and we don't believe it's what analytics engineers actually want to use.
Our solution is a "data delivery network" [1] that looks like a big Postgres database with 40,000+ datasets in it. So you can connect to it [2] with any SQL client that is compatible with the Postgres wire protocol. You can query individual datasets, or join across them, as if they were tables in a Postgres database. Those datasets can be "live" (in which case we forward queries to the upstream data source, e.g. a government data portal), or they can be versioned snapshots called "data images." Data images are a lot like Docker images, and you can build them with declarative recipes called Splitfiles, all using our open source tool. [3] Then you can push the images to Splitgraph and query them alongside all the other data on the platform.
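Since the DDN speaks the Postgres wire protocol, any standard client library can query it. A minimal sketch with psycopg2 is below; the hostname, database name, credentials and dataset path are placeholders/assumptions, not values taken from the platform:

```python
# Minimal sketch of querying a Postgres-compatible data delivery network
# with a standard client. Hostname, dbname, credentials and the dataset
# path are assumptions / placeholders - substitute your own values.
import psycopg2

conn = psycopg2.connect(
    host="data.splitgraph.com",   # assumed DDN endpoint
    port=5432,
    dbname="ddn",                 # placeholder database name
    user="<api-key>",
    password="<api-secret>",
)

with conn.cursor() as cur:
    # Datasets are addressed like schemas, e.g. "namespace/repository"."table"
    # (the names below are illustrative only).
    cur.execute('SELECT * FROM "some-namespace/some-dataset"."some_table" LIMIT 5')
    for row in cur.fetchall():
        print(row)
```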
The public platform is mostly community data right now. Soon you'll be able to connect arbitrary upstreams to it (e.g. connect directly to a Mongo/Postgres/MySQL database by providing its read-only credentials in the web UI). As an example, we have a repository that forwards queries to the OxCOVID19 database, which is a Postgres database maintained by a lab at Oxford. [4]
We're also building an enterprise program for hosting private deployments of Splitgraph. We have a pilot program where customers can use Splitgraph as a data cataloging and governance solution for their internal company data. We've also raised a round of funding but haven't announced it yet. If anyone is interested in joining the pilot program, get in touch. You can also read more details about how we dogfood Splitgraph in our own analytics stack. [5]
[0] https://github.com/sdv-dev/SDV