Dolt is Git for data. Instead of versioning files, Dolt versions tables. DoltHub is a place on the internet to share Dolt repositories. As far as we can tell, Dolt is the only database with branches. How would you use such a thing?
One of the hard things about driving adoption of Dolt is that it is a generally useful tool, not a specifically useful one. In other words, Dolt can be used for many different tasks. This is a strength in the long term but a handicap in the short term. We need to find the use case that drives Dolt and DoltHub adoption now, or we won't be able to stick around long enough to see Dolt and DoltHub in action for all the use cases we can imagine.
This document describes some of our ideas about how Dolt can be used. We've ordered the document from most to least ready. Towards the bottom of the list, Dolt is missing some functionality that must be built. The list is not exhaustive. If you have a killer use case, please let me know at tim@liquidata.co.
Sharing Data on the Internet
Sharing data on the Internet was the guiding use case for which we built the current iteration of Dolt and DoltHub. We set out to build a fundamentally better format than CSV, JSON, or an API for distributed data collaboration. With those formats, every write you make to data you receive is, in version control lingo, a hard fork. With Dolt, you can write to data you get off the internet and still merge updates easily. You can even give these writes back to the producer, allowing deep collaboration between data producer and data consumer.
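To make "write to the data and still merge updates" concrete, here is a minimal sketch of a three-way merge over rows keyed by primary key. It is only an illustration of the idea, not Dolt's actual merge implementation; the table contents and the function name three_way_merge are made up for the example.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]
Table = Dict[Any, Row]  # primary key -> row


def three_way_merge(base: Table, ours: Table, theirs: Table) -> Tuple[Table, List[Any]]:
    """Merge two edited copies of a table against their common ancestor.

    Returns the merged table and the primary keys of rows that both sides
    changed in different ways (conflicts). Conflicting rows keep our version.
    """
    merged: Table = {}
    conflicts: List[Any] = []
    # Visit every primary key that appears in any of the three versions.
    for key in {**base, **ours, **theirs}:
        b, o, t = base.get(key), ours.get(key), theirs.get(key)
        if o == t:
            result = o  # both sides agree (including "both deleted")
        elif o == b:
            result = t  # only the producer changed this row
        elif t == b:
            result = o  # only we changed this row
        else:
            conflicts.append(key)  # both changed it differently
            result = o
        if result is not None:
            merged[key] = result
    return merged, conflicts


if __name__ == "__main__":
    # Hypothetical data: the version we downloaded, our edits, the producer's update.
    base = {1: {"county": "Alameda", "cases": 10}, 2: {"county": "Kern", "cases": 4}}
    ours = {1: {"county": "Alameda", "cases": 11}, 2: {"county": "Kern", "cases": 4}}
    theirs = {
        1: {"county": "Alameda", "cases": 10},
        2: {"county": "Kern", "cases": 5},
        3: {"county": "Marin", "cases": 1},
    }
    print(three_way_merge(base, ours, theirs))
```

With a plain CSV there is no recorded common ancestor, so the base argument is exactly what you do not have. That is why every local edit to a CSV becomes a hard fork.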
Collectively, we spend a lot of code taking data out of a database, putting it in a format for sharing, sharing it, and then putting it back in a database format for consumption. Dolt allows you to instead share the database, including schema and views, allowing you to delete all the code used to transfer data.
DoltHub is a beautiful interface for exploring data. DoltHub allows you to "try before you buy" data. You can run SQL queries on the web to see if the data matches your needs. The data provider can even build sample queries to guide the consumer's exploration. Via the commit log, you can see how often the data is updated. You can see who changed the data and why.
The data you share can be private. For a small fee, you can host private datasets and add read, write, or admin collaborators. Work with a distributed team to build a great dataset for you all to use.
We spend about 20% of our time as a company curating, gathering, and cleaning data to show off this use case. You can see all the public datasets on the DoltHub Discover page. Add your data to the collection and start being part of our data community.
Ingesting Data You Do Not Own
Are you using data you don't own in your project? When your provider sends you a new copy of the data, do things break? We suspect the answer is yes. The first thing you should do when you ingest data you don't own is put it in Dolt.
With Dolt, you can view a human-readable diff of the data you received last time versus the data you received this time. You can easily see updates you did not expect and fix the problem before you deploy the new data. If something broke when you started using the data, you can easily switch back to the previous version and start debugging from the diff. Via DoltHub webhooks, you can attach your continuous integration suite to the data and run tests whenever you receive new data.
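As a rough illustration of what a row-level diff between two deliveries contains, the sketch below compares two snapshots keyed by primary key. Dolt produces this kind of comparison itself (for example through its dolt diff command); the code is only a conceptual stand-in, and the snapshot data and the name diff_tables are invented for the example.

```python
from typing import Any, Dict, Tuple

Row = Dict[str, Any]
Table = Dict[Any, Row]  # primary key -> row


def diff_tables(old: Table, new: Table) -> Tuple[Table, Table, Dict[Any, Dict[str, Tuple[Any, Any]]]]:
    """Return (added rows, removed rows, modified cells) between two snapshots.

    Modified cells map primary key -> column -> (old value, new value).
    """
    added = {k: new[k] for k in new if k not in old}
    removed = {k: old[k] for k in old if k not in new}
    modified: Dict[Any, Dict[str, Tuple[Any, Any]]] = {}
    for k in old:
        if k in new and old[k] != new[k]:
            columns = set(old[k]) | set(new[k])
            # Keep only the columns whose value actually changed.
            modified[k] = {
                c: (old[k].get(c), new[k].get(c))
                for c in columns
                if old[k].get(c) != new[k].get(c)
            }
    return added, removed, modified


if __name__ == "__main__":
    # Hypothetical deliveries from a data provider.
    last_time = {1: {"sku": "A-1", "price": 10}, 2: {"sku": "B-2", "price": 4}}
    this_time = {1: {"sku": "A-1", "price": 12}, 3: {"sku": "C-3", "price": 1}}
    added, removed, modified = diff_tables(last_time, this_time)
    print("added:", added)
    print("removed:", removed)
    print("modified:", modified)
```

A continuous integration check could fail the build when the removed or modified sets are larger than expected, which is the kind of test the webhook setup described above would trigger.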
You can manage feeds of different quality that have the same schema using branches. We illustrated a good example of how to do that with Coronavirus case details data. By adopting Dolt, using external data in your organization becomes an order of magnitude easier and will cause fewer production issues.
Versioning Data Lake Query Output
Most businesses have scheduled multi-hour data jobs that produce output tables. For instance, you may have a job that produces business metrics for yesterday. We suggest you version those output tables in Dolt. If you notice a problem, you'll be able to see the diff between yesterday's and today's runs. Even better, if someone notices three days from now that an issue started seven days ago, you'll be able to produce a diff between those arbitrary dates as well.
If a change to the upstream data or data processing code is required, just move the bad reports to a separate branch, rerun the jobs, and look at the diffs to make sure you fixed the problem. Versioning really helps in these scenarios.
This also turns these tables from read-only to writable, enabling unique capabilities for your data science team. Produce daily reports using your pipelines and distribute the results as writable tables to your data science team, allowing them to comment on or label results. Dolt provides an audit trail to see who wrote what, when, and why.
Reproducing Models or Analysis
Have you ever had a hard time tracking down what version of the data generated that model that works so well? Is someone asking you to reproduce an analysis with data from a week ago? Dolt can help you here.
Dolt has the concept of commits. A commit marks a dataset at a point in time, and it is really simple to switch between commits. If you produce a model or analysis, make a commit of the data at that point in time and note the commit in the documentation of the model or analysis. When you or someone else returns to that model in the future, getting the data as it looked then is as simple as checking out the commit.
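One lightweight way to "note the commit" is to store it in a small metadata record next to the model. The sketch below is an assumption about how you might organize this, not a Dolt feature: the record fields, the model name, the commit hash placeholder, and both function names are hypothetical. The only Dolt-specific part is the checkout command line it builds, which assumes the dolt checkout -b <branch> <commit> form; the sketch never runs it.

```python
import json
from datetime import datetime, timezone
from typing import Dict, List


def model_record(model_name: str, data_commit: str) -> Dict[str, str]:
    """Build a metadata record tying a model to the data commit it was trained on."""
    return {
        "model": model_name,
        "data_commit": data_commit,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


def checkout_command(record: Dict[str, str], branch: str) -> List[str]:
    """Build (but do not run) the command that recreates the data on a new branch."""
    return ["dolt", "checkout", "-b", branch, record["data_commit"]]


if __name__ == "__main__":
    # "abc123" is a placeholder; in practice you would paste the real commit hash.
    record = model_record("churn-model", "abc123")
    print(json.dumps(record, indent=2))
    print(" ".join(checkout_command(record, "reproduce-churn-model")))
```

Keeping the record as JSON means it can live in the model's documentation or artifact store, so whoever revisits the model later knows exactly which data version to check out.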
If your data is labeled, do you ever want to see if a particular labeler, machine or human, has a particular bias? Inspect the commit history of any cell in the Dolt database to see who changed it and why. Use branches to try different labeling or noise strategies with the ability to easily compare branches. Build your models off branches for reproducibility of every model.
Better Database Backups
Imagine the ability to diff between yesterday's and today's database backup. Run continuous integration tests on your backups to make sure everything is working. Distribute a copy of the backup to analysts so they don't have to run queries on the production database. Push a copy to DoltHub to get the safety of distributed replication. You can do all of this if you use Dolt for database backups.
Right now, you would have to make a dump of your database and import it into Dolt. In the near future, you will be able to set up a Dolt instance as a replicator of your database with a configurable commit strategy. Do you want to commit after every write, every night, or after every one million writes? With Dolt, your backups are actually useful, not just insurance against a disaster scenario.
Add Data Versioning to an Application
Do you have an application that would be better if it provided data versioning? Dolt could eventually back that application. We are still a little way from recommending Dolt for this use case. We need replication, concurrency, hooks to generate commits, and a host of other things. We're getting there, but it will be a few months.
Try Dolt Today
As you can see, Dolt and DoltHub can be used for a number of different tasks. Do any of these use cases resonate? If so, try Dolt today. We are early on this adventure. Come be part of it with us.