Fun with the spawn of Git and NoSQL
26 October 2017
Hey, kids, check out the latest progress on the Attaca version control system.
What's this? It's basically the spawn of Git and a NoSQL database. So why would anybody want to make that? For Science, of course. A lot of research produces huge data files, and people would like a resilient way to collaborate on them using commands they already know, while still scaling horizontally across large numbers of nodes, NoSQL style.
Git has the advantage that a lot of people know it, but it doesn't really handle huge files that well. There are add-on solutions to make it work by connecting to another system for handling large files, but then you have to set up and trust two systems. And one of my favorite properties of Git is that any authorized user of a project can check the integrity of the entire project back to the beginning.
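To make that integrity property concrete, here is a toy sketch of a hash-chained history. It is not Git's real object format and not Attaca's either: `commit_id`, `build_history`, and `verify_history` are made-up names, and SHA-256 is just a convenient choice for the illustration. The point is only that each commit ID depends on its parent's ID, so anyone holding the history can re-derive every ID back to the first commit.

```python
import hashlib

def commit_id(parent_id, content):
    """Hypothetical commit ID: hash of the parent's ID plus a hash of the content."""
    h = hashlib.sha256()
    h.update(parent_id.encode())
    h.update(hashlib.sha256(content).digest())
    return h.hexdigest()

def build_history(contents):
    """Chain a list of byte strings into a linear history, oldest first."""
    history = []
    parent = ""  # the first commit has no parent
    for content in contents:
        cid = commit_id(parent, content)
        history.append({"id": cid, "parent": parent, "content": content})
        parent = cid
    return history

def verify_history(history):
    """Recompute every ID from the beginning; any tampering breaks the chain."""
    parent = ""
    for commit in history:
        if commit["parent"] != parent:
            return False
        if commit["id"] != commit_id(parent, commit["content"]):
            return False
        parent = commit["id"]
    return True
```

Changing the content of any old commit makes `verify_history` return `False`, because the stored ID no longer matches the recomputed one.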
So what Attaca does is to consistently split huge files across a cluster, using cluster nodes that can be cheap VPSs, low-end servers with spinning disks, whatever. (In the test environment, nodes are just Linux containers.)
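Here is a minimal sketch of what "consistently split across a cluster" could look like. Everything in it is an assumption made for illustration, not Attaca's actual design: fixed-size chunks, SHA-256 chunk IDs, rendezvous hashing to pick a node, and the names `split_chunks`, `chunk_id`, `pick_node`, and `place`.

```python
import hashlib

def split_chunks(data, chunk_size):
    """Split bytes into fixed-size chunks; the last chunk may be shorter."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

def chunk_id(chunk):
    """Content address: identical chunks always get the same ID."""
    return hashlib.sha256(chunk).hexdigest()

def pick_node(cid, nodes):
    """Rendezvous hashing: score every node against the chunk ID, take the highest."""
    def score(node):
        return hashlib.sha256((node + ":" + cid).encode()).hexdigest()
    return max(nodes, key=score)

def place(data, nodes, chunk_size):
    """Return a list of (chunk ID, node name) pairs, one per chunk, in file order."""
    return [(chunk_id(c), pick_node(chunk_id(c), nodes))
            for c in split_chunks(data, chunk_size)]

if __name__ == "__main__":
    nodes = ["node-a", "node-b", "node-c"]  # hypothetical node names
    for cid, node in place(b"pretend this is a genome" * 4, nodes, 16):
        print(cid[:12], "->", node)
```

Because placement depends only on the chunk's content and the node list, every client computes the same answer without asking a central coordinator, and removing a node only moves the chunks that lived on it.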
More: The architecture of Attaca, milestones, and current progress.
Next steps are to test it out with some scientific data (genomes, medical imaging, and so on), implement some more Git commands so that people can check files out and not just in, and build a (Raspberry Pi?) demo cluster.