Hugging Face today announced it has acquired Seattle-based XetHub, a collaborative development platform founded by former Apple researchers to help machine learning teams work more efficiently with large datasets and models.
While the exact value of the deal remains undisclosed, CEO Clem Delangue said in an interview with Forbes that this is the largest acquisition the company has made to date.
The HF team plans to integrate XetHub’s technology with its platform and upgrade its storage backend, enabling developers to host more large models and datasets than currently possible, with minimal effort.
“The XetHub team will help us unlock the next 5 years of growth of HF datasets and models by switching to our own, better version of LFS as a storage backend for the Hub’s repos,” Julien Chaumond, the CTO of the company, wrote in a blog post.
What does XetHub bring to Hugging Face?
Founded in 2021 by Yucheng Low, Ajit Banerjee and Rajat Arya, who worked on Apple’s internal ML infrastructure, XetHub made a name for itself by providing enterprises with a platform to explore, understand and work with large models and datasets.
The offering enabled Git-like version control for repositories going up to terabytes in size, allowing teams to track changes, collaborate and maintain reproducibility in their ML workflows.
During these three years, XetHub drew a sizeable customer base, including major names like Tableau and Gather AI, with its ability to handle complex scalability needs stemming from constantly growing tools, files and artifacts. It improved storage and transfer processes using advanced techniques like content-defined chunking, deduplication, instant repository mounting and file streaming.
Now, with this acquisition, the XetHub platform will cease to exist and its data and model handling capabilities will come to the Hugging Face Hub, upgrading the model and dataset sharing platform with a more optimized storage and versioning backend.
On the storage front, the HF Hub currently uses Git LFS (Large File Storage) as the backend. It launched in 2020, but Chaumond says the company has long known that the storage system would not be enough after a certain point, given the constantly growing volume of large files in the AI ecosystem. It was a good starting point, but the company needed an upgrade, which will come with XetHub.
Currently, the XetHub platform supports individual files larger than 1TB, with the total repository size going well above 100TB. That is a major upgrade over Git LFS, which only supports a maximum file size of 5GB and a maximum repository size of 10GB. This will enable the HF Hub to host even larger datasets, models and files than currently possible.
On top of this, XetHub’s additional storage and transfer features will make the package even more lucrative.
For instance, the content-defined chunking and deduplication capabilities of the platform will let users upload only the chunks containing new rows when a dataset is updated, rather than re-uploading the whole set of files again (which takes a lot of time). The same will be the case for model repositories.
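To make the idea concrete, here is a minimal Python sketch of content-defined chunking combined with hash-based deduplication. It is not XetHub’s or Hugging Face’s implementation: the gear-style rolling hash, the `MASK`, `MIN_SIZE` and `MAX_SIZE` values, the in-memory `store` dictionary and the function names are all assumptions picked for illustration. It only shows why appending rows to an already uploaded file needs just a small upload.

```python
import hashlib
import random

# Hypothetical parameters, chosen for illustration only.
MASK = 0x3FF      # cut when the low 10 bits of the hash are zero
MIN_SIZE = 256    # never cut a chunk shorter than this
MAX_SIZE = 8192   # always cut once a chunk reaches this size

# Fixed pseudo-random table mapping each byte value to a 32-bit number.
_table_rng = random.Random(0)
GEAR = [_table_rng.getrandbits(32) for _ in range(256)]


def split_chunks(data: bytes) -> list[bytes]:
    """Split data at positions chosen by a rolling hash of the content."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Older bytes shift out of the masked bits, so the cut decision
        # depends only on the most recent bytes, not on absolute offsets.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        size = i - start + 1
        if (size >= MIN_SIZE and (h & MASK) == 0) or size >= MAX_SIZE:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


def upload(data: bytes, store: dict[str, bytes]) -> int:
    """Store chunks not seen before; return the number of bytes sent."""
    sent = 0
    for chunk in split_chunks(data):
        key = hashlib.sha256(chunk).hexdigest()
        if key not in store:  # deduplication: skip chunks already stored
            store[key] = chunk
            sent += len(chunk)
    return sent


def demo() -> tuple[int, int]:
    """Upload a stand-in dataset, then the same dataset with rows appended."""
    rng = random.Random(1)
    original = bytes(rng.getrandbits(8) for _ in range(200_000))
    new_rows = bytes(rng.getrandbits(8) for _ in range(5_000))
    store: dict[str, bytes] = {}
    first = upload(original, store)
    second = upload(original + new_rows, store)
    return first, second


if __name__ == "__main__":
    first, second = demo()
    print(f"initial upload: {first} bytes, after update: {second} bytes")
```

In this sketch the second upload sends at most the last partial chunk of the old file plus the appended bytes. Because cut points are derived from the content rather than from fixed offsets, boundaries also tend to realign after an insertion in the middle of a file, which is where fixed-size blocks would all shift and have to be sent again.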
“As the field moves to trillion parameters models in the coming months (thanks Maxime Labonne for the new BigLlama-3.1-1T) our hope is that this new tech will unlock new scale both in the community and inside enterprise companies,” the CTO noted. He also added that the companies will work closely to launch solutions aimed at helping teams collaborate on their HF Hub assets and track how they are evolving.
Currently, the Hugging Face Hub hosts 1.3 million models, 450,000 datasets and 680,000 spaces, totaling as much as 12PB in LFS.
It will be interesting to see how this number grows once the improved storage backend, with its support for larger models and datasets, comes into play. The timeline for the integration and launch of other supporting features remains unclear at this stage.