Databricks today released Dolly 2.0, the successor to the large language model (LLM) with ChatGPT-like human interactivity (that is, instruction-following) that the company released just two weeks ago.
The company says Dolly 2.0 is the first open source, instruction-following LLM fine-tuned on a transparent and freely available dataset that is itself open source and licensed for commercial use. That means Dolly 2.0 is available for commercial applications without having to pay for API access or share data with third parties.
While there are other LLMs that can be used for commercial purposes, “They won’t talk to you like Dolly 2.0,” said Ali Ghodsi, CEO of Databricks. And, he explained, users can modify and improve the training data because it is made freely available under an open source license. “So you can make your own version of Dolly,” he said.
Databricks has released the dataset on which Dolly 2.0 has been trained
In addition, Databricks said that as part of its ongoing commitment to open source, it is also releasing the dataset on which Dolly 2.0 was fine-tuned, called databricks-dolly-15k. This is a corpus of more than 15,000 records generated by thousands of Databricks employees, and Databricks describes it as the first open source, human-generated instruction corpus specifically designed to enable large language models to exhibit the magical interactivity of ChatGPT.
There has been a spate of instruction-following, ChatGPT-like LLM releases over the past two months that are considered open source by many definitions (or that offer some degree of openness or gated access), including Meta’s LLaMA, which in turn inspired others, such as Alpaca, Koala, Vicuna and Databricks’ Dolly 1.0.
Many of these models, however, were trained on output from ChatGPT, which has kept them from being used commercially. Databricks says it has found a way around this problem: Dolly 2.0 is a 12B parameter language model based on the open source EleutherAI Pythia model family and fine-tuned exclusively on a small, open source corpus of instruction records (databricks-dolly-15k) generated by Databricks employees. The license terms of this dataset allow it to be used, modified, and extended for any purpose, including academic or commercial applications.
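To make the idea of fine-tuning on instruction records more concrete, here is a minimal sketch of how such records might be turned into training text. It is an illustration only, not Databricks' actual training code: the prompt template and the `format_record` helper are hypothetical, the example records are made up, and the field names (`instruction`, `context`, `response`) are an assumption about how each record is laid out.

```python
# Illustrative sketch only. The template and helper are hypothetical and are
# not taken from Databricks' training code.

PROMPT_TEMPLATE = (
    "### Instruction:\n{instruction}\n\n"
    "{context_block}"
    "### Response:\n{response}"
)


def format_record(record: dict) -> str:
    """Turn one instruction record into a single training string."""
    context = record.get("context", "").strip()
    # Only emit a context section when the record actually carries context.
    context_block = f"### Context:\n{context}\n\n" if context else ""
    return PROMPT_TEMPLATE.format(
        instruction=record["instruction"].strip(),
        context_block=context_block,
        response=record["response"].strip(),
    )


# Made-up records, assuming fields named instruction / context / response.
records = [
    {
        "instruction": "Name the capital of France.",
        "context": "",
        "response": "Paris.",
    },
    {
        "instruction": "Summarize the passage in one sentence.",
        "context": "Dolly 2.0 is a 12B parameter model based on Pythia.",
        "response": "Dolly 2.0 is a 12B parameter Pythia-based model.",
    },
]

training_texts = [format_record(r) for r in records]
print(training_texts[0])
```

Because the records are plain text under a permissive license, anyone could in principle run a step like this over the corpus, edit or extend the records, and fine-tune a different base model on the result.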
Models trained on ChatGPT’s output have been in a legal gray area until now. “The whole community has tiptoed around this and everyone is putting out these models, but none of them can be used commercially,” Ghodsi said. “So that’s why we’re super excited.”
Dolly 2.0 is small but mighty
A Databricks blog post emphasized that the 2.0 version, like the original Dolly, is not state-of-the-art, but “displays a surprisingly capable level of instruction-following behavior given the size of the training corpus,” adding that the level of effort and cost required to build powerful AI technologies is “orders of magnitude less than previously thought.”
“Everyone wants to get bigger, but we’re actually interested in getting smaller,” Ghodsi said of Dolly’s petite size. “Second, it is of high quality. We have looked at all the answers.”
Ghodsi added that he believes Dolly 2.0 will create a “snowball effect,” in which others in the AI community join in and come up with other alternatives. The limit on commercial use, he explained, was a major obstacle to overcome: “We are delighted that we have finally found a way to do it. I promise you’ll see people apply the 15,000 questions to every model out there, and they’ll see how many of these models suddenly become a little magical where you can interact with them.”