Thought open source AI was done with camelids? Think again: yesterday, Together, a Menlo Park, California-based company focused on building a decentralized cloud and open source models, announced RedPajama (yes, like Llama Llama Red Pajama).
“In many ways, AI is having its Linux moment,” the company said in a blog post, citing a post written in January by Chris Ré, co-founder of Together, Stanford associate professor and co-founder of SambaNova, Snorkel.ai and Factory.
RedPajama is a collaborative project between Together, Ontocord.ai, ETH DS3Lab, Stanford CRFM, Hazy Research and the MILA Québec AI Institute to create leading, fully open-source large language models (LLMs). The effort began with yesterday’s release of a 1.2-trillion-token dataset that follows the LLaMA recipe. The data enables any organization to pre-train models that can be permissively licensed. The full dataset is available on Hugging Face, and users can reproduce the results with Apache 2.0 scripts available on GitHub.
LLaMA is a state-of-the-art foundational LLM released in February by Meta with gated access for researchers. Several other models based on LLaMA have come out in recent weeks, including Alpaca, Vicuna and Koala – but those models are not available for commercial use. There was also some drama when the LLaMA weights were leaked on 4chan.
In the coming weeks, Together will release a full suite of LLMs and instruction-tuned versions based on the RedPajama dataset. The company emphasized that the upcoming models will be fully open source and commercially viable. In a tweet, the company said, “We hope this can be a clean-room, drama-free release. The RedPajama models we release in the coming weeks will be released under the Apache 2.0 license.”
RedPajama is part of a wave of open source AI
As VentureBeat reported last week, open source AI has had a moment in recent weeks, following a wave of LLM releases and an effort by startups, collectives and academics to push back against the shift in AI toward closed, proprietary LLMs.
And a camelid-adjacent model, Dolly 2.0 (as in Dolly the sheep), also made headlines last week when its developer, Databricks, called it the first open, instruction-following LLM for commercial use.
But the largest, state-of-the-art open source LLMs like LLaMA are limited to the research community. “They’re limited because you can’t build real applications and ship them,” said Vipul Ved Prakash, founder and CEO of Together and formerly co-founder of Cloudmark and Topsy. “We think having permissively licensed models is a critical aspect of open source AI.”
Replicating the LLaMA dataset was no easy feat
The company started with LLaMA, which it called the “leading suite of open base models” because it was trained on a “very large dataset that was carefully filtered for quality.” Also, the 7 billion parameter LLaMA model is “trained much longer, well beyond the Chinchilla optimal point, to ensure the best quality at that model size.”
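The “Chinchilla optimal point” refers to the scaling heuristic that compute-optimal training uses roughly 20 tokens per model parameter. A minimal sketch of that arithmetic (the 20:1 ratio and the reported 1-trillion-token training run for LLaMA 7B are outside approximations, not figures from Together’s post):

```python
def chinchilla_optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Approximate compute-optimal token count under the Chinchilla heuristic."""
    return n_params * tokens_per_param

# A 7B-parameter model is "Chinchilla optimal" at roughly 140B tokens.
optimal = chinchilla_optimal_tokens(7e9)  # 1.4e11 tokens

# LLaMA 7B was reportedly trained on ~1T tokens, i.e. about 7x past that point,
# which is what "trained much longer, well beyond the Chinchilla optimal point" means.
ratio_past_optimal = 1e12 / optimal
```

Training a small model well past its compute-optimal budget trades extra training compute for better quality at a fixed (small) model size, which is cheaper to serve.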
While neither the dataset nor the model will be identical, the developers aim to create a fully open source reproduction of LLaMA that would be available for commercial applications, and provide a “more transparent pipeline for research.”
The developers didn’t have access to the LLaMA dataset, but had enough of the recipe to work from. “We followed the recipe very carefully to essentially recreate [the LLaMA dataset] from scratch,” Prakash said. The dataset consists of seven data slices, including data from Common Crawl, arXiv, GitHub, Wikipedia and a corpus of open books.
“For each data slice, we perform careful data pre-processing and filtering, and tune our quality filters to roughly match the number of tokens reported by Meta AI in the LLaMA paper,” the blog post read.
“All of the data LLaMA has been trained on is openly available data, but the challenge was that they didn’t provide the actual dataset — there’s a lot of work to do to get from the outline to the actual dataset,” said Prakash. He explained that, for example, the paper would describe how they chose the best 10,000 out of a million documents, but it didn’t include those 10,000. “So we followed the recipe of repeating all that work to create an equivalent dataset,” he said.
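Prakash’s “best 10,000 out of a million” example amounts to scoring every document and keeping the top k. A toy sketch of that step, with a hypothetical `quality_score` heuristic standing in for the trained quality classifiers a real pipeline would use:

```python
import heapq

def quality_score(doc: str) -> float:
    # Hypothetical stand-in for a real quality classifier: favors documents
    # with a more diverse vocabulary, capped by length. Real pipelines score
    # documents with trained models, not a heuristic like this.
    words = doc.split()
    if not words:
        return 0.0
    return len(set(words)) / len(words) * min(len(words), 100)

def top_k_documents(docs: list[str], k: int) -> list[str]:
    """Keep the k highest-scoring documents, as in 'best 10,000 of a million'."""
    return heapq.nlargest(k, docs, key=quality_score)

corpus = [
    "the the the the",
    "a short but varied sentence about language models",
    "",
]
best = top_k_documents(corpus, 1)
```

In practice, the scoring functions, thresholds and per-slice token targets are exactly what the RedPajama team tuned to roughly match the counts reported in the LLaMA paper.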
The debate about building transparent systems
Prakash said the RedPajama project team believes it is important that systems are transparent. “You know exactly how this model was built, what went into it,” he said. “If you’re trying to improve it, you can start with the dataset.”
The project also brings together a larger community around these models, he added. “I would say academia has really been left out of foundation model research because of the level of resources required, from the data to the compute,” he said. Today there are only a small number of people in the world working on these large models, and if there were wider access, he said, “a lot of brilliant people” around the world would be able to explore different directions in neural architectures, training algorithms and safety research.
“This is also one of the first truly general AI technologies that can be adapted to different tasks, and we think its applicability is very broad,” he said. “But many different applications are only possible if you can access the model and the model weights, and adapt them to different computing environments. We see a lot of this happening because of open source AI.”
However, there is another side to the open source AI debate. For example, Ilya Sutskever, chief scientist and co-founder of OpenAI, said recently that it was “wrong” to share research so openly, pointing to fears around both competition and safety. He added that “at some point it will be very easy to do a lot of damage with those models, if you wanted to.”
And in a recent interview with VentureBeat, Joelle Pineau, VP of AI research at Meta, said that while accountability and transparency in AI models are essential, the key for Meta is balancing the level of access, which can vary depending on a model’s potential harm.
“My hope, and it’s reflected in our data access strategy, is to figure out how to enable transparency for verifiability audits of these models,” she said, adding that access can be determined based on the level of potential harm of the data or model.
On the other hand, she said some levels of openness go too far. “That’s why the LLaMA model had a gated release,” she explained. “Many people would have been very happy to open up completely. I don’t think that’s the right thing to do today.”
There are also debates about ethical datasets
There have also been discussions about the ethics of the datasets themselves, whether the models are open or closed. An article last week in The Guardian said that the “massive datasets used to train the latest generation of these AI systems, such as those behind ChatGPT and Stable Diffusion, probably contain billions of images scraped from the internet, millions of pirated ebooks, the entire proceedings of 16 years of the European Parliament and the whole of English-language Wikipedia.”
But Prakash says he thinks “somehow these models capture the output of human society and there’s a kind of obligation to make them open and usable for everyone.” He added that “most of the magic” of these models comes from the fact that they have been trained on “very broad and extensive” data.
He also pointed out that the original data is compressed significantly in the actual model. The RedPajama dataset is 5 terabytes, while the models can be as small as 14 GB, several hundred times smaller than the original data they model.
“This means knowledge is abstracted from the data, transformed and modeled into a very different representation of the weights and biases of parameters in the neural network model, and not stored and used in its original form,” Prakash said. The model is therefore “not reproducing the training data – it is derivative work. From our understanding, it’s considered fair use as long as the model doesn’t reproduce the data – it learns from it.”
There is no doubt that the open source AI debates are very complex. But when asked why the company called the new project RedPajama, the answer was much simpler. “Many of us have small children,” Prakash said. “It just seemed fun.”