☕️ The Biggest Night in AI
AI Tangle Newsletter
The latter part of the week was filled with major releases, like OpenAI's eyebrow-raising text-to-video generation model Sora and Google's next-gen Gemini 1.5, capable of sifting through hours of audio and video in one go. But there's more than that, as Nvidia releases its own offline, locally run chatbot, Mozilla takes a step back to refocus on Firefox, and Amazon trains the largest text-to-speech model yet. Join us at AI Tangle as we untangle what's been happening in the world of AI.
THE BIG AI STORY
Next to take up video generation after Google and Meta is OpenAI, as the company revealed Sora, its text-to-video generation model. Capable of both extending existing videos and creating movie-like videos with multiple characters, backgrounds, and different types of motion in a variety of styles, from photorealistic to animated to regular black-and-white, Sora is a big step forward for both OpenAI and AI-powered video generation as a whole.
What's the future of Sora?
Currently, Sora can create 1080p videos up to a minute long, and its grasp of coherence is reasonable, albeit with much room for improvement. Occasionally, the AI still breaks coherence and casually defies the laws of physics, such as cars changing direction instantly between frames. OpenAI acknowledges that Sora is not perfect, stating that it may struggle with simulating physics and may not understand specific instances of cause and effect. OpenAI also states that Sora is currently a "research preview" while the company continues ironing out exploits to prevent malicious use of the model, adding that there's still much to discuss with policymakers and educators.
7 QUICK HITS
Just over two months after the initial release of Gemini, Google unveiled Gemini 1.5, a vastly more capable and faster version of its predecessor, available to developers and enterprises now. The all-purpose Gemini 1.5 Pro, which leverages the "Mixture of Experts" technique, is claimed to outperform its 1.0 Pro counterpart in roughly 87% of benchmark tests while performing on par with the larger Gemini 1.0 Ultra. But the most intriguing part of Gemini 1.5 is its context window of a whopping 1 million tokens, meaning Gemini can take roughly an hour of video, 11 hours of audio, or tens of thousands of lines of code as input.
As Nvidia ramps up its production of AI chips and powerful video cards with AI capabilities, the company's next initiative is its Chat with RTX tool, a generative AI chatbot able to run locally and offline on a Windows computer (sorry, Linux and Mac!). By default, Chat with RTX leverages the open-source Mistral 7B model, though support for Meta's Llama 2 is available, too. Chat with RTX can be connected to local documents, which users can then query, and currently works with text, PDF, .doc, .docx, and .xml formats. However, a major current limitation of the product is its inability to remember context - asking a follow-up to an earlier question will yield disappointing results.
Not wanting to be a step behind, Stability AI launched its newest and most capable image generation model yet - Stable Cascade, available now on GitHub for research but not commercial use. The company claims the model is "exceptionally easy" to train and, better yet, more powerful than its already successful flagship Stable Diffusion model, the basis for many image generation models today. The unique quirk of Stable Cascade is its build - it's not composed of one model but instead three, leveraging what's known as the Würstchen architecture.
Amazon's AGI team of researchers has trained "the largest text-to-speech AI model yet," claiming that it's starting to show hints of "emergent abilities." With this model, named Big Adaptive Streamable TTS with Emergent abilities (or BASE TTS), Amazon hoped that text-to-speech would see the same "leap" in ability that large language models have shown, becoming able to perform tasks it wasn't explicitly trained for. With 980 million parameters, the BASE-large variant is the biggest text-to-speech model. However, it was the medium-sized variant, at 400 million parameters, that first showed Amazon's researchers a flicker of hope.
Mozilla, the organization behind the popular Firefox browser, plans to downsize, with the cuts estimated to affect roughly 60 employees. Mozilla's VPN, Relay, and its 3D virtual world Hubs, released in 2018, are just a few examples of the many products and services the company wishes to cut its investments back on. In an internal memo, Mozilla states that the downsizing is part of the organization's plans to return its focus to what made it big in the first place, Firefox, and to bring "trustworthy AI" to the browser.
On Tuesday, Google announced that it would be joining the non-profit Environmental Defense Fund (EDF) on a mission to map methane pollution and oil and gas infrastructure from space. The partnership is aimed at figuring out where exactly large leaks are happening in hopes of plugging them, especially as methane has much more severe near-term consequences than regular carbon dioxide. Just next month, EDF plans to launch MethaneSAT, a powerful satellite that will track methane emissions, while Google tries its hand at mapping global oil and gas infrastructure with AI.
Back in November, American homestay giant Airbnb acquired a "stealth AI firm" called GamePlanner, a deal that valued the startup at $200m. Now, Airbnb co-founder and CEO Brian Chesky is revealing some ambitious plans to put GamePlanner to use, wishing to create the "ultimate concierge" and build one of the "most innovative AI interfaces ever created." Airbnb is not an AI infrastructure company, so instead of building its own models, Chesky aims to leverage existing ones, like those from OpenAI, Meta, and Google, while integrating GamePlanner's team and tooling into the venture.
4 AI TOOLS
Retell AI - For developers needing to add a voice to their projects, Retell is an easy-to-use API for building human-like conversational AI agents with a flow that just feels right.
Infobox - Infobox AI lets you create and customize your own personal AI assistant. With a wide array of features to boot, Infobox makes maintaining your data simple and hassle-free.
Gem - Unify your recruiting tech stack with Gem, a platform powered by AI, CRM, and analytics to provide an all-in-one solution to recruiters' problems.
Dola - Dola's flexible GPT-4-powered AI calendar assistant makes managing your time simpler than ever, allowing you to chat your way to a stress-free schedule.
AI READ & WATCH
Are AI Chatbots Ruining Online Dating? (7-min read)
With the age of the internet came online dating, and as internet access spread, online dating became a fierce competition. Now, as AI grows in importance, how and why are people using it to give themselves a leg up?
Gemini 1.5 and The Biggest Night in AI (28-min watch)
The release of both Sora and Gemini 1.5 has made for one monstrous Wednesday night, and AI Explained, a YouTube channel dedicated to covering the latest in AI, is keen to delve into the papers of Gemini and explain what's happening under the hood.