Gemini image generation problem 🛑, Microsoft AI servers 💾, simpler RL algorithms beat PPO 🤖

TLDR AI 2024-02-26

6 hard lessons we learned about automated testing for genAI apps (Sponsor)

Testing LLMs is not simple. Probabilistic output makes failures hard to identify, while running the models repeatedly tends to become very expensive quickly.

In this blog post, QA Wolf engineer John Gluck covers 6 things the team learned about building automated black-box regression tests for genAI applications, including:

Writing deterministic tests for non-deterministic outputs
How to control costs (where possible)
Should you use AI to test AI?

Read the article on the QA Wolf blog (ungated)

🚀

Headlines & Launches

Microsoft Reportedly Makes AI Server Gear To Cut Nvidia Dependence (1 minute read)

Microsoft is creating its own AI server hardware to decrease its dependency on Nvidia.

Google admits it lost control of image-generating AI (5 minute read)

Google has acknowledged an issue with its AI model Gemini. The model injected inappropriate diversity into historical images, reflecting problems with bias in training data. The flaw sparked debate around diversity, equity, and inclusion in tech. Google implied that it will make improvements in the future but stopped short of a full apology for failing to properly contextualize historical figures in image generation.

Is OpenAI the next challenger trying to take on Google Search? (1 minute read)

OpenAI is working on a web search product to compete more directly with Google. It is unclear whether the product will be standalone or part of ChatGPT. The search space is getting crowded quickly with the addition of Copilot on Bing, newcomers like Perplexity, and Google's Gemini. A YouTube Short featuring Microsoft CEO Satya Nadella talking about competing with Google is available in the article.

🧠

Research & Innovation

Simpler RL algorithms beat PPO (22 minute read)

REINFORCE is a simple, standard, and easily understood RL method, but it is hard to train stably when used in simulators, where PPO is generally much more performant and stable. This piece argues that simpler algorithms such as REINFORCE can nonetheless beat PPO. Gemini uses REINFORCE and GPT-4 is believed to use PPO.
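
For readers who have not seen REINFORCE before, here is a minimal sketch on a toy multi-armed bandit. It is not the setup from the article: the function name `reinforce_bandit`, the arm means, the learning rate, and the running-mean baseline are all illustrative assumptions.

```python
import math
import random


def softmax(logits):
    """Turn a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(arm_means, steps=2000, lr=0.1, seed=0):
    """Toy REINFORCE on a bandit (hypothetical example, not from the article).

    The policy is a softmax over one logit per arm. After each pull, every
    logit moves along (reward - baseline) * grad log pi(action).
    """
    rng = random.Random(seed)
    logits = [0.0] * len(arm_means)
    baseline = 0.0  # running mean of rewards, used to reduce variance
    for t in range(1, steps + 1):
        probs = softmax(logits)
        action = rng.choices(range(len(probs)), weights=probs)[0]
        reward = rng.gauss(arm_means[action], 0.1)
        advantage = reward - baseline
        for k in range(len(logits)):
            # d/d logit_k of log pi(action) for a softmax policy
            grad_log_prob = (1.0 if k == action else 0.0) - probs[k]
            logits[k] += lr * advantage * grad_log_prob
        baseline += (reward - baseline) / t
    return softmax(logits)


final_probs = reinforce_bandit([0.2, 1.0])
print(final_probs)
```

The whole update is a single line of gradient ascent on the log-probability of the sampled action. PPO adds machinery such as clipped probability ratios on top of a policy gradient like this one.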

Flow Matching for Generating Proteins (28 minute read)

AlphaFold predicts the structure of a protein after folding. Adding flow matching, which is invertible, can dramatically improve its modeling power across the entire landscape of proteins.

Efficient LLMs with Expert-Level Sparsification (GitHub Repo)

Researchers have developed a new way to make LLMs more efficient and easier to use by employing a method that focuses on 'expert-level sparsification', which reduces model size without losing performance. This is particularly useful for Mixture-of-Experts LLMs, which are powerful but usually too big to handle easily.

🧑‍💻

Engineering & Resources

Virtual conference: Master AI and ML in production at apply() (Sponsor)

How do you transition from experimental models to highly scalable applications? apply() is a free community conference, where you can learn from seasoned engineers and technical leaders. Don’t miss out on the chance to hone your AI/ML skills in these FREE practical sessions and tech workshops! Save your spot now

Improved Hand-Object Interactions (3 minute read)

GeneOH Diffusion is a new technique that improves how models understand and interact with objects using hands. This method focuses on making these interactions more natural by correcting errors in hand movements and relations with objects.

Snap Video Model (8 minute read)

Snap Research has trained a video generation model that matches the previous state of the art (excluding Sora) while being 3x faster to run.

Production-ready RL library from Meta (GitHub Repo)

Deployed in auction bidding systems, recommendation engines, and more at Meta, Pearl is ready for research and deployment.

🎁

Miscellaneous

OpenCodeInterpreter approaches GPT-4 code performance (4 minute read)

A model based on CodeLlama and DeepSeek Coder was able to get 85%+ on the HumanEval benchmark for programming by training on a synthetic multi-turn dataset and using human feedback.
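
HumanEval scores are usually reported as pass@k: the chance that at least one of k sampled solutions passes a problem's unit tests. The sketch below implements the standard unbiased estimator for one problem. The blurb does not say which k the 85%+ figure refers to, and the function name `pass_at_k` and the sample counts are our own illustration.

```python
import math


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: number of solutions sampled for the problem
    c: number of those solutions that passed the tests
    k: number of attempts allowed
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # Probability that a random size-k subset contains no passing solution.
    all_fail = math.comb(n - c, k) / math.comb(n, k)
    return 1.0 - all_fail


# Hypothetical counts: 10 samples drawn, 5 of them pass.
print(pass_at_k(10, 5, 1))
```

A benchmark-level score is the average of this value over all problems.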

Interpretability update from Anthropic (14 minute read)

Anthropic's research scientists have been working on a method of understanding deep neural networks that uses Circuits. These Circuits aim to identify subparts of models that get used for certain tasks. The research team has released a monthly update on the experiments they attempted and the results.

A New Benchmark for Search Engines (16 minute read)

INSTRUCTIR is a new benchmark aimed at making search engines better at understanding users' intentions. Unlike current methods, which mostly focus on the query itself, INSTRUCTIR evaluates how well search engines can follow user instructions and adapt to varied and changing search needs.

⚡

Quick Links

Enhancing the safety and reliability of computer vision systems (Sponsor)

Join this Kolena webinar to explore unique challenges in computer vision - including precise object detection and real-time decision-making - through the lens of autonomous vehicle systems. Join free

LLMs for Annotation (GitHub Repo)

This is a curated list of papers about using LLMs for Annotation.

Mindy (Product)

Email-based Chief of Staff powered by AI.

Sam Altman Wants $7 Trillion (8 minute read)

Sam Altman's request for $7 trillion aims to support the rapidly escalating costs of advancing generative AI models like GPT, suggesting an exponential growth in resource needs for future iterations. This ambition underscores a pivotal moment in AI development, balancing between rapid technological progress and the broader implications of such swift advancement on safety and societal readiness.

Thanks for reading,
Andrew Tan & Andrew Carr