In the rapidly evolving landscape of artificial intelligence, a fervent race is underway to develop sophisticated agents capable of complex reasoning and autonomous action. While the current paradigm heavily favors language models, leveraging their ability to process text, understand instructions, and interact with tools, a nascent startup, Standard Intelligence, is charting a distinctly different course. The company is making an ambitious wager that the most direct and scalable path to achieving truly general artificial intelligence lies not in the intricacies of human language or the abstraction of screenshots, but in the raw, unadulterated stream of video data.
This contrarian approach positions Standard Intelligence against the prevailing wisdom, which has largely focused on enhancing language models and building agent ecosystems around them. Coding agents, for instance, have demonstrated remarkable capabilities in understanding problems and generating code to solve them, pushing the boundaries of what AI can achieve in software development. However, Standard Intelligence posits that this approach, while effective, may be inherently limited in its ability to achieve true generality.
The core of Standard Intelligence’s strategy is to train AI agents directly on the visual, sequential data of human computer usage. Instead of predicting discrete text tokens, their models learn to perform actions by analyzing pixels, predicting the precise sequence of mouse movements, clicks, and keystrokes required to navigate and interact with a computer interface. This methodology draws a compelling parallel to the advancements in autonomous driving technology, particularly the "Tesla Full Self-Driving" approach, which relies heavily on processing real-world visual data to enable vehicles to perceive and react to their environment. Standard Intelligence aims to apply a similar foundational principle to the realm of knowledge work and digital interaction.
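The training target described above can be pictured as a timestamped stream of low-level input events paired with video frames. A minimal sketch of what one such trajectory might look like; the schema, event names, and fields here are hypothetical, not Standard Intelligence's actual format:

```python
from dataclasses import dataclass
from typing import Literal

@dataclass
class ComputerAction:
    """One step in a computer-use trajectory (hypothetical schema)."""
    t_ms: int                      # time offset into the recording
    kind: Literal["move", "click", "key"]
    x: int = 0                     # cursor position for move/click events
    y: int = 0
    key: str = ""                  # key identifier for keystroke events

# A trajectory pairs spans of video frames with the actions taken in them,
# e.g. moving the cursor to a control, clicking it, then typing.
trajectory = [
    ComputerAction(t_ms=0,   kind="move",  x=640, y=360),
    ComputerAction(t_ms=120, kind="click", x=640, y=360),
    ComputerAction(t_ms=480, kind="key",   key="g"),
]
assert [a.kind for a in trajectory] == ["move", "click", "key"]
```

A model trained this way would, given the pixels so far, predict the next event in such a stream rather than the next text token.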
This "bitter lesson"-inspired philosophy, as described in their publications, emphasizes learning from direct experience and scaling data aggressively, rather than relying on meticulously engineered workflows or intricate language model wrappers. By feeding their models the raw stream of computer use, they anticipate that general capabilities will emerge organically from the sheer volume and diversity of the data. This represents a significant departure from current trends, which often involve complex prompting strategies and the integration of numerous specialized tools.
The Unconventional Path Through Video
The decision to focus on video data is, by all accounts, a challenging one. Video is notoriously unwieldy, demanding immense computational resources and presenting significant technical hurdles. Historically, attempts to scale video processing for advanced AI applications have often faltered due to these inherent difficulties. The Standard Intelligence team openly acknowledges their non-traditional background in video processing, stating they are "not video people." This lack of pre-existing biases or ingrained assumptions about video as a medium, however, appears to have fostered a unique problem-solving approach. They have had to derive solutions from first principles, tackling each challenge with a blend of optimism, ingenuity, and a pragmatic, resourceful attitude.
Their efforts have yielded impressive, tangible results, defying expectations for a young company. Standard Intelligence has reportedly amassed an 11-million-hour computer action dataset, a figure that stands as the largest of its kind in the industry. This dataset forms the bedrock of their training regimen. Furthermore, they have developed a video encoder that is approximately 50% more token-efficient than competing approaches. This efficiency allows nearly two hours of 30 frames-per-second video to fit within a single 1-million-token context window, a capability that is crucial for capturing the nuances and temporal dependencies inherent in computer usage.
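The quoted figures imply a strikingly small per-frame token budget. A back-of-the-envelope check, using only the numbers reported above:

```python
# Back-of-the-envelope check of the quoted encoder figures.
fps = 30
hours = 2
context_tokens = 1_000_000

frames = fps * 3600 * hours            # 216,000 frames in two hours
tokens_per_frame = context_tokens / frames
print(f"{frames} frames -> {tokens_per_frame:.2f} tokens per frame")
# Roughly 4.6 tokens per frame, compared with the hundreds of tokens
# per image typical of standard vision-language tokenizers.
```

That single-digit per-frame budget is what makes fitting hours of screen recording into one context window plausible at all.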
Beyond the computational and algorithmic advancements, Standard Intelligence has also made significant strides in infrastructure. They have established a 30-petabyte storage cluster in San Francisco for an estimated cost of under $500,000, approximately 20% cheaper than comparable solutions from the major cloud providers, often referred to as hyperscalers. This cost-effective infrastructure management is a testament to their resourceful approach and ability to innovate beyond core AI development.
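Taken at face value, the cluster figures work out to a notably low cost per terabyte. A quick calculation from the numbers above:

```python
# Cost per terabyte implied by the quoted cluster figures.
capacity_pb = 30
cost_usd = 500_000

capacity_tb = capacity_pb * 1000       # 30 PB = 30,000 TB (decimal units)
usd_per_tb = cost_usd / capacity_tb
print(f"${usd_per_tb:.2f} per TB of raw capacity")
# About $16.67/TB as a one-time hardware cost, before accounting for
# power, replacement drives, and operations.
```

This is raw acquisition cost only; the comparison to cloud pricing in the text presumably factors in those ongoing operational expenses.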
FDM-1: A Glimpse of Video-Driven Generality
The culmination of these efforts is evident in their foundational model, FDM-1. This model, trained directly and at scale on computer-use video, offers an early preview of the potential unlocked by this video-first pre-training paradigm. FDM-1 has demonstrated a surprising breadth of generalist capabilities. For instance, it can reportedly extrude a CAD gear in Blender, a complex 3D modeling application. In a more dynamic test, it was able to navigate a simulated car around a San Francisco block after just an hour of fine-tuning. Furthermore, the model has shown an aptitude for identifying software bugs by exploring the state space of applications in a manner analogous to a curious human user. These diverse applications highlight the model’s ability to generalize its learning across different tasks and domains, a key objective in the pursuit of artificial general intelligence.
The Visionaries Behind Standard Intelligence
The company’s trajectory is shaped by its young and driven founders, Galen Mead and Devansh Pandey. The pair first met as teenagers in 2022 during the Atlas Fellowship, a prestigious program for high school students focused on AI alignment and AGI. This early exposure to the critical questions surrounding advanced AI has seemingly instilled in them a profound sense of responsibility and urgency. Both Mead (21) and Pandey (20) have deferred their undergraduate studies, driven by a shared commitment to accelerating progress in the field of AGI.
Their leadership is characterized by a rare combination of refined taste, resourcefulness, technical audacity, and ambitious vision. This is reflected not only in their product development and research direction but also in their detailed and insightful reporting on the FDM-1 model. Their commitment to pursuing this challenging mission, eschewing more conventional career paths such as lucrative offers from established tech giants or immediate pursuit of advanced degrees, underscores their dedication.
The broader Standard Intelligence team, though small, comprises six exceptionally talented individuals. Neel, Yudhister, Ulisse, and Ryan are described as quirky and exceptional, each bringing unique skills and perspectives to the table. Their collective decision to join Standard Intelligence signifies a shared belief in the company’s unconventional but potentially groundbreaking approach to AI development.
A New Pre-Training Regime for the Age of AI
The concept of using video as a training ground for AI is not entirely new. Early successes, such as DeepMind’s Deep Q-Networks (DQN), demonstrated that agents could learn complex behaviors directly from pixel data in Atari video games. More recently, companies like Tesla have scaled video-centric models to enable autonomous navigation for vehicles and robots in the physical world. However, within the specific context of developing general knowledge agents and sophisticated digital assistants, video-first pre-training has remained a relatively unconventional idea, often overshadowed by the dominance of large language models.
Standard Intelligence is now betting that this will not remain the case for long. Their work suggests that by unlocking the potential of raw visual data, a new era of AI agents, capable of understanding and interacting with the digital world in a more profound and generalizable way, may be on the horizon. This strategic pivot towards video pre-training represents a significant challenge to the established norms in AI research and development.
The company has recently secured Series A funding, with the round’s lead investor announcing they are "thrilled to lead Standard Intelligence’s Series A alongside Miko and Yasmin from Spark Capital." This investment signifies a strong vote of confidence from the venture capital community in Standard Intelligence’s vision and its potential to disrupt the current trajectory of AI agent development. The infusion of capital will likely accelerate their research, data acquisition, and model development, allowing them to further validate their video-centric approach and push the boundaries of what is currently considered possible in artificial intelligence. The implications of their success could extend far beyond the immediate applications, potentially reshaping how we interact with technology and how AI agents are developed and deployed across a multitude of industries.