Meta Unveils SAM Audio: Text Prompts Now Isolate and Edit Complex Sound
The unified multimodal AI allows users to isolate and edit sound segments instantly using simple text prompts and visual cues.
December 26, 2025

Meta Platforms has announced a significant expansion of its influential Segment Anything Model (SAM) family, introducing SAM Audio, a unified multimodal AI system designed to bring the precision and ease of prompt-based segmentation to the world of sound. The release positions the company to democratize advanced audio editing, allowing content creators and professionals to isolate, extract, or remove specific sounds from complex audio mixtures using simple text prompts, visual cues from a corresponding video, or time-span selections. The model’s core innovation is to treat sound elements—whether music, speech, or sound effects—as segments that can be addressed and manipulated through intuitive user input, much as the original SAM model lets users select and segment objects in an image with a single click.
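To make the prompt-driven workflow concrete, the short Python sketch below models the three prompt styles described above as plain data objects routed through a single entry point. It is purely illustrative: the class names, fields, and function are assumptions made for this article, not part of Meta's released API.

from dataclasses import dataclass
from typing import Tuple, Union

# Hypothetical containers for the three prompt styles the article describes:
# free-form text, a click on an object in a video frame, and a highlighted
# span on the waveform.

@dataclass
class TextPrompt:
    description: str                 # e.g. "remove the train noise"

@dataclass
class VisualPrompt:
    frame_index: int                 # video frame the user clicked on
    point_xy: Tuple[int, int]        # pixel coordinates of the click

@dataclass
class SpanPrompt:
    start_s: float                   # start of the highlighted waveform region
    end_s: float                     # end of the region

Prompt = Union[TextPrompt, VisualPrompt, SpanPrompt]

def describe(prompt: Prompt) -> str:
    """Express any prompt type as one separation request (sketch only)."""
    if isinstance(prompt, TextPrompt):
        return f"separate sounds matching text: {prompt.description!r}"
    if isinstance(prompt, VisualPrompt):
        return f"separate the source clicked at {prompt.point_xy} in frame {prompt.frame_index}"
    return f"separate whatever dominates {prompt.start_s:.1f}s-{prompt.end_s:.1f}s"

if __name__ == "__main__":
    for p in (TextPrompt("remove the train noise"),
              VisualPrompt(frame_index=120, point_xy=(640, 360)),
              SpanPrompt(start_s=12.5, end_s=14.0)):
        print(describe(p))

Whichever form the prompt takes, the system treats it as the same kind of request: point at a sound, and the model segments it.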
The technical foundation of SAM Audio represents a leap forward from previous, fragmented audio processing tools. Traditionally, separating individual sound sources from a noisy or mixed recording required specialized software, deep domain expertise, and labor-intensive manual work, often with limited flexibility for sounds outside of predefined categories like vocals or drums. SAM Audio, however, is presented as a unified generative AI model capable of handling diverse, real-world sound scenarios. Under the hood, the model uses a flow-matching Diffusion Transformer architecture that operates in a learned latent space, specifically that of a variational autoencoder variant of the Descript Audio Codec (DAC-VAE), which allows for efficient and high-fidelity reconstruction of sound. Crucially, its architecture incorporates a Perception Encoder Audiovisual (PE-AV) component, which enables the model to align and fuse information from multiple modalities—audio, video, and language—into a shared embedding space. This multimodal approach is what powers the model’s versatility, allowing a user to type "remove the train noise," click on a passing car in a video frame, or simply highlight the offending sound on a waveform to isolate or eliminate it.[1][2][3][4]
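The sketch below traces that data flow end to end with every component stubbed out in NumPy: a DAC-VAE-style codec maps the waveform to a latent sequence, a PE-AV-style encoder maps the prompt into a conditioning embedding, and a flow-matching Diffusion Transformer stand-in generates the target stem's latents, which are decoded back to audio. The function names, tensor shapes, and hop size are assumptions made for illustration and do not correspond to the released SAM Audio code.

import numpy as np

LATENT_DIM, COND_DIM = 64, 512   # illustrative sizes

def encode_to_latents(wave: np.ndarray) -> np.ndarray:
    """Stub for the DAC-VAE encoder: waveform -> (frames, LATENT_DIM)."""
    frames = len(wave) // 320                     # assumed hop size
    return np.zeros((frames, LATENT_DIM), dtype=np.float32)

def encode_prompt(prompt: str) -> np.ndarray:
    """Stub for a PE-AV-style prompt encoder: prompt -> shared embedding."""
    rng = np.random.default_rng(abs(hash(prompt)) % (2 ** 32))
    return rng.standard_normal(COND_DIM).astype(np.float32)

def flow_matching_dit(mix_latents: np.ndarray, cond: np.ndarray) -> np.ndarray:
    """Stub for the conditional generator: returns target-stem latents."""
    return np.zeros_like(mix_latents)

def decode_from_latents(latents: np.ndarray) -> np.ndarray:
    """Stub for the DAC-VAE decoder: latents -> waveform."""
    return np.zeros(latents.shape[0] * 320, dtype=np.float32)

def separate(mix: np.ndarray, prompt: str) -> tuple[np.ndarray, np.ndarray]:
    """Isolate the prompted sound; return (target stem, residual mix)."""
    cond = encode_prompt(prompt)
    target_latents = flow_matching_dit(encode_to_latents(mix), cond)
    target = decode_from_latents(target_latents)
    residual = mix[: len(target)] - target        # everything except the target
    return target, residual

if __name__ == "__main__":
    mix = np.random.randn(16_000 * 3).astype(np.float32)   # 3 s of dummy audio
    stem, rest = separate(mix, "remove the train noise")
    print(stem.shape, rest.shape)

Deriving the residual by subtraction is a simplification to keep the sketch self-contained; the point is that one prompt, in any modality, drives the whole pipeline.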
The efficiency and performance metrics released by Meta underscore the model's potential for integration into real-time production environments. SAM Audio is available in three sizes, ranging from 500 million to 3 billion parameters, with the largest variant capable of operating faster than real time, achieving a Real-Time Factor (RTF) of approximately 0.7. In practice, that means processing a clip takes roughly 70 percent of its playback duration, about 30 percent less time than listening to it, making the model suitable for live production workflows, video editing, and applications where immediate feedback is necessary. Benchmark tests have shown the model achieves state-of-the-art separation quality across a broad range of domains, including music, speech, and general sound effects. To aid researchers in further development and objective comparison, Meta also released SAM Audio Judge, a companion model for benchmarking audio separation results. Furthermore, the model's core output is a dual-stream generation—the isolated target sound stem and a residual mix of everything else—which directly translates into easy-to-use editor operations. For example, a podcaster wanting to remove a dog bark keeps the residual mix, while a music producer wanting to isolate a guitar track keeps the target stem.[1][2][5][4]
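As a quick sanity check on those figures, the snippet below turns the reported Real-Time Factor into a processing-time estimate and shows how keeping one stream or the other maps onto "isolate" and "remove" edits. The helper names are hypothetical, not part of the released tooling.

def processing_time(audio_seconds: float, rtf: float = 0.7) -> float:
    """Real-Time Factor = processing time / audio duration."""
    return audio_seconds * rtf

def apply_edit(target_stem, residual_mix, mode: str):
    """Choose which of the two generated streams the editor keeps."""
    if mode == "isolate":      # e.g. a producer keeping the guitar stem
        return target_stem
    if mode == "remove":       # e.g. a podcaster dropping the dog bark
        return residual_mix
    raise ValueError(f"unknown mode: {mode!r}")

if __name__ == "__main__":
    # A 60-second clip at RTF 0.7 is processed in about 42 seconds,
    # roughly 30 percent less time than its playback duration.
    print(f"{processing_time(60.0):.0f} s to process a 60 s clip")
    stems = ("guitar stem", "residual mix")        # stand-ins for audio arrays
    print("producer keeps:", apply_edit(*stems, "isolate"))
    print("podcaster keeps:", apply_edit(*stems, "remove"))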
The implications of SAM Audio extend far beyond professional sound mixing, signalling a significant shift in the content creation and broader technology industries. By enabling natural language and click-based editing, the technology drastically lowers the barrier to entry for complex audio manipulation, effectively democratizing a skillset that was once restricted to highly trained audio engineers. This user-friendliness is particularly impactful for the rapidly growing cohort of social media content creators, video editors, and podcasters who often lack the resources or expertise for professional audio post-production. The ability to effortlessly clean up a noisy interview, strip traffic sounds from a vlog, or extract a specific instrument from a live recording using an intuitive interface is expected to streamline workflows and elevate the overall quality of user-generated content. Beyond creative fields, the technology holds promise for scientific research, such as isolating animal calls in wildlife recordings, and accessibility applications, like creating specialized tools for the hearing impaired.[6][7][8][5][9][10]
Meta's decision to release the code and model weights for SAM Audio as open-source, under a permissive license (the SAM License, allowing both research and commercial use), aligns with its strategy of establishing its foundation models as industry standards. This open release encourages rapid adoption and innovation, inviting developers and researchers globally to integrate the system into new applications, build upon its architecture, and test its limits. However, the release of such a powerful and versatile tool is not without its challenges. The open-source nature immediately raises questions regarding potential misuse, particularly concerning the ethical implications of isolating individual voices or manipulating recordings with unprecedented ease. While Meta has noted that the use of its materials must comply with all applicable laws and regulations, the speed and accuracy of the model in separating audio components, especially in the context of identifying individual speakers or sounds in private recordings, will necessitate an ongoing discussion about privacy and security in the age of prompt-driven audio AI. Ultimately, SAM Audio represents a decisive step in the multimodal AI landscape, moving beyond visual and text domains to offer a unified, accessible, and high-performing tool for audio segmentation, setting a new expectation for how sound is edited and understood by AI systems.[1][11][7][3][5][9]
Sources
[1]
[2]
[6]
[8]
[9]
[10]
[11]