Google Overhauls Gemini AI Usage Limits Following Backlash Over Rapid Quota Depletion
Google eases compute-based quotas and offers unlimited Flash-Lite access following intense backlash from frustrated AI subscribers
May 29, 2026

Google has rolled out a series of significant updates to the usage limits of its Gemini artificial intelligence platform in response to mounting user frustration. The adjustments come shortly after the company shifted from a traditional message-counting quota to a complex compute-based usage system[1][2]. While this shift was designed to better align resource consumption with the processing power required for advanced tasks, it quickly resulted in users hitting their access caps far sooner than expected[1][2]. Many subscribers reported being locked out of the service after executing only a handful of commands, prompting Google to implement quick patches to stabilize the user experience, fix critical bugs, and soften the impact of resource-intensive processes on individual allowances[2][3].
At the heart of the recent controversy is a structural change in how Google calculates the limits of its AI assistant. Rather than capping users at a fixed number of messages per day, the company moved Gemini to compute-based accounting: usage is measured against a rolling five-hour window and a weekly cap, and each request is charged according to the computational work it actually requires[1][4]. Under this framework, simple text queries draw very little from a user's quota, while heavy actions like generating code, running long multimodal tasks, and uploading large files consume significantly more[1][2]. The change was meant to manage server load more dynamically, but it triggered an immediate wave of complaints across online communities as users found the new quotas unpredictable and restrictive[1][5].
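Google has not published the units, budgets, or formulas behind this system, so the following Python sketch is purely illustrative. It assumes abstract "compute units", a hypothetical `ComputeQuota` class, and made-up budget numbers, and shows how a rolling five-hour window and a weekly cap can both gate the same request.

```python
from collections import deque
from dataclasses import dataclass, field

FIVE_HOURS = 5 * 60 * 60     # seconds
ONE_WEEK = 7 * 24 * 60 * 60  # seconds


@dataclass
class ComputeQuota:
    """Hypothetical compute-weighted quota: a rolling 5-hour window plus a weekly cap.

    Budgets are in abstract compute units; the real units and numbers are not public.
    """
    window_budget: float
    weekly_budget: float
    _events: deque = field(default_factory=deque)  # (timestamp, cost) pairs, oldest first

    def _used_since(self, start: float) -> float:
        # Sum the cost of every charge made strictly after `start`.
        return sum(cost for ts, cost in self._events if ts > start)

    def try_charge(self, cost: float, now: float) -> bool:
        # Charges older than a week no longer affect either limit, so drop them.
        while self._events and self._events[0][0] <= now - ONE_WEEK:
            self._events.popleft()
        if self._used_since(now - FIVE_HOURS) + cost > self.window_budget:
            return False  # would exceed the rolling five-hour window
        if self._used_since(now - ONE_WEEK) + cost > self.weekly_budget:
            return False  # would exceed the weekly cap
        self._events.append((now, cost))
        return True


quota = ComputeQuota(window_budget=100, weekly_budget=400)
print(quota.try_charge(1, now=0))    # a light text query
print(quota.try_charge(90, now=10))  # a heavy coding task
print(quota.try_charge(20, now=20))  # rejected: the five-hour window is nearly full
```

A request is admitted only if it fits under both limits at once, which is why a few heavy requests can block access even when the weekly total is far from exhausted.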
The unpredictability of the compute-based system was compounded by how Gemini handles ongoing conversations. Because a generative AI model must process the full conversation history along with each new prompt, keeping a single lengthy conversation active became a major resource drain[6][7]. Users unknowingly burned through their five-hour allowances simply by asking basic follow-up questions in a long-running thread, as the system re-processed the entire cumulative history of the chat every time[7]. This lack of transparency and the rapid depletion of allowances left both free-tier users and paying premium subscribers feeling penalized for using the platform's core strengths[5][7].
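The cost of a long thread can be illustrated with simple arithmetic. This sketch assumes that every prompt re-reads the full history and ignores model responses and any caching Google may apply; the function name `cumulative_tokens` is hypothetical.

```python
def cumulative_tokens(turn_sizes: list[int]) -> int:
    """Total input tokens processed if each prompt re-reads the whole thread so far."""
    history = 0  # tokens currently in the conversation
    total = 0    # tokens processed across all prompts
    for size in turn_sizes:
        history += size   # the new message joins the thread
        total += history  # the entire thread is processed again
    return total


print(cumulative_tokens([500] * 20))  # 105000
print(20 * 500)                       # 10000 if each message started a fresh chat
```

Twenty 500-token messages sent in one thread add up to 105,000 processed tokens, against 10,000 if each were sent in a fresh chat, because the total grows roughly with the square of the number of turns.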
One of the most pressing bugs Google addressed in its recent rollout involved its newly introduced Omni world model, which is designed to generate complex video content[8][3]. Shortly after its release, users began complaining that attempting to generate just one or two Omni videos would completely exhaust their entire computational quota, sometimes locking them out before the video generation request had even finished processing[3][9]. Google acknowledged that this rapid depletion was due to an internal system error that miscalculated the quota deduction for these intensive video tasks[8][4]. The company confirmed that the bug has been resolved, so a single media request should no longer trigger the limit on its own[1][4].
To appease its most dedicated users, particularly those paying for the top-tier Google AI Ultra subscription, the company has also announced that it is doubling the allocation of Omni video generations available to these members[1][3]. This move represents a strategic effort to rebuild trust with power users who felt the new restrictions crippled the high-end creative workflows they were paying for[3][5]. By separating video generation from standard text-based caps and expanding the allowances for premium subscribers, Google is attempting to position Gemini as a viable professional tool while continuing to monitor backend resource distribution[1][3].
Beyond resolving specific video glitches, Google is introducing broader structural safeguards to prevent massive individual tasks from draining a user's entire account balance[3][9]. For those using the more capable Gemini Pro model, the company is capping the amount of computational quota any single prompt can consume[1][2]. Even if a user uploads an exceptionally large dataset, video, or codebase, the deduction for that prompt is limited, so the user can keep using the Pro model for subsequent tasks instead of being locked out immediately[1][2].
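The effect of a per-prompt ceiling can be sketched as follows. Google has not disclosed the ceiling or exactly how a request draws down the balance; this sketch assumes that a request which starts can consume whatever budget remains, and the function `serve_prompts` and all numbers are hypothetical.

```python
from typing import Optional


def serve_prompts(budget: float, costs: list[float],
                  per_prompt_cap: Optional[float] = None) -> list[bool]:
    """Report which prompts are served, optionally with a hypothetical per-prompt ceiling."""
    remaining = budget
    served = []
    for cost in costs:
        if remaining <= 0:
            served.append(False)  # locked out until the window resets
            continue
        charge = cost if per_prompt_cap is None else min(cost, per_prompt_cap)
        remaining -= min(charge, remaining)  # a started request can drain what is left, but no more
        served.append(True)
    return served


# One huge upload (150 units) followed by two light prompts, with a 100-unit budget.
print(serve_prompts(100, [150, 5, 5]))                     # [True, False, False]
print(serve_prompts(100, [150, 5, 5], per_prompt_cap=40))  # [True, True, True]
```

Without the ceiling, the upload empties the budget and the follow-up prompts are refused; with a 40-unit ceiling, 50 units remain after all three prompts.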
To further alleviate quota pressure, Google has exempted its lighter Flash-Lite model from usage limits entirely[1][8]. Prompts processed through this highly efficient model will no longer count toward a user's rolling five-hour or weekly limits[1][8]. This provides a reliable and unrestricted fallback for basic everyday tasks, encouraging users to switch between models based on their needs[1].
Crucially, the company also clarified that users will no longer be penalized for system errors or failed requests[1][8]. Under the previous implementation, if Gemini failed to complete a task or threw an error, the attempt would still draw down the user's quota[9]. Moving forward, Google has pledged that system mistakes are on the company, and only successfully completed requests will count against a subscriber's balance[1][8].
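The two new charging rules can be expressed as a small function. This is an assumption-laden sketch: the model identifier `"flash-lite"`, the `Request` type, and the cost values are hypothetical and not part of any published Gemini API.

```python
from dataclasses import dataclass

# Hypothetical identifier for the quota-exempt Flash-Lite model.
EXEMPT_MODELS = {"flash-lite"}


@dataclass
class Request:
    model: str           # which model handled the prompt
    compute_cost: float  # abstract compute units the request used
    succeeded: bool      # False if Gemini errored or failed to finish


def quota_charge(req: Request) -> float:
    """Units deducted from the rolling and weekly limits under the announced rules."""
    if req.model in EXEMPT_MODELS:
        return 0.0  # Flash-Lite prompts no longer count toward limits
    if not req.succeeded:
        return 0.0  # failed requests are no longer billed to the user
    return req.compute_cost


history = [
    Request("pro", 12.0, True),
    Request("pro", 30.0, False),       # errored mid-task: free under the new policy
    Request("flash-lite", 3.0, True),  # exempt model: free
]
print(sum(quota_charge(r) for r in history))  # 12.0
```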
To combat confusion about how quotas are spent, Google is working to make its quota accounting more transparent[3]. The company is developing richer tracking tools and more detailed usage breakdowns, which will soon be integrated directly into the dashboard[1][4]. These tools will be particularly useful for heavy, agentic features like Deep Research, which naturally require massive amounts of compute to scour the web and synthesize information[1][7]. By showing users exactly how and where their computational budget is being spent, Google hopes to help users manage their workflows more predictably and avoid sudden service interruptions[1][4].
Additionally, Google is introducing a small but highly requested quality-of-life feature to improve how users manage their choice of model. The Gemini interface will now remember the last model selected by a user across all future chat sessions[1][8]. Previously, the app could silently revert to lighter models or switch dynamically based on invisible criteria. Now, the chosen model will remain the default unless the user manually changes it or hits a hard computational cap that triggers an automatic fallback, accompanied by a notification[1][2]. This gives subscribers greater consistency and prevents the frustration of unknowingly using a weaker model for a task that requires advanced reasoning[1][5].
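The selection behavior amounts to a small piece of persistent state. In this sketch, the `ModelPreference` class, the model identifiers, and the choice of Flash-Lite as the fallback target are all assumptions; Google has not said which model the automatic fallback uses.

```python
from dataclasses import dataclass, field


@dataclass
class ModelPreference:
    # Hypothetical model identifiers; the fallback target is an assumption.
    selected: str = "pro"
    fallback: str = "flash-lite"
    notices: list = field(default_factory=list)

    def select(self, model: str) -> None:
        """A manual choice becomes the default for every future chat."""
        self.selected = model

    def model_for_chat(self, at_hard_cap: bool) -> str:
        """Use the saved choice unless a hard cap forces a visible fallback."""
        if at_hard_cap and self.selected != self.fallback:
            self.notices.append(
                f"Usage limit reached: using {self.fallback} instead of {self.selected}"
            )
            return self.fallback
        return self.selected
```

The fallback applies to that chat only; the saved preference is left unchanged, so the user's chosen model returns once the cap resets.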
The rapid adjustment of Gemini's usage limits highlights a broader, industry-wide challenge facing major technology companies in the generative AI era. As AI models become capable of processing massive context windows and executing heavy multimodal and reasoning-based operations, the operational costs of running these systems are skyrocketing[6][7]. Balancing user satisfaction and predictable pricing against the volatile backend costs of server compute is proving to be difficult[5][7]. Google’s swift response to subscriber backlash serves as a case study in the delicate unit economics of consumer AI, showing that while advanced features are necessary to compete, a transparent, fair, and reliable user experience is equally critical to retaining a loyal user base.