Let me paint a picture: Bob is a talented engineer, eager to make a difference on the team. He works hard and releases several changes to production. However, despite following the proper testing and release procedures, a bad release manages to slip through, impacting customers and causing frustration.

Although Bob isn’t to blame, the incident takes a mental toll. He starts worrying that his next change might introduce a similar issue, making him hesitant to release future updates.

How can we improve the process to ensure better outcomes for both internal and external customers?

At Atlan, our mission is to help the humans of data do more, together. As we rapidly iterate and release new features, we must balance fast development cycles with robust release management to minimize disruptions for our users.

The Initial Release Process

Our initial approach aimed to test changes in multiple environments—beta, staging, canary, and finally, general availability (GA)—before rolling out updates broadly.

  • Beta: The first stage of testing, where a small group of users accessed experimental versions.
  • Staging (Mabl): We then pushed updates to a staging environment, where automated testing tools like Mabl simulated real-world conditions.
  • Canary: A subset of production traffic was routed to the new release, allowing us to monitor for early signs of issues.
  • General Availability (GA): If the canary phase was successful, the release was rolled out to all users.

While this process worked in theory, we faced several challenges—particularly around long-running canaries, maintenance overhead, and version divergence from the master branch.

Challenges with Our Existing Process

  1. Canary Runs Were Too Long: The canary phase often stretched beyond the intended timeframe. Instead of using configuration toggles, customer-specific code was deployed directly through canaries, resulting in canaries that lived indefinitely.
  2. Maintenance Overhead: Managing multiple canaries became a nightmare, especially as they diverged from the master branch. Any new changes in GA had to be merged back into all active canaries, creating additional complexity.
  3. Version Divergence from Master: The longer a canary remained active, the more it diverged from the master branch. This led to difficult merge conflicts and increased engineering effort to keep branches aligned.
  4. Manual Canary Management: The process relied heavily on manual steps. Developers had to:
    • Manually raise a canary Pull Request (PR)
    • Manually raise another PR to revert a canary in case of regressions
    • Manually delete the canary after promoting to GA
    • Manually merge GA changes back into existing canaries

The Human Bottleneck

All of these issues stemmed from human intervention in the release process. Manually managing canaries did not scale, was prone to human error, and added unnecessary cognitive load for developers.

Here is how Engineer Bob might describe the situation: “The existing process makes sense in theory, but in practice, it’s frustrating. Most of my time is spent managing the process instead of actually working on my tasks. I have to manually select customers, create canaries, delete them after GA, and merge changes back into other canaries. I want to focus on my work, not the procedure.”

Introducing Ring Deployment

To address the challenges that engineers like Bob were facing, we introduced Ring Deployment, a structured approach that minimizes customer impact while improving stability.

Ring Deployment is a phased approach to releasing software, where changes are gradually rolled out to predefined “rings” of users or systems. Each ring represents a group of users with different levels of exposure to the new release. This incremental rollout helps mitigate risk by allowing us to monitor the impact of changes at each stage before scaling up to a broader audience.

Key Principles of Ring Deployment

  1. Minimize Blast Radius: Releases roll out in controlled stages so that a faulty change affects as few customers as possible.
  2. Leverage Customer Data: Some failures only emerge with real customer data. Testing with a subset of users allows us to identify and address these issues before a full rollout.
  3. Automation: Ring Deployment heavily relies on automation to trigger, monitor, and manage releases, enabling faster response times in case of errors.
  4. Visibility: Clear monitoring and observability at each phase ensure that teams can track progress and quickly identify issues.

Ring Deployment vs. Canary Releases

In essence, Ring Deployment can be viewed as multiple canaries, with each subsequent ring getting progressively larger. The process typically starts with a small canary phase in the first ring, followed by gradual expansion to larger rings as the release proves to be stable.
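
The ring-by-ring idea can be sketched in a few lines of Python. This is an illustration only, not Atlan's implementation: the `Ring` class, the `roll_out` function, and the `deploy` and `is_healthy` callbacks are hypothetical names introduced here. The sketch deploys to each ring in order and stops at the first ring that fails its health check, so larger rings never receive the faulty release.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Ring:
    """A named group of customers that receives a release together (hypothetical shape)."""
    name: str
    customers: List[str]


def roll_out(
    rings: List[Ring],
    deploy: Callable[[str], None],
    is_healthy: Callable[[Ring], bool],
) -> List[str]:
    """Deploy ring by ring in the given order and stop at the first unhealthy ring.

    Returns the names of the rings that were deployed and passed their health check.
    """
    completed: List[str] = []
    for ring in rings:
        for customer in ring.customers:
            deploy(customer)
        if not is_healthy(ring):
            # Halt here: later (larger) rings never receive the faulty release.
            break
        completed.append(ring.name)
    return completed


# Example: three rings of growing size; the second ring reports a problem.
rings = [
    Ring("ring-0", ["internal"]),
    Ring("ring-1", ["acme", "globex"]),
    Ring("ring-2", ["initech", "umbrella", "hooli", "stark"]),
]
deployed: List[str] = []
completed = roll_out(rings, deployed.append, lambda ring: ring.name != "ring-1")
```

In this example only the first two rings are touched, and only `ring-0` counts as completed. The four customers in `ring-2` are never exposed to the release.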

Integration: Bringing Ring Deployment to Life

“The success of a theory is not in its abstraction, but in its ability to adapt to the complexities of the real world and inform effective practice.” — Donald A. Schön

At Atlan, our primary goal in implementing Ring Deployment was to reduce the blast radius of deployments. To achieve this, we introduced the concept of a Ring-Cohort, which determines how changes are deployed across customer clusters.

Understanding Ring-Cohorts

  • Ring: A ring represents a release environment. It defines a change set that customer environments synchronize with.
  • Cohort: A collection of customers selected based on specific criteria, such as connector type, customer health score, internal vs. external users, and risk level.
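
To make the cohort idea concrete, here is a minimal sketch of cohort selection using the criteria listed above. The `Customer` fields, the 0 to 1 health-score scale, the three risk labels, and the `select_cohort` function are all assumptions made for illustration; the actual cohort definitions may look quite different.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumed ordering of risk labels, lowest risk first.
RISK_ORDER = {"low": 0, "medium": 1, "high": 2}


@dataclass(frozen=True)
class Customer:
    """Hypothetical customer record carrying the selection criteria."""
    name: str
    connector_type: str
    health_score: float  # assumed scale: 0.0 (poor) to 1.0 (excellent)
    internal: bool
    risk_level: str  # one of the keys in RISK_ORDER


def select_cohort(
    customers: List[Customer],
    *,
    connector_type: Optional[str] = None,
    min_health_score: float = 0.0,
    internal: Optional[bool] = None,
    max_risk: str = "high",
) -> List[str]:
    """Return the names of customers that match every given criterion."""
    selected: List[str] = []
    for c in customers:
        if connector_type is not None and c.connector_type != connector_type:
            continue
        if c.health_score < min_health_score:
            continue
        if internal is not None and c.internal != internal:
            continue
        if RISK_ORDER[c.risk_level] > RISK_ORDER[max_risk]:
            continue
        selected.append(c.name)
    return selected


customers = [
    Customer("internal-sandbox", "snowflake", 0.9, True, "low"),
    Customer("acme", "snowflake", 0.8, False, "low"),
    Customer("globex", "bigquery", 0.95, False, "medium"),
    Customer("initech", "snowflake", 0.4, False, "high"),
]

# A cautious early cohort: healthy, low-risk customers on one connector type.
early_cohort = select_cohort(
    customers, connector_type="snowflake", min_health_score=0.7, max_risk="low"
)
# All external customers, regardless of the other criteria.
external_cohort = select_cohort(customers, internal=False)
```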

How It Works

Developers simply select the desired ring and cohort, and the rest is automated:

  1. Automatic Deployment: Changes are deployed to selected customer cohorts.
  2. Automated Failure Notifications: A monitoring process runs periodically, updating cohort statuses and notifying engineers if issues arise.
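
A single pass of such a monitoring process might look like the sketch below. It is not the actual monitor: `monitor_once`, the `check_health` and `notify` callbacks, and the `"healthy"` and `"failed"` status labels are hypothetical. A scheduler (for example, a cron-style job) would be assumed to call this function periodically.

```python
from typing import Callable, Dict, List


def monitor_once(
    cohorts: Dict[str, List[str]],
    check_health: Callable[[str], bool],
    notify: Callable[[str, List[str]], None],
) -> Dict[str, str]:
    """Check every customer in every cohort once and return a status per cohort.

    A cohort is marked "failed" if any of its customers is unhealthy, in which
    case notify() is called with the cohort name and the failing customers.
    """
    statuses: Dict[str, str] = {}
    for cohort, customers in cohorts.items():
        failing = [c for c in customers if not check_health(c)]
        statuses[cohort] = "failed" if failing else "healthy"
        if failing:
            notify(cohort, failing)
    return statuses


# Example run with one unhealthy customer.
cohorts = {"internal": ["sandbox"], "low-risk": ["acme", "globex"]}
unhealthy = {"globex"}
alerts: List[tuple] = []

statuses = monitor_once(
    cohorts,
    check_health=lambda customer: customer not in unhealthy,
    notify=lambda cohort, failing: alerts.append((cohort, failing)),
)
```

Passing the health check and the notifier in as callbacks keeps the loop independent of any particular metrics backend or alerting channel.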

Key Improvements

Higher Success Rate

  • Ring Deployment significantly reduces the blast radius of faulty deployments.
  • Since its implementation, we’ve observed consistently higher success rates and fewer failures related to deployments.

Automation for Scalability & Reduced Cognitive Load

  • Every aspect of the release process is automated, minimizing manual intervention and reducing developer fatigue.

Automatic Release Management

  • Cohorts are predefined using a mix of metrics.
  • Engineers no longer need to manually curate and create canary configurations—just select a cohort label.

Automated Monitoring & Feedback

  • Feedback loops are fully automated across each ring.
  • Developers receive real-time alerts on deployment successes and failures.

Fast Rollback of Broken Releases

  • Simply closing the PR triggers a GitHub action that expires the ring configuration.
  • No more stressful manual rollbacks or waiting for approval.
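
The rollback trigger can be pictured as a small handler for GitHub's `pull_request` event, whose payload carries an `action` field and a `pull_request.number`. The sketch below is an assumption-laden illustration rather than the real workflow: the `expire_ring_on_close` function, the `ring_configs` store keyed by PR number, and the `expires_at` field are hypothetical, and the sketch does not model any special handling of merged PRs.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Optional


def expire_ring_on_close(
    event: Dict[str, Any],
    ring_configs: Dict[int, Dict[str, Any]],
    now: datetime,
) -> Optional[int]:
    """Expire the ring configuration tied to a PR when that PR is closed.

    Returns the PR number whose configuration was expired, or None if the
    event was not a close or no configuration exists for the PR.
    """
    if event.get("action") != "closed":
        return None
    number = event.get("pull_request", {}).get("number")
    config = ring_configs.get(number)
    if config is None:
        return None
    # Hypothetical field: customer clusters stop syncing once this is set.
    config["expires_at"] = now
    return number


# Example: an "opened" event is ignored, a "closed" event expires the config.
ring_configs = {101: {"ring": "ring-1", "expires_at": None}}
now = datetime(2025, 1, 1, tzinfo=timezone.utc)

ignored = expire_ring_on_close(
    {"action": "opened", "pull_request": {"number": 101}}, ring_configs, now
)
expired = expire_ring_on_close(
    {"action": "closed", "pull_request": {"number": 101}}, ring_configs, now
)
```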

Engineer Bob: “This is a game changer for me. I can focus my energy on the actual task instead of the procedure, and I’m enjoying a much better work-life balance.”

Challenges and Lessons Learned

The transition to Ring Deployment wasn’t entirely smooth. While most users quickly embraced the new process and appreciated its structured approach, scaling it across Atlan’s engineering ecosystem introduced a new set of challenges.

One of the biggest hurdles was iterating on deployment workflows within GitHub Actions. What seemed straightforward in theory turned out to be more complex in practice. Testing new deployment configurations required setting up a dedicated framework—one that could validate changes without affecting real production environments. However, creating and maintaining these test environments demanded time and resources, making each iteration slower than anticipated.

Another challenge was ensuring deployment safety at scale. While Ring Deployment provided a structured rollout process, we needed a failsafe to prevent incorrect releases from propagating. Fortunately, the Ring Deployment Config acted as a critical guardrail. By enforcing checks before updates reached customer clusters, we were able to catch potential issues early and prevent unintended changes from impacting users.

Despite these challenges, the benefits of Ring Deployment far outweighed the difficulties. With each iteration, we refined our processes, improved automation, and reduced the cognitive load on developers.

What’s Next

While Ring Deployment has already made a significant impact, we see even greater potential ahead. The next step is to expand its adoption beyond the initial services. What started as a targeted initiative has now become a foundational part of Atlan’s deployment strategy. It’s frequently referenced in Root Cause Analyses (RCAs), proving its effectiveness in minimizing risks. By rolling it out across more teams and services, we can further reduce the blast radius of faulty deployments and ensure even greater system stability.

Another critical area for improvement is enhancing the integration between Ring Deployment and QA. Right now, our feedback loops rely on post-deployment monitoring, but there’s an opportunity to bring more detailed analytics into the testing phase itself. By feeding real-time error reports and performance insights from each ring stage directly to QA, we can accelerate bug identification, reduce flaky errors, and streamline automated workflows. This tighter integration between development and QA will lead to faster releases and fewer issues making their way into production.

Finally, we’re exploring the idea of multi-stage ring deployment to limit the blast radius even further. By introducing more granular stages within each ring, we can make the transition from testing to general availability even more controlled. This would allow us to adjust rollout speed dynamically based on observed stability, giving us even finer control over how and when features reach customers.

If solving challenges like these excites you, check out our careers page—we’re always looking for engineers who are passionate about building scalable systems and improving developer experiences.

References

Ring Deployment Model for Apps – Advantages and Best Practices for Enterprises

A Modern Approach To Software Quality | mabl
