SSL certificates with ticket alerts

A UK-based service provider runs an automated multi-tenant SaaS platform on which every customer can host their service under their own web address. Each custom domain comes with an SSL/TLS certificate that the platform issues and renews automatically through Let’s Encrypt: an unattended certificate lifecycle that protects every customer’s site without anyone touching it. Most renewals succeed silently. When one doesn’t (a customer’s DNS record was retracted, an ACME rate limit was hit, an IIS binding clashed), the platform turns the error into a Tickiti ticket in the ops queue rather than firing an exception email. The team gets one structured alert, with a deep link, a priority and the same escalation rules as any other queue, instead of an inbox full of identical mails. This case study describes the certificate operations and the ticket-based alerting pattern that supports them.

The certificate lifecycle

Each tenant’s certificate goes through the same automated lifecycle:

Issuance. When a customer points a custom domain at the platform’s IP, an ACME HTTP-01 challenge with Let’s Encrypt issues a fresh certificate and the platform binds it to the customer’s site. From the customer’s side: enter a hostname, point DNS at the IP shown, and within a few minutes the badge flips to “Live”.
Live. The certificate serves every customer request on that hostname. Until the next renewal there is nothing to do.
Scheduled renewal. The ACME client renews the certificate before expiry. The renewal is fully unattended; on success, the new certificate replaces the old in place and the next renewal is scheduled.

The same lifecycle covers every tenant the service provider hosts. Most certificates spend almost all of their time in the “live” state, with the renewal job doing its work in the background.
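The scheduling decision behind "renews the certificate before expiry" can be sketched in a few lines. This is a minimal illustration, not the platform's code: the 30-day threshold and the function names are assumptions, since the case study only says that renewal happens before expiry.

```python
from datetime import datetime, timedelta

# Assumption: attempt renewal once fewer than 30 days of validity remain.
# The case study does not state the platform's actual threshold.
RENEW_BEFORE = timedelta(days=30)


def next_renewal_at(not_after: datetime,
                    renew_before: timedelta = RENEW_BEFORE) -> datetime:
    """Earliest moment the scheduler should attempt the next renewal."""
    return not_after - renew_before


def renewal_due(not_after: datetime, now: datetime,
                renew_before: timedelta = RENEW_BEFORE) -> bool:
    """True when the certificate has entered its renewal window."""
    return now >= next_renewal_at(not_after, renew_before)
```

On success the job would store the new certificate's expiry and schedule the next attempt with `next_renewal_at`; on failure it hands over to the alerting path described later.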

The SSL/TLS certificate lifecycle. Every stage can fail occasionally; every failure becomes a single Tickiti ticket in the ops queue, inheriting the same queueing, escalation and audit-trail behaviour as customer tickets.

When the automation can’t finish on its own

Automated certificate management is robust most of the time, but a handful of failure modes always need a person:

DNS retracted. The customer’s A record was deleted or pointed elsewhere between issuance and the next renewal. The platform’s pre-flight DNS check fails before it even calls Let’s Encrypt. That is deliberate: Let’s Encrypt’s failed-validation budget (five failed validations per hostname, per account, per hour) is a finite resource.
ACME rate limit hit. A run of bad attempts on a noisy customer domain has exhausted the issuance budget. Further attempts will be refused for a window.
IIS binding conflict. Two tenants somehow ended up claiming the same hostname (a registration race, a renamed tenant), and the platform refuses to bind both.
ACME client unavailable. The certificate client itself failed to run — missing binary, environment fault, process timeout. The platform can’t self-repair this.
Certificate silently approaching expiry. Renewal has been failing on a paying customer’s live domain for long enough that the cert is in danger of expiring.

These are the events the service provider must know about — promptly, reliably, and with enough context to act.
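The pre-flight DNS check mentioned under "DNS retracted" might look something like the sketch below. It assumes the platform serves tenants from a single known IPv4 address; the function names and the injectable resolver are illustrative, not taken from the platform.

```python
import socket
from typing import Callable, Iterable


def resolve_ipv4(hostname: str) -> set[str]:
    """Resolve a hostname to its IPv4 addresses; empty set if it does not resolve."""
    try:
        infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET)
    except socket.gaierror:
        return set()
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return {info[4][0] for info in infos}


def preflight_dns_ok(hostname: str, platform_ip: str,
                     resolver: Callable[[str], Iterable[str]] = resolve_ipv4) -> bool:
    """True only if the hostname currently resolves to the platform's IP.

    Running this before contacting the CA means a retracted record costs
    nothing from the failed-validation budget.
    """
    return platform_ip in set(resolver(hostname))
```

The resolver is passed in so the check can be exercised without real DNS; in production the default would be used.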

Why email exceptions don’t work for this

The instinctive approach — let the failing job send an exception email to ops — falls down in production:

An inbox isn’t a queue. Exception emails arrive in someone’s personal mailbox. There is no shared assignment, no claim, no “who’s on this”.
No deduplication. A persistent root cause — a flapping DNS, a stuck client — sends the same exception every few minutes for hours. The signal drowns in repetition.
No escalation. An unread alert at 2am stays unread until the morning. Email has no built-in “if nobody acted within an hour, do X”.
No audit trail. Tracing what happened to a past cert incident means digging through mail. The system of record is everyone’s inbox at once, which is the same as no system of record.
No structured metadata. Exception emails are free text. You can’t filter, count, or build dashboards from them — the rich operational reporting the team already has for customer tickets simply doesn’t exist for ops issues.
Mail itself is fragile. Spam filters, mail-server hiccups, addresses that quietly stop forwarding — an alerting channel built on email has more failure modes than the system it’s alerting on.

Tickiti alerts as the replacement

The platform has a single small helper that any service can call when it detects a problem: it raises a Tickiti ticket via the create-ticket API instead of sending an email. Each call carries the data that makes the ticket actionable:

An alert key — a stable identifier such as cert_issuance_failed, cert_renewal_failed, dns_retracted or acme_client_unavailable. The key is both the dedup grouping and the lever the operator pulls if a class of alert is too noisy.
A subject — one line including the affected hostname, so the queue is scannable at a glance.
An HTML body — the underlying error, the relevant tenant, any links.
A priority — so an expiring-cert alert outranks a low-priority informational one.
A deep link back to the affected tenant in the platform’s admin UI — one click from the ticket to the place an operator can act.

The resulting ticket lands in the ops queue. From there, the platform’s engineers work it the same way the support team works customer tickets: assign, prioritise, respond, close. The plumbing is the existing ticket workspace; no separate tool, no separate dashboard.
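As a rough sketch, the five pieces of data listed above could be carried in one small value object. The field names, the priority values and the `build_renewal_alert` helper are hypothetical; the case study does not document the actual shape of the create-ticket API payload.

```python
from dataclasses import asdict, dataclass
from html import escape


@dataclass(frozen=True)
class OpsAlert:
    alert_key: str   # stable identifier, e.g. "cert_renewal_failed"
    subject: str     # one line that includes the affected hostname
    html_body: str   # underlying error, relevant tenant, any links
    priority: str    # hypothetical values such as "low", "normal", "high"
    deep_link: str   # URL of the affected tenant in the admin UI


def build_renewal_alert(hostname: str, tenant: str, error: str,
                        admin_url: str, priority: str = "high") -> OpsAlert:
    """Assemble an alert for a failed renewal, escaping untrusted text."""
    body = (
        f"<p>Renewal failed for <b>{escape(hostname)}</b> "
        f"(tenant {escape(tenant)}).</p>"
        f"<pre>{escape(error)}</pre>"
        f'<p><a href="{escape(admin_url)}">Open tenant</a></p>'
    )
    return OpsAlert(
        alert_key="cert_renewal_failed",
        subject=f"Certificate renewal failed: {hostname}",
        html_body=body,
        priority=priority,
        deep_link=admin_url,
    )
```

`asdict(alert)` turns the object into a plain dictionary that a thin API wrapper could serialise.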

The platform already knows when something has gone wrong. The job of the alerting helper is to make sure a person finds out, exactly once, with the right context.

The safeguards that make this safe to use everywhere

Three properties of the helper turn it from “a way to spam tickets” into “a way to alert ops without ever flooding them”:

Per-alert kill switch. Every alert key has an “armed” flag in an operator-facing admin page. Persistent noise on one key while the root cause is being fixed? Disarm it; the helper records the suppression with counters but raises nothing. Re-arm when ready. No code change required to silence a flapping alarm.
Rate-limit deduplication. A five-minute window per (alert key, content) collapses storms into one ticket. The first cert-renewal failure for a host raises a ticket; the next ten in that window do not. When the next window opens, if the problem persists, a second ticket is raised — so “we know” and “still happening” are both visible without the queue flooding.
Fail-safe dispatch. If the call to Tickiti itself fails — the API unreachable, an auth issue — the helper logs and returns. It never bubbles the alerting failure back into the calling job. The platform does not create an alert storm for itself when the alert channel hiccups.

Inside the alerting helper. The kill switch and the dedup window guard every dispatch, so the same code path runs in every error handler without ever flooding the queue. Once the ticket is in, the queue’s escalation rules pick up automatically.
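The three safeguards can be sketched together in one small class. This is an illustration under assumptions, not the platform's helper: state is held in memory, `send` stands in for a wrapper around the create-ticket API, and the dedup window is recorded only after a successful dispatch so that a failed send can be retried on the next call.

```python
import hashlib
import logging
import time
from typing import Callable

log = logging.getLogger("ops_alerts")

DEDUP_WINDOW_SECONDS = 300  # the five-minute window described above


class AlertHelper:
    def __init__(self, send: Callable[[str, str, str], None],
                 clock: Callable[[], float] = time.monotonic):
        self._send = send    # hypothetical wrapper around the create-ticket API
        self._clock = clock
        self._disarmed: set[str] = set()
        self._last_sent: dict[tuple[str, str], float] = {}
        self.suppressed: dict[str, int] = {}  # suppression counters per alert key

    def disarm(self, alert_key: str) -> None:
        self._disarmed.add(alert_key)

    def arm(self, alert_key: str) -> None:
        self._disarmed.discard(alert_key)

    def raise_alert(self, alert_key: str, subject: str, body: str) -> bool:
        """Return True if a ticket was dispatched, False if suppressed or failed."""
        # 1. Kill switch: count the suppression, raise nothing.
        if alert_key in self._disarmed:
            self._count(alert_key)
            return False
        # 2. Dedup: one ticket per (alert key, content) per window.
        digest = hashlib.sha256(f"{subject}\n{body}".encode()).hexdigest()
        now = self._clock()
        last = self._last_sent.get((alert_key, digest))
        if last is not None and now - last < DEDUP_WINDOW_SECONDS:
            self._count(alert_key)
            return False
        # 3. Fail-safe dispatch: log and return, never raise into the caller.
        try:
            self._send(alert_key, subject, body)
        except Exception:
            log.exception("alert dispatch failed for %s", alert_key)
            return False
        self._last_sent[(alert_key, digest)] = now
        return True

    def _count(self, alert_key: str) -> None:
        self.suppressed[alert_key] = self.suppressed.get(alert_key, 0) + 1
```

A real deployment would presumably keep the armed flags and window state somewhere shared, so that every service instance sees the same kill switch.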

Escalations — the headline benefit

The most valuable single change from moving off exception emails is that escalations become a configured property of the queue, not a personal habit. A cert-renewal failure that nobody touches within an hour can:

re-prioritise itself,
bounce to a different assignee,
notify a second tier,
show on the team’s “stalled” perspective for the duty manager,

…using the same escalation rules the team already runs on customer tickets. SSL incidents become part of the same operational rhythm as every other inbound issue — no separate plumbing, no separate tooling, no “but who watches the alerts mailbox at 2am” question.
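To make the "nobody touches it within an hour" rule concrete, here is a toy model of the decision an escalation rule makes. It is purely illustrative: in practice the rule is configured on the Tickiti queue rather than written as code, and the `Ticket` fields and action names below are invented for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

ESCALATE_AFTER = timedelta(hours=1)  # the one-hour example used above


@dataclass
class Ticket:
    created_at: datetime
    touched: bool = False      # has anyone assigned, replied or commented?
    priority: str = "normal"
    escalated: bool = False


def escalate_if_stalled(ticket: Ticket, now: datetime,
                        threshold: timedelta = ESCALATE_AFTER) -> list[str]:
    """Escalate an untouched ticket once; return the actions taken."""
    if ticket.touched or ticket.escalated:
        return []
    if now - ticket.created_at < threshold:
        return []
    ticket.priority = "high"
    ticket.escalated = True
    return ["reprioritise", "notify_second_tier"]
```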

A typical SSL incident, end to end

A renewal for a customer’s custom hostname fires overnight. The customer accidentally deleted their A record while reorganising DNS the previous afternoon. The platform’s pre-flight DNS check fails before it calls Let’s Encrypt. The renewal job catches the error and calls the alerting helper with the key cert_renewal_failed, a priority that reflects that this is a paying customer’s live domain, the underlying error, and a deep link back to the tenant in the admin UI. A single ticket appears in the ops queue. At 2am nobody is watching, so the queue’s escalation rule re-prioritises the ticket after an hour and notifies the duty engineer’s pager. They click the deep link, identify the DNS problem, and reach out to the customer through the same Tickiti workspace. The renewal is rescheduled for the moment DNS is fixed; the ticket is closed. The whole story (alert, escalation, customer conversation and resolution) lives on a single ticket and is findable by hostname in seconds.
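The first half of that story, from the failed pre-flight check to the helper call, could be wired up roughly as follows. Everything here is a sketch: `run_renewal`, `DnsPreflightError`, the callable parameters and the admin URL pattern are hypothetical stand-ins for whatever the platform actually uses.

```python
from html import escape
from typing import Callable


class DnsPreflightError(Exception):
    """Hypothetical error: the hostname no longer points at the platform."""


def run_renewal(hostname: str, tenant_id: str,
                dns_ok: Callable[[str], bool],
                renew: Callable[[str], object],
                raise_alert: Callable[..., object],
                admin_base: str = "https://admin.example.invalid") -> bool:
    """Attempt one renewal; on any failure, raise a single ops alert."""
    try:
        # Check DNS first so a retracted record never reaches the CA.
        if not dns_ok(hostname):
            raise DnsPreflightError(
                f"{hostname} does not resolve to the platform IP")
        renew(hostname)
        return True
    except Exception as exc:
        raise_alert(
            alert_key="cert_renewal_failed",
            subject=f"Certificate renewal failed: {hostname}",
            html_body=f"<pre>{escape(str(exc))}</pre>",
            priority="high",  # assumed value for a paying live domain
            deep_link=f"{admin_base}/tenants/{tenant_id}",
        )
        return False
```

The job itself never sends mail or decides who is on call; it reports once and leaves assignment and escalation to the queue.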

At-a-glance summary

Alert key	When it fires	What the operator does
`cert_issuance_failed`	The initial issuance ACME challenge fails on a new custom domain	Diagnose the failure (rate-limit, DNS, IIS), fix, retry
`cert_renewal_failed`	A scheduled renewal can’t complete	Triage the cause; contact the customer if it’s their DNS
`dns_retracted`	The customer’s DNS no longer resolves to the platform’s IP	Reach out to the customer to restore their A record
`acme_client_unavailable`	The ACME client itself failed to run	Investigate the environment; restore the client

What the service provider gets

One queue for SSL incidents alongside every other operational issue. No separate alerting tool, no separate inbox, no separate dashboard to learn.
Escalations applied automatically. Unanswered alerts re-prioritise, reassign and notify the next tier under the same rules as customer tickets.
Audit trail per incident. Every alert, every escalation, every customer conversation about a cert failure lives on one Tickiti ticket, indexed by hostname and alert key.
No flooding. The kill switch and the five-minute dedup window keep a root-cause storm to a manageable handful of tickets, not hundreds of emails.
Structured reporting. The same dashboards that report on customer tickets cover ops alerts — how often each key fires, mean time to acknowledge, mean time to resolve, which alerts are loudest, which never fire.
No lost emails. The alert channel is the same channel the team already trusts for customer work.

Where to go next

Escalations — the queue-level rules that make alerts unmissable.
Create ticket API — the endpoint behind the alerting helper.
Multi-party approvals and Automated sales with manual review — the other two case studies in this section.