The phone vibrates at 3:47 AM. Does it have to?
You wake up before you hear the sound. You know what it is before you look at the screen. The container is dead, again, and yesterday evening you were working on exactly that service. By 4:20 AM you have restarted it and gone back to bed. The next day you work at fifty percent.
A one-hour session with one of our engineers identifies the recurring failure patterns in your current stack, the self-healing configuration suited to your case, and the expected effect on your mean time to recovery. You receive a written report. It stays yours, regardless of the next step.
Recovery measured in seconds.
When a container fails for any reason, Kubernetes creates a new one in seconds. Total downtime drops from the minutes of a manual restart to three seconds of automatic replacement.
Health verified before traffic.
Liveness probes detect frozen containers and replace them. Readiness probes verify a container is actually ready before routing user traffic to it. No more half-broken instances serving errors.
Nights that stay nights.
The vast majority of incidents that wake your team are transient failures of single instances. Self-healing absorbs them silently. The phone vibrates only for what actually requires a human.
The oversight is you.
What accumulates over time is not the frustration of the individual incident. It is the awareness that your systems require human oversight to stay up, and that the oversight is you.
The phone vibrates at 3:47 AM. You wake up before you hear the sound. You know what it is before you look at the screen. The container is dead, again, and yesterday evening you were working on exactly that service. You open the laptop with burning eyes. You search the logs. You find an OutOfMemoryError, or a deadlock, or an exception on an external call that stopped responding.
It is quarter to four, and at least twenty minutes will pass before you understand enough to decide whether a restart is enough or whether something more is needed. You restart. You wait for the metric to turn green. You send a message to the on-call channel with a quick summary. You go to bed at 4:20. The day after you work at fifty percent, and nobody runs the post-mortem because priority has already shifted elsewhere.
A week later the same container dies again, at a similar hour, for an apparently identical but actually different reason. Same procedure, same four in the morning. The cost is rarely tracked on any budget line. It is distributed across degraded productivity the day after, accumulated team fatigue, post-mortems that never happen, and the slow erosion of the engineers who do this for a living.
What changes when the system recovers itself.
Mean time to recovery drops to seconds.
From the minutes of a human restart to the seconds of an automatic replacement. The failure is logged, the recovery is logged, the users do not notice.
The on-call rotation stops eroding the team.
Engineers sleep the nights they are on-call. Those who are paged are paged for incidents that actually require a human, not for failures the system could have absorbed itself.
Post-mortems start happening.
When recoveries are automatic and logged, the team has time and bandwidth to investigate the underlying cause the next day. Patterns become visible. Fixes get scheduled instead of skipped.
Hiring conversations change tone.
Senior engineers ask about on-call expectations in interviews. Being able to answer that the rotation is a calm one, supported by self-healing infrastructure, is a real recruiting argument.
The phone does not have to vibrate.
The architecture that produces this behaviour stems from a reversal of the standard operational model. You no longer manage instances. You declare a desired state, and the system becomes responsible for keeping that state true at every moment.
The technology is native to Kubernetes. The implementation has three prerequisites. Whether your current stack meets them is the work of one focused session.
The mechanism, in plain terms.
You declare a desired state: three active copies of this application in production at all times. The component that enforces this is the ReplicaSet. If one of the three copies fails for any reason (process crash, node failure, external kill), Kubernetes immediately creates a new one. Two active health mechanisms on the individual instance are added on top of this. The liveness probe periodically queries the container with a check you define (an HTTP endpoint, a command, a TCP check), and if the container does not respond within the threshold, Kubernetes terminates and recreates it. The readiness probe does the complementary job: before user traffic is sent to a freshly started container, it verifies that the container is actually ready to receive it.
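As a rough illustration of the declaration described above, the sketch below builds a Deployment manifest as a plain Python dictionary and prints it as JSON, a format kubectl accepts alongside YAML. A Deployment is used because it creates and manages the ReplicaSet for you. The service name, image reference, port, probe paths, and timing values are all placeholders chosen for the example, not recommendations.

```python
import json


def build_deployment(name: str, image: str, replicas: int = 3) -> dict:
    """Return an apps/v1 Deployment manifest as a plain dict.

    Port 8080 and the /healthz and /ready paths are assumptions made
    for this sketch; a real service defines its own.
    """
    labels = {"app": name}
    container = {
        "name": name,
        "image": image,
        "ports": [{"containerPort": 8080}],
        # Liveness: the container is restarted after repeated failed checks.
        "livenessProbe": {
            "httpGet": {"path": "/healthz", "port": 8080},
            "periodSeconds": 10,
            "timeoutSeconds": 2,
            "failureThreshold": 3,
        },
        # Readiness: the pod receives traffic only while this check passes.
        "readinessProbe": {
            "httpGet": {"path": "/ready", "port": 8080},
            "periodSeconds": 5,
            "timeoutSeconds": 2,
            "failureThreshold": 2,
        },
    }
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            # Desired state: this many copies, kept true by the ReplicaSet.
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [container]},
            },
        },
    }


# Hypothetical service name and image, used only for illustration.
manifest = build_deployment("orders-api", "registry.example.com/orders-api:1.0.0")
print(json.dumps(manifest, indent=2))
```

The timing values shown are starting points at best. The right numbers come out of calibration against the real behaviour of the service.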
Whether the configuration is viable in your infrastructure depends on three prerequisites.
Applications that can be restarted without losing critical state.
Self-healing works by killing and recreating containers. If your application holds critical state in memory or in the local filesystem, a restart loses it. Identifying which services are restart-safe, and adapting the ones that are not, is the first work of the migration.
Health endpoints that mean something.
A liveness probe that pings /health and always returns 200 is worse than no probe at all. The endpoint must report the actual state of the application, including dependencies on critical resources. Defining what healthy means for each service is observation work, not just configuration.
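A minimal sketch of what "an endpoint that means something" could look like, assuming the service exposes its health over HTTP. The function below aggregates named checks and returns 503 as soon as one of them fails; Kubernetes HTTP probes treat any status from 200 up to (but not including) 400 as success, so a 503 counts as a failed probe. The check names and the two example checks are hypothetical.

```python
from typing import Callable, Dict, Tuple


def health_response(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, dict]:
    """Run every check and return (HTTP status, JSON-serialisable body).

    A check that raises is treated as failed, so a broken dependency
    can never be reported as healthy by accident.
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    healthy = all(results.values())
    return (200 if healthy else 503), {"healthy": healthy, "checks": results}


# Hypothetical checks: a real service would test its own worker threads,
# queues, connection pools, and so on.
def worker_alive() -> bool:
    return True


def cache_reachable() -> bool:
    raise ConnectionError("cache timeout")


status, body = health_response({"worker": worker_alive, "cache": cache_reachable})
print(status, body)
```

Which checks belong behind the liveness probe and which behind the readiness probe is a per-service decision. A dependency that a restart cannot fix, such as a database that is down, may be better reported through readiness, so the instance stops receiving traffic without being killed in a loop.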
Thresholds calibrated to real failure patterns.
With a threshold that is too aggressive, the container gets restarted during legitimate slow operations. With one that is too permissive, the failure stays in production for minutes. The correct values depend on the real behaviour of your application under load, and require a calibration session, not a default config.
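The trade-off can be made concrete with some rough arithmetic. The sketch below assumes a simplified model in which a probe fires every `period_seconds`, each failed attempt takes up to `timeout_seconds` to fail, and the container is restarted after `failure_threshold` consecutive failures. It ignores scheduling jitter and initial delays, so the results are approximations. The three example configurations are invented for illustration.

```python
def earliest_restart_seconds(period_seconds: int, failure_threshold: int,
                             timeout_seconds: int) -> int:
    """Approximate shortest unresponsive stretch that can trigger a restart.

    Assumes the container stops responding just before a probe fires.
    """
    return (failure_threshold - 1) * period_seconds + timeout_seconds


def latest_detection_seconds(period_seconds: int, failure_threshold: int,
                             timeout_seconds: int) -> int:
    """Approximate longest time a dead container can go undetected.

    Assumes the container stops responding just after a successful probe.
    """
    return failure_threshold * period_seconds + timeout_seconds


# (label, period, failure threshold, timeout): illustrative values only.
configs = [
    ("aggressive", 2, 1, 1),
    ("moderate", 10, 3, 2),
    ("permissive", 60, 5, 10),
]

for label, period, threshold, timeout in configs:
    low = earliest_restart_seconds(period, threshold, timeout)
    high = latest_detection_seconds(period, threshold, timeout)
    print(f"{label}: restart possible after {low}s, failure detected within {high}s")
```

Under this model, the aggressive configuration can restart a container that was merely busy for a second, while the permissive one can leave a dead container in place for around five minutes.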
When the three are aligned, the result is the configuration described above. When one is missing, the assessment identifies which one, and the path to address it.
What self-healing does not solve.
Self-healing solves exactly one class of problem: transient failures of the individual instance, which account for the vast majority of real nightly incidents. The value is bounded, but it is large in the area where it applies. It is worth saying clearly what falls outside.
- Systemic application bugs. If your application always crashes on the same query, the system will keep restarting it in a loop. You have to fix the code.
- External dependencies that fail. If the database is down, restarting the application does not help. Recovery depends on whatever runs the database.
- Monitoring and observability. Self-healing replaces manual recovery, not the tools that tell you what is happening in your systems. The two work together.
None of these are abstract problems. Each one is a real piece of work with its own scope, methodology, and timeline. The assessment names them when they are present, and indicates which ones to address alongside or after the self-healing setup. The architecture is a foundation. It is not the whole building.
One hour. One engineer. One written report.
A working session, structured around your stack.
One hour with one of our engineers. Remote, scheduled within 72 hours of request. Paid engagement, billed independently of any work that may or may not follow.
The recurring failure patterns in your current stack.
The incidents from the last six months that woke your team. What was restarted, by whom, at what hour. Which services fail most often. Which ones hold critical state. The current health endpoint coverage, if any.
A written report, with a clear recommendation.
The recurring failure patterns identified. The self-healing configuration suited to your case. The thresholds and probes to implement. The expected effect on your mean time to recovery, with a realistic timeline.
The report belongs to you.
Use it internally, share it with your CTO, share it with another vendor. There is no obligation to proceed with Graftholders. The assessment is paid; the analysis is yours to keep regardless of the next step.
Why this works as a written report.
Most infrastructure proposals come from vendors selling a fixed configuration and adapting the diagnosis to fit it. Our model is the opposite: the diagnosis is the product. The intervention, if any, follows.
A written report does three things a conversation cannot. It survives the meeting. It can be circulated internally to your CTO, your Head of Platform, your on-call team: people who weren't in the room but whose nights depend on the analysis. And it commits the recommendation to writing, which is the only form in which operational decisions can be evaluated rigorously and turned into a roadmap.
The session is paid for a reason: it filters both sides. We commit a focused hour and a written deliverable to every case, which means we only run assessments where the case actually warrants one. In a meaningful portion of them, the conclusion is that the current configuration is the right fit, and no intervention is recommended now. We say so, in writing.
The report is yours. Whether you proceed with Graftholders, with another vendor, or with no vendor at all, the analysis travels with you.
Engagement by stage.
Every engagement starts with the assessment. Everything that follows is optional, activated only when your case requires it.
Assessment
Diagnosis before configuration. Failure pattern analysis, self-healing recommendation, expected effect on MTTR mapped in a written report. The outcome is a clear recommendation, including the option of leaving the system as it is when that is the right answer.
Configuration and Calibration
Kubernetes, liveness probes, and readiness probes configured on your cluster with the thresholds identified in the assessment. Health endpoints validated against the actual application state. Tested under fault injection. Fixed-fee delivery, three tiers based on complexity.
Team Enablement
One to two days. Your team learns to operate the system: how to read probe signals, how to adjust thresholds when patterns change, how to add probes to new services. On-site or remote. Per-edition pricing, independent of headcount up to a defined cap.
Maintenance Pack
Prepaid days, used on demand. For clients who want ongoing coverage without a monthly retainer. Probe recalibration, post-incident reviews, periodic failure pattern analysis. Packs of 5, 10, or 20 days, twelve-month validity.
Optional
Stack details
Applications installed through GitOps. Standard CNCF and open-source components. No proprietary forks. Configured during cluster setup, declared in version control, deployed and updated through GitOps.
Four client profiles.
Three project stages.
Different starting points, the same methodology. Select where you are today.
You think Kubernetes is overkill for where you are now?
For early-stage startups that don't yet know what they need, a full Kubernetes cluster can feel out of reach. We build a simple, working prototype that lets you ship and scale without committing to infrastructure you don't need yet. This is one of our most requested services.
Early-stage product in production
Early-stage company with an MVP in production. Limited budget, no dedicated DevOps. Cloud costs already a concern.
Cluster on low-cost European cloud
Production-ready Kubernetes cluster with integrated GitOps pipeline. Optional application development on top of the cluster.
Predictable European infrastructure
Engineering team aligned on the cluster from delivery. Maintenance available on demand if required.
Regulatory or geographic constraints
Data-residency, latency, or sector-specific compliance requirements (healthcare, finance, public sector). Hyperscaler defaults do not fit.
Bare metal cluster, constraint-driven
Kubernetes cluster on European bare metal. Architecture designed around regulatory and geographic constraints. Integrated GitOps pipeline.
Infrastructure built for your constraints
Built for the client's constraints, not for the provider's catalog. Predictable costs.
Established team, stable workload
Significant in-house development team. Stable workload, governance requirements, long-term infrastructure ownership.
Workload-matched architecture
Kubernetes cluster on European bare metal with architecture matched to actual workload patterns. GitOps as operational methodology. Optional retainer.
Governed by your team
Infrastructure governed by the client's team, not by an external vendor. A partner available on demand.
Multi-region complexity
Multi-region operations, varied latency requirements, differing regulatory environments, distributed engineering teams.
Multi-region, compliance integrated
Multi-region Kubernetes architecture on European bare metal. Compliance integrated by design (IEC 62443, CRA, NIS2, GDPR). Cross-region GitOps.
Single point of contact
Compliance built in from day one, not retrofitted. One interlocutor across the full delivery chain.
Your product can't run on Kubernetes as it is?
If your application is not cloud-native and can't be containerized in its current form, we rebuild it. We rewrite the parts that block the migration, keep the business logic intact, and deliver a system that runs on modern infrastructure from day one.
Different profile?
A one-hour assessment defines the case, the constraints, and the right path forward.
Request assessment
Three ways to run Kubernetes.
One that keeps the system yours.
Do it yourself
- You build the cluster from scratch
- Maintenance is entirely on your team
- Significant in-house expertise required
- Long time to production
- Costs scale steeply with usage
- Dependent on the cloud provider you happened to choose
Done for you
- Faster initial setup
- No control over architectural decisions
- No access to the repository
- Lock-in by design
- Costs scale steeply with traffic
- You depend on the provider's chosen cloud vendor
Done with you
- Hybrid model: speed of managed, control of in-house
- You stay involved in every architectural decision
- Full access to the repository
- No vendor lock-in. The cluster is yours
- We choose the cloud provider that fits your case
- Standard upstream Kubernetes, no proprietary fork
Technical capabilities.
Kubernetes, desired state
You declare three active copies; the system keeps three active copies, no matter what fails in between.
Liveness probes
Periodic health checks on each container. Frozen instances are detected and recreated automatically.
Readiness probes
Traffic routes only to containers that have verified they are ready to serve. No half-initialised instances answer your users.
Calibrated thresholds
Probe timing tuned to your application's real behaviour under load. No false positives, no missed failures.
Provider-agnostic
Works on European bare metal, European cloud, US hyperscalers, or on-premise. The mechanism is the same everywhere.
Team ownership transferred
Documentation and runbooks delivered with the configuration. Your team operates and recalibrates the system after handover.
Related reading.
Technical articles and scenario analysis on operational reliability and self-healing systems.
The phone that no longer vibrates
On the architectural pattern that absorbs the nightly incidents your team currently absorbs in person. What it solves, what it doesn't.
Read the article →
Why your /health endpoint is lying to you
The endpoint that always returns 200 is worse than no endpoint at all. How to design probes that report the truth about your service.
Read the article →
The hidden cost of the on-call rotation
What an exhausted engineer costs the company the day after. Why operational reliability is now a hiring argument, not just a technical one.
Read the article →
Questions, answered directly.
We already have liveness and readiness probes configured. Is this still relevant?
Often yes. Having probes enabled and having probes that actually report the real state of the application are different things. The assessment evaluates whether your /health endpoint tells the truth, whether thresholds match real failure patterns, and whether Kubernetes behaviour is what you think it is.
What if our applications are not yet stateless?
The assessment identifies which services are restart-safe today and which ones need adaptation. The work to make a service restart-safe is sometimes minor (externalising session state to a cache) and sometimes substantial. The report indicates the path and the realistic effort for each case.
Will self-healing cause restart loops if the application is broken?
Yes. If the application has a systemic bug that crashes on the same input, the system will keep restarting it. Self-healing solves transient failures, not code defects. Backoff mechanisms and alerts on excessive restart rates are part of the configuration we deliver, so a real bug surfaces clearly instead of hiding behind constant restarts.
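To give a sense of how the two safeguards behave, here is a small sketch. The first function reproduces the restart back-off Kubernetes documents for crashing containers (a delay starting at 10 seconds, doubling on each restart, capped at five minutes). The second is a hypothetical alert rule that flags a service restarted too often within a time window; its window and limit are example values, not the ones used in any delivered configuration.

```python
from typing import List


def backoff_delays(restarts: int, base_seconds: int = 10,
                   cap_seconds: int = 300) -> List[int]:
    """Delay before each successive restart: doubling, capped at cap_seconds."""
    return [min(base_seconds * 2 ** i, cap_seconds) for i in range(restarts)]


def excessive_restarts(restart_timestamps: List[float], now: float,
                       window_seconds: float = 3600,
                       max_restarts: int = 3) -> bool:
    """True when more than max_restarts happened in the trailing window.

    window_seconds and max_restarts are illustrative defaults.
    """
    recent = [t for t in restart_timestamps if now - window_seconds <= t <= now]
    return len(recent) > max_restarts


print(backoff_delays(7))

# Hypothetical restart times, in seconds since an arbitrary start.
restart_times = [100.0, 700.0, 1500.0, 2600.0, 3500.0]
print(excessive_restarts(restart_times, now=3600.0))
```

The back-off keeps a looping container from consuming resources at full speed; the alert is what turns the loop into a visible signal that the code needs fixing.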
How long does the implementation typically take?
It depends on the number of services that need adaptation, the complexity of their health logic, and whether observability of failure patterns already exists. Probe configuration alone is a few days. The calibration cycle and the adaptation of stateful services extend it. The exact timeline is in the assessment report.
What does the configuration and calibration cost?
Fixed fee based on complexity. Three tiers, set by the number of services and the maturity of existing observability. The exact figure is defined after the assessment, based on what we actually find in your infrastructure.
What happens after the configuration?
The configuration, the documentation, and the operational knowledge belong to your team. Recalibration after major releases is purchased on demand. No monthly retainer, no contractual lock-in.
Is Graftholders limited to self-healing work?
No. Managed Kubernetes infrastructure, software development, and industrial cybersecurity (IEC 62443, CRA, NIS2, GDPR) are part of the broader offering. Clients engage through the most relevant entry point.
The next night can stay a night.
One hour with one of our engineers. A written report covering the recurring failure patterns in your current stack, the self-healing configuration suited to your case, the thresholds and probes to implement, and the expected effect on your mean time to recovery. The report is yours, regardless of what you decide to do with it.
Request assessment
Response within 24 hours, with next steps
