Graftholders. The phone that no longer vibrates.

The phone vibrates at 3:47 AM. Does it have to?

You wake up before you hear the sound. You know what it is before you look at the screen. The container is dead, again, and yesterday evening you were working on exactly that service. By 4:20 AM you have restarted it and gone back to bed. The next day you work at fifty percent.

A one-hour session with one of our engineers identifies the recurring failure patterns in your current stack, the self-healing configuration suited to your case, and the expected effect on your mean time to recovery. You receive a written report. It stays yours, regardless of the next step.

Request assessment / See how it works

Same incident / two infrastructures

Scenario A / manual recovery

Out of service: 33 minutes / People awake: 1

3:47 / phone vibrates
3:55 / open logs
4:05 / identify cause
4:15 / manual restart
4:20 / back to bed

Scenario B / self-healing active

Out of service: 3 sec / People awake: 0

3:47:00 / container dies
3:47:01 / Kubernetes detects
3:47:02 / new container starts
3:47:03 / readiness ok
morning / you read the log

Same failure, two outcomes. In one, recovery costs a night. In the other, recovery costs three seconds and a line in the log.

Recovery measured in seconds.

When a container fails for any reason, Kubernetes creates a new one in seconds. Total downtime drops from the minutes of a manual restart to about three seconds of automatic replacement.

Health verified before traffic.

Liveness probes detect frozen containers and replace them. Readiness probes verify a container is actually ready before routing user traffic to it. No more half-broken instances serving errors.

Nights that stay nights.

The vast majority of incidents that wake your team are transient failures of single instances. Self-healing absorbs them silently. The phone vibrates only for what actually requires a human.

The problem

The oversight is you.

What accumulates over time is not the frustration of the individual incident. It is the awareness that your systems require human oversight to stay up, and that the oversight is you.

The phone vibrates at 3:47 AM. You wake up before you hear the sound. You know what it is before you look at the screen. The container is dead, again, and yesterday evening you were working on exactly that service. You open the laptop with burning eyes. You search the logs. You find an OutOfMemoryError, or a deadlock, or an exception on an external call that stopped responding.

It is a quarter to four, and at least twenty minutes will pass before you understand enough to decide whether a restart is enough or whether something more is needed. You restart. You wait for the metric to turn green. You send a message to the on-call channel with a quick summary. You go to bed at 4:20. The next day you work at fifty percent, and nobody runs the post-mortem because priorities have already shifted elsewhere.

A week later the same container dies again, at a similar hour, for an apparently identical but actually different reason. Same procedure, same four in the morning. The cost is rarely tracked on any budget line. It is distributed across degraded productivity the day after, accumulated team fatigue, post-mortems that never happen, and the slow erosion of the engineers who do this for a living.

The shift

What changes when the system recovers itself.

// 01

Mean time to recovery drops to seconds.

From the minutes of a human restart to the seconds of an automatic replacement. The failure is logged, the recovery is logged, the users do not notice.

// 02

The on-call rotation stops eroding the team.

Engineers sleep through the nights they are on call. Those who are paged are paged for incidents that actually require a human, not for failures the system could have absorbed itself.

// 03

Post-mortems start happening.

When recoveries are automatic and logged, the team has time and bandwidth to investigate the underlying cause the next day. Patterns become visible. Fixes get scheduled instead of skipped.

// 04

Hiring conversations change tone.

Senior engineers ask about on-call expectations in interviews. Being able to answer that the rotation is a calm one, supported by self-healing infrastructure, is a real recruiting argument.

The phone does not have to vibrate.

The architecture that produces this behaviour stems from a reversal of the standard operational model. You no longer manage instances. You declare a desired state, and the system becomes responsible for keeping that state true at every moment.
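As a rough illustration of the desired-state model, the sketch below reconciles a declared number of copies against the instances that are actually running. It is a toy model in Python, not how Kubernetes is implemented; the function `reconcile` and the instance names are hypothetical.

```python
from itertools import count


def reconcile(desired_replicas, running, new_names):
    """Return the instance list after one reconciliation pass.

    `running` holds the names of instances that are currently alive.
    Missing instances are replaced with fresh ones; surplus ones are dropped.
    """
    running = list(running)
    while len(running) < desired_replicas:
        running.append(next(new_names))  # start a replacement
    return running[:desired_replicas]    # trim any surplus


# Hypothetical instance names, for illustration only.
names = (f"app-{i}" for i in count(1))

state = reconcile(3, [], names)      # initial rollout: three copies
state.remove("app-2")                # one instance dies at 3:47
state = reconcile(3, state, names)   # the next pass restores three copies
print(state)
```

Nobody restarts `app-2` by hand. The declared count stays at three, and the loop closes the gap on its next pass.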

The technology is native to Kubernetes. The implementation has three prerequisites. Whether your current stack meets them is the work of one focused session.

The mechanism

The mechanism, in plain terms.

You declare a desired state: three active copies of this application in production at all times. The component that enforces this is called a ReplicaSet. If one of the three copies fails for any reason (process crash, node failure, external kill), Kubernetes immediately creates a new one.

Two active health checks on the individual instance complete the picture. The liveness probe periodically queries the container with a check you define (an HTTP endpoint, a command, a TCP check); if the container does not respond within the threshold, Kubernetes terminates and restarts it. The readiness probe does the complementary job: before user traffic is sent to a freshly started container, it verifies that the container is actually ready to receive it.
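To make the declaration concrete, here is a minimal sketch that builds such a manifest as a plain Python dictionary and prints it as JSON. It uses a Deployment, the usual way to have Kubernetes manage a ReplicaSet. The service name, image, port, endpoint path, and every timing value are placeholders chosen for illustration, not recommended settings.

```python
import json


def deployment_manifest(name, image, replicas=3, port=8080):
    """Build a Deployment manifest as a plain dict (serialisable to JSON)."""
    probe_target = {"path": "/health", "port": port}
    container = {
        "name": name,
        "image": image,
        "ports": [{"containerPort": port}],
        # Liveness: restart the container after repeated failed checks.
        "livenessProbe": {
            "httpGet": probe_target,
            "periodSeconds": 10,
            "timeoutSeconds": 2,
            "failureThreshold": 3,
        },
        # Readiness: keep traffic away until the check succeeds.
        "readinessProbe": {
            "httpGet": probe_target,
            "periodSeconds": 5,
            "failureThreshold": 1,
        },
    }
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,  # the desired state: three copies
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [container]},
            },
        },
    }


# Hypothetical service name and image, for illustration only.
manifest = deployment_manifest("orders", "registry.example.com/orders:1.0")
print(json.dumps(manifest, indent=2))
```

The values that matter most here are the ones the sketch treats as placeholders: the probe timings and the meaning of the `/health` response.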

Whether the configuration is viable in your infrastructure depends on three prerequisites.

Prerequisite 01

Applications that can be restarted without losing critical state.

Self-healing works by killing and recreating containers. If your application holds critical state in memory or in the local filesystem, a restart loses it. Identifying which services are restart-safe, and adapting the ones that are not, is the first work of the migration.

Prerequisite 02

Health endpoints that mean something.

A liveness probe that pings a /health endpoint which always returns 200 is worse than no probe at all. The endpoint must report the actual state of the application, including dependencies on critical resources. Defining what healthy means for each service is observation work, not just configuration.
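A health endpoint that reports real state could be sketched along these lines, assuming each service can express its checks as small functions. The helper `health_report`, the check names, and the failing database check are hypothetical, and the HTTP wiring is left out.

```python
import json


def health_report(checks):
    """Run each named check and return (http_status, json_body).

    `checks` maps a name to a zero-argument callable that returns True when
    the resource is usable. A check that raises counts as a failure.
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    healthy = all(results.values())
    status = 200 if healthy else 503
    return status, json.dumps({"healthy": healthy, "checks": results})


def database_reachable():
    # Hypothetical stand-in: a real check would run a cheap query with a timeout.
    raise ConnectionError("connection refused")


status, body = health_report({
    "worker_loop_alive": lambda: True,
    "database": database_reachable,
})
print(status, body)
```

Which checks feed the liveness probe and which feed the readiness probe is a per-service decision. Restarting the application does not repair a failed dependency, so dependency checks are often a better fit for readiness.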

Prerequisite 03

Thresholds calibrated to real failure patterns.

Set the threshold too aggressively and the container gets restarted during legitimately slow operations. Set it too permissively and the failure stays in production for minutes. The correct values depend on the real behaviour of your application under load, and require a calibration session, not a default config.
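The trade-off can be put into rough numbers. The sketch below assumes a probe that runs every `period_seconds`, waits up to `timeout_seconds` for an answer, and triggers a restart after `failure_threshold` consecutive failures. Both formulas are approximations that ignore startup delays, and the two example configurations are invented for illustration.

```python
def detection_window(period_seconds, failure_threshold, timeout_seconds):
    """Rough upper bound, in seconds, between a hang and the restart decision."""
    return period_seconds * failure_threshold + timeout_seconds


def safe_pause(period_seconds, failure_threshold):
    """Pauses shorter than this cannot accumulate enough consecutive
    failures to trigger a restart."""
    return period_seconds * (failure_threshold - 1)


# Two invented configurations: (period, failure threshold, timeout).
configs = {
    "aggressive": (2, 1, 1),
    "permissive": (30, 5, 10),
}

for label, (period, threshold, timeout) in configs.items():
    window = detection_window(period, threshold, timeout)
    pause = safe_pause(period, threshold)
    print(f"{label}: failure detected within ~{window}s, "
          f"pauses under {pause}s never cause a restart")
```

The aggressive setting reacts within about three seconds but tolerates no pause at all. The permissive one tolerates two minutes of slowness and can leave a dead container in place for well over two minutes.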

When the three are aligned, the result is the configuration described above. When one is missing, the assessment identifies which one, and the path to address it.

In scope, out of scope

What self-healing does not solve.

Self-healing solves exactly one class of problem: transient failures of the individual instance, which are the vast majority of real nightly incidents. The value is bounded, but it is large in the area where it applies. It is worth saying clearly what falls outside.

Systemic application bugs. If your application always crashes on the same query, the system will keep restarting it in a loop. You have to fix the code.
External dependencies that fail. If the database is down, restarting the application does not help. Recovery depends on whatever runs the database.
Monitoring and observability. Self-healing replaces manual recovery, not the tools that tell you what is happening in your systems. The two work together.

None of these are abstract problems. Each one is a real piece of work with its own scope, methodology, and timeline. The assessment names them when they are present, and indicates which ones to address alongside or after the self-healing setup. The architecture is a foundation. It is not the whole building.

The assessment

One hour. One engineer. One written report.

Card 01 / Format

A working session, structured around your stack.

One hour with one of our engineers. Remote, scheduled within 72 hours of request. Paid engagement, billed independently of any work that may or may not follow.

Card 02 / What we look at

The recurring failure patterns in your current stack.

The incidents from the last six months that woke your team. What was restarted, by whom, at what hour. Which services fail most often. Which ones hold critical state. The current health endpoint coverage, if any.

Card 03 / What you receive

A written report, with a clear recommendation.

The recurring failure patterns identified. The self-healing configuration suited to your case. The thresholds and probes to implement. The expected effect on your mean time to recovery, with a realistic timeline.

Card 04 / Ownership

The report belongs to you.

Use it internally, share it with your CTO, share it with another vendor. There is no obligation to proceed with Graftholders. The assessment is paid; the analysis is yours to keep regardless of the next step.

The method

Why this works as a written report.

Most infrastructure proposals come from vendors selling a fixed configuration and adapting the diagnosis to fit it. Our model is the opposite: the diagnosis is the product. The intervention, if any, follows.

A written report does three things a conversation cannot. It survives the meeting. It can be circulated internally to your CTO, your Head of Platform, and your on-call team: people who weren't in the room but whose nights depend on the analysis. And it commits the recommendation to writing, which is the only form in which operational decisions can be evaluated rigorously and turned into a roadmap.

The session is paid for a reason: it filters both sides. We commit a focused hour and a written deliverable to every case, which means we only run assessments where the case actually warrants one. In a meaningful portion of them, the conclusion is that the current configuration is the right fit, and no intervention is recommended now. We say so, in writing.

The report is yours. Whether you proceed with Graftholders, with another vendor, or with no vendor at all, the analysis travels with you.

How it works

Engagement by stage.

Every engagement starts with the assessment. Everything that follows is optional, activated only when your case requires it.

Required / Step 01

Assessment

Diagnosis before configuration. Failure pattern analysis, self-healing recommendation, expected effect on MTTR mapped in a written report. The outcome is a clear recommendation, including the option of leaving the system as it is when that is the right answer.

Optional

Configuration and Calibration

Kubernetes desired state, liveness probes, and readiness probes configured on your cluster with the thresholds identified in the assessment. Health endpoints validated against the actual application state. Tested under fault injection. Fixed-fee delivery, three tiers based on complexity.

Optional

Team Enablement

One to two days. Your team learns to operate the system: how to read probe signals, how to adjust thresholds when patterns change, how to add probes to new services. On-site or remote. Per-session pricing, independent of headcount up to a defined cap.

Optional

Maintenance Pack

Prepaid days, used on demand. For clients who want ongoing coverage without a monthly retainer. Probe recalibration, post-incident reviews, periodic failure pattern analysis. Packs of 5, 10, or 20 days, twelve-month validity.

Optional

Stack details

Applications installed through GitOps. Standard CNCF and open-source components. No proprietary forks. Configured during cluster setup, declared in version control, deployed and updated through GitOps.

GitOps engine

Flux · Argo CD

Databases

PostgreSQL · MongoDB · ClickHouse · MariaDB · MySQL · Redis · Elasticsearch · CockroachDB

Messaging

NATS · RabbitMQ · Kafka

Monitoring

Grafana · OpenTelemetry

Where you stand

Four client profiles.
Three project stages.

Different starting points, the same methodology. Select where you are today.

You think Kubernetes is overkill for where you are now?

For early-stage startups that don't yet know what they need, a full Kubernetes cluster can feel out of reach. We build a simple, working prototype that lets you ship and scale without committing to infrastructure you don't need yet. This is one of our most requested services.

Situation

Early-stage product in production

Early-stage company with an MVP in production. Limited budget, no dedicated DevOps. Cloud costs already a concern.

What we build

Cluster on low-cost European cloud

Production-ready Kubernetes cluster with integrated GitOps pipeline. Optional application development on top of the cluster.

Outcome

Predictable European infrastructure

Engineering team aligned on the cluster from delivery. Maintenance available on demand if required.

Situation

Regulatory or geographic constraints

Data-residency, latency, or sector-specific compliance requirements (healthcare, finance, public sector). Hyperscaler defaults do not fit.

What we build

Bare metal cluster, constraint-driven

Kubernetes cluster on European bare metal. Architecture designed around regulatory and geographic constraints. Integrated GitOps pipeline.

Outcome

Infrastructure built for your constraints

Built for the client's constraints, not for the provider's catalog. Predictable costs.

Situation

Established team, stable workload

Significant in-house development team. Stable workload, governance requirements, long-term infrastructure ownership.

What we build

Workload-matched architecture

Kubernetes cluster on European bare metal with architecture matched to actual workload patterns. GitOps as operational methodology. Optional retainer.

Outcome

Governed by your team

Infrastructure governed by the client's team, not by an external vendor. A partner available on demand.

Situation

Multi-region complexity

Multi-region operations, varied latency requirements, differing regulatory environments, distributed engineering teams.

What we build

Multi-region, compliance integrated

Multi-region Kubernetes architecture on European bare metal. Compliance integrated by design (IEC 62443, CRA, NIS2, GDPR). Cross-region GitOps.

Outcome

Single point of contact

Compliance built in from day one, not retrofitted. One point of contact across the full delivery chain.

Your product can't run on Kubernetes as it is?

If your application is not cloud-native and can't be containerized in its current form, we rebuild it. We rewrite the parts that block the migration, keep the business logic intact, and deliver a system that runs on modern infrastructure from day one.

Different profile?

A one-hour assessment defines the case, the constraints, and the right path forward.

Request assessment

Compared

Three ways to run Kubernetes.
One that keeps the system yours.

Approach A

Do it yourself

Hyperscaler raw infrastructure

You build the cluster from scratch
Maintenance is entirely on your team
Significant in-house expertise required
Long time to production
Costs scale steeply with usage
Dependent on the cloud provider you happened to choose

Approach B

Done for you

Managed Kubernetes providers

Faster initial setup
No control over architectural decisions
No access to the repository
Lock-in by design
Costs scale steeply with traffic
You depend on the provider's chosen cloud vendor

Recommended

Approach C

Done with you

Graftholders

Hybrid model: speed of managed, control of in-house
You stay involved in every architectural decision
Full access to the repository
No vendor lock-in. The cluster is yours
We choose the cloud provider that fits your case
Standard upstream Kubernetes, no proprietary fork

Key features

Technical capabilities.

Kubernetes, desired state

You declare three active copies; the system keeps three active copies, no matter what fails in between.

Liveness probes

Periodic health checks on each container. Frozen instances are detected and recreated automatically.

Readiness probes

Traffic routes only to containers that have verified they are ready to serve. No half-initialised instances answer your users.

Calibrated thresholds

Probe timing tuned to your application's real behaviour under load. Fewer false positives, fewer missed failures.

Provider-agnostic

Works on European bare metal, European cloud, US hyperscalers, or on-premise. The mechanism is the same everywhere.

Team ownership transferred

Documentation and runbooks delivered with the configuration. Your team operates and recalibrates the system after handover.

Insights

Related reading.

Technical articles and scenario analysis on operational reliability and self-healing systems.

The phone that no longer vibrates

On the architectural pattern that absorbs the nightly incidents your team currently absorbs in person. What it solves, what it doesn't.

Read the article →

// Health endpoints

Why your /health endpoint is lying to you

The endpoint that always returns 200 is worse than no endpoint at all. How to design probes that report the truth about your service.

Read the article →

The hidden cost of the on-call rotation

What an exhausted engineer costs the company the day after. Why operational reliability is now a hiring argument, not just a technical one.

Read the article →

See all articles →

FAQ

Questions, answered directly.

We already have liveness and readiness probes configured. Is this still relevant?

Often yes. Having probes enabled and having probes that actually report the real state of the application are different things. The assessment evaluates whether your /health endpoint tells the truth, whether thresholds match real failure patterns, and whether Kubernetes behaves the way you think it does.

What if our applications are not yet stateless?

The assessment identifies which services are restart-safe today and which ones need adaptation. The work to make a service restart-safe is sometimes minor (externalising session state to a cache) and sometimes substantial. The report indicates the path and the realistic effort for each case.

Will self-healing cause restart loops if the application is broken?

Yes. If the application has a systemic bug that crashes it on the same input, the system will keep restarting it. Self-healing solves transient failures, not code defects. Backoff mechanisms and alerts on excessive restart rates are part of the configuration we deliver, so a real bug surfaces clearly instead of hiding behind constant restarts.
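For a sense of what backoff and a restart-rate alert look like, here is a minimal sketch. The 10-second base and 300-second cap mirror the kubelet's documented exponential restart backoff. The alert rule, its window, and its threshold are hypothetical values for illustration, not the delivered configuration.

```python
def restart_delays(restarts, base=10, cap=300):
    """Delay in seconds before each of the first `restarts` restarts."""
    return [min(base * 2 ** i, cap) for i in range(restarts)]


def restart_alert(restart_times, window_seconds=3600, max_restarts=3):
    """True when more than `max_restarts` restarts fall inside the window
    that ends at the most recent restart. Times are in seconds."""
    if not restart_times:
        return False
    latest = max(restart_times)
    recent = [t for t in restart_times if latest - t <= window_seconds]
    return len(recent) > max_restarts


print(restart_delays(7))
print(restart_alert([0, 10, 30, 70, 150]))  # five restarts in minutes
print(restart_alert([0, 86400]))            # two restarts a day apart
```

A crash loop quickly reaches the capped delay, and the alert turns a silent loop into a page for the one case that does need a human.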

How long does the implementation typically take?

It depends on the number of services that need adaptation, the complexity of their health logic, and whether observability of failure patterns already exists. Probe configuration alone is a few days. The calibration cycle and the adaptation of stateful services extend it. The exact timeline is in the assessment report.

What does the configuration and calibration cost?

Fixed fee based on complexity. Three tiers, set by the number of services and the maturity of existing observability. The exact figure is defined after the assessment, based on what we actually find in your infrastructure.

What happens after the configuration?

The configuration, the documentation, and the operational knowledge belong to your team. Recalibration after major releases is purchased on demand. No monthly retainer, no contractual lock-in.

Is Graftholders limited to self-healing work?

No. Managed Kubernetes infrastructure, software development, and industrial cybersecurity (IEC 62443, CRA, NIS2, GDPR) are part of the broader offering. Clients engage through the most relevant entry point.

The next night can stay a night.

One hour with one of our engineers. A written report covering the recurring failure patterns in your current stack, the self-healing configuration suited to your case, the thresholds and probes to implement, and the expected effect on your mean time to recovery. The report is yours, regardless of what you decide to do with it.

Request assessment

/ Response within 24 hours, with next steps

Graftholders

Decluttering Cybersecurity Complexity
Embracing Security