Troubleshooting Unreachable Kubernetes Pods


Summary

Troubleshooting unreachable Kubernetes pods means identifying and fixing the problems that prevent pods—the smallest deployable compute units in a container orchestration system—from communicating or working as expected. The process involves checking logs, examining pod details, and investigating network configurations to find the source of the issue.

  • Check pod logs: Start by reviewing the logs of the affected pod to look for errors or misconfigurations that might explain why it’s not reachable.
  • Review pod details: Use commands to describe the pod and view recent events, which can reveal issues like probe failures, resource shortages, or crashes.
  • Test network connections: Confirm that services and endpoints are set up properly and use debugging tools to check if pods can communicate within the cluster.

  • Deepak Agrawal

    Founder & CEO @ Infra360 | DevOps, FinOps & CloudOps Partner for FinTech, SaaS & Enterprises

    I use this simple 3-step logs flow that helps me debug almost anything in Kubernetes in under 30 minutes.

    Step 1 → kubectl logs <pod>
    Ask: “Did the app fail inside the container?” If the pod is up, this is your first stop. Look for stack traces, startup errors, and misconfigs. But if the logs show nothing (or the pod never started), move on fast.

    Step 2 → kubectl describe pod <pod>
    Ask: “Did Kubernetes kill the pod?” This one’s underrated. It shows you probe failures, CrashLoops, image pull issues, and mount errors. Basically, if K8s is mad at your pod, this will tell you why.

    Step 3 → kubectl get events --sort-by=.metadata.creationTimestamp
    Ask: “What else is breaking in the cluster?” This is your timeline. It shows broader issues: node pressure, CNI problems, preemptions. If the problem isn’t in the logs or the describe output, this one usually holds the clue.

    This is the exact flow we use inside incident war rooms.
    ➤ If the pod is running → check logs.
    ➤ If it’s crashing or pending → check describe.
    ➤ If you’re still lost → check events.

    Don’t waste 45 minutes staring at Grafana hoping something makes sense. Start with the logs. Ask better questions. Fix faster.

    I built a 1-page cheatsheet of this debugging flow. It’s part of our SRE onboarding at Infra360. Want it? Drop a “LOGS” in the comments and I’ll send it to you.
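    For reference, here is the same three-step flow condensed into one runnable sequence. A minimal sketch, not from the original post: the pod name (my-app) and namespace (prod) are placeholders, and --previous is a useful addition when the container has already restarted.

      # Step 1: did the app fail inside the container?
      kubectl logs my-app -n prod
      kubectl logs my-app -n prod --previous   # logs from the last crashed container, if any

      # Step 2: did Kubernetes kill the pod? (probe failures, OOMKilled, image pulls, mounts)
      kubectl describe pod my-app -n prod

      # Step 3: what else is breaking in the cluster?
      kubectl get events -n prod --sort-by=.metadata.creationTimestamp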

  • ☁ Richard Hooper

    Principal Cloud Architect @ Intercept | Azure Kubernetes Service (AKS) | Azure MVP | Author

    Microsoft just shipped the Container Network Insight Agent for AKS in public preview, and it is genuinely worth your attention. I caught this one via the AKS Docs Tracker I run at pixelrobots.co.uk, which watches for new and updated docs across the AKS documentation. This one stood out immediately.

    Put simply, it is an AI-powered network diagnostics assistant that runs as a pod inside your cluster. You describe the problem in plain English. It collects evidence using kubectl, cilium, and hubble, then returns a structured report with root cause analysis and the exact commands to fix the issue.

    It covers:
    • DNS failures (CoreDNS, NodeLocal DNS, Cilium FQDN egress)
    • Packet drops (NIC ring buffers, kernel softnet stats, SoftIRQ, socket buffers)
    • Kubernetes networking (network policies, service endpoint mismatches, Hubble flow analysis)

    No more bouncing between tools trying to piece together what went wrong. If you are already running ACNS, you get the full feature set, including Hubble flow analysis and Cilium policy diagnostics. If not, the DNS and packet drop coverage alone is worth a look. It might also be the nudge to finally enable ACNS properly. At around $0.025 per node per hour, the observability stack you get is hard to argue with.

    Full walkthrough covering all 8 setup steps, both the ACNS and non-ACNS install paths, and a troubleshooting reference here: https://lnkd.in/eG63jje6 #AKS #Azure #Kubernetes #CloudNative #DevOps

  • Prabhat Sharma

    Founder @ OpenObserve | Open source Observability | Helping engineering teams scale observability without the data tax | Cloud Native & Container Specialist

    The pods were OOMing, and the engineering team was adamant: "We didn't change a thing."

    This was back during my time at AWS. A customer's production was effectively halted, stuck in a restart loop. I hopped on a call with the customer's engineering and infra teams.

    The problem with these incidents is the constraint of time. You can’t learn a stranger's complex application logic in 60 minutes. It’s impossible to debug the code effectively without deep domain knowledge, and we didn't have the luxury of time. But as an architect, you don't always need to fix the code to stop the bleeding. You just need to control the physics of the infrastructure.

    I stopped trying to understand why the app was crashing and looked at how it was deployed. When I checked the manifest: no resource requests. No limits. ⚠️ The Kubernetes scheduler was flying blind. It was placing memory-hungry pods on nodes that couldn't handle the unexpected spikes, causing cascading failures across the cluster.

    I told the team: "I don't know the specific bug causing this memory pressure, and I can't fix that right now. But I can make sure the infrastructure survives it so you have time to debug."

    We implemented two changes immediately:
    1. Set hard requests and limits.
    2. Enabled the Horizontal Pod Autoscaler (HPA).

    The effect was immediate. Instead of crashing the nodes or starving neighbors, the individual pods were constrained. When load spiked, HPA spun up more replicas rather than letting a single instance bloat until it died. 🛡️ The system stabilized. The bleeding stopped.

    Did this burn more compute? Absolutely. The bill went up because we threw infrastructure at an application inefficiency. But that extra cost was the price of survival. A few days later, a box of chocolates showed up at the office, sent directly from the CEO.

    The lesson here isn't that K8s configuration is magic. It's that good architecture buys you time. Resilience isn't about writing bug-free code; it's about building a system that can survive the bugs you inevitably write. 🏗️ #Kubernetes #SRE #AWS #SystemDesign #OpenObserve
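    The post doesn't include the actual manifest changes, so here is a hedged sketch of the two mitigations using plain kubectl. The deployment name (my-app), namespace (prod), and every numeric value are illustrative, not from the incident.

      # 1. Set hard requests and limits (placeholder values; size them from observed usage)
      kubectl set resources deployment my-app -n prod \
        --requests=cpu=250m,memory=256Mi \
        --limits=cpu=500m,memory=512Mi

      # 2. Enable the HPA so spikes add replicas instead of bloating a single pod
      kubectl autoscale deployment my-app -n prod --cpu-percent=70 --min=2 --max=10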

  • Jayas Balakrishnan

    Director Solutions Architecture & Hands-On Technical/Engineering Leader | 8x AWS, KCNA, KCSA & 3x GCP Certified | Multi-Cloud

    Kubernetes Network Troubleshooting: A Systematic Approach

    Kubernetes networking issues can be complex, but following a structured methodology makes diagnosis efficient and effective. Here's my proven approach for troubleshooting connectivity problems across pods, services, and external endpoints.

    The Layer-by-Layer Methodology

    1. Pod-Level Connectivity
    Start at the foundation. Verify pod networking by checking if pods can communicate within the same node, then across nodes. This isolates whether the issue is at the container runtime level or higher up the stack.

    2. Service Discovery and Resolution
    Test DNS resolution within the cluster. Services rely on CoreDNS for name resolution, and many connectivity issues stem from DNS misconfigurations or service endpoint problems.

    3. Service-to-Pod Mapping
    Examine whether services are correctly routing traffic to healthy pod endpoints. Check service selectors, endpoint objects, and pod labels for mismatches.

    4. Network Policy Enforcement
    Review network policies that might be blocking traffic. Default-deny policies can catch teams off guard, especially in security-hardened environments.

    5. Ingress and Load Balancer Configuration
    For external traffic, verify ingress controllers and load balancer configurations are correctly routing traffic to backend services.

    6. CNI and Infrastructure Issues
    Finally, investigate Container Network Interface plugin issues, node networking problems, or underlying infrastructure connectivity.

    Key Diagnostic Questions
    • Can pods reach each other by IP address?
    • Are service endpoints populated correctly?
    • Is DNS resolution working within the cluster?
    • Are network policies allowing the required traffic?
    • Are ingress rules configured properly?
    • Is the CNI plugin functioning correctly?

    For Faster Resolution
    Work Bottom-Up: Start with basic IP connectivity before moving to higher-level abstractions like services and ingress.
    Use Temporary Debug Pods: Deploy debug containers in the same namespace to test connectivity without affecting production workloads (see the sketch after this post).
    Check Multiple Layers Simultaneously: Network issues often involve multiple components, so don't stop at the first problem you find.
    Document Your Findings: Keep track of what works and what doesn't to identify patterns across similar issues.

    This systematic approach (with a good cup of coffee) has helped me resolve network issues 3x faster than ad-hoc troubleshooting. The key is following the methodology consistently, even when you think you know where the problem lies. #AWS #awscommunity #kubernetes
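    A minimal command sketch of the first three layers, using the temporary debug pods the post recommends. The namespace (prod), service name (my-service), label selector (app=my-service), and pod IP are all placeholders.

      # Layer 1: pod-to-pod connectivity by IP (10.244.1.17 is a placeholder pod IP)
      kubectl run net-debug -n prod --rm -it --restart=Never \
        --image=busybox:1.36 -- ping -c 3 10.244.1.17

      # Layer 2: DNS resolution inside the cluster
      kubectl run dns-debug -n prod --rm -it --restart=Never \
        --image=busybox:1.36 -- nslookup my-service.prod.svc.cluster.local

      # Layer 3: service-to-pod mapping; endpoints should list healthy pod IPs
      kubectl get endpoints my-service -n prod
      kubectl get pods -n prod -l app=my-service -o wide   # labels must match the service selector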

  • Akhilesh Mishra

    Founder LivingDevops | DevOps Lead | Real-World Devops Educator | Mentor | 52k Linkedin | 22k Twitter | 12K Medium | Tech Writer | Help people get into DevOps

    Production Kubernetes cluster is down. Your manager is asking for updates every 5 minutes. Here’s your step-by-step troubleshooting playbook:

    Step 1: Get your bearings
    Check where you are: kubectl config current-context
    See all contexts: kubectl config get-contexts
    Switch if needed: kubectl config use-context name
    List namespaces: kubectl get ns

    Step 2: See the big picture
    Node health: kubectl get nodes
    All pods: kubectl get pods -A
    Recent events: kubectl get events --sort-by=.metadata.creationTimestamp -A
    This tells you whether it’s a cluster-wide issue or an isolated problem.

    Step 3: Focus on the failing pod
    Get details: kubectl describe pod podname -n namespace
    Check logs: kubectl logs podname -n namespace
    Get inside: kubectl exec -it podname -n namespace -- /bin/sh

    Step 4: Check health probes
    Look for probe failures in the describe output
    Test the probe endpoint: kubectl exec -it podname -n namespace -- curl localhost:port/health

    Step 5: Check deployments and rollouts
    Rollout status: kubectl rollout status deployment/name -n namespace
    View history: kubectl rollout history deployment/name -n namespace
    Rollback: kubectl rollout undo deployment/name -n namespace

    Step 6: Verify networking (see the endpoints sketch after this post)
    List services: kubectl get svc -n namespace
    Check endpoints: kubectl get endpoints -n namespace
    Test DNS: kubectl exec -it podname -- nslookup servicename

    Step 7: Quick fixes that work
    Restart deployment: kubectl rollout restart deployment/name -n namespace
    Delete the problematic pod: kubectl delete pod podname -n namespace

    The key is following the steps in order, not jumping around randomly.
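    One step worth expanding: in Step 6, empty endpoints usually mean the service selector doesn't match the pod labels. A hedged sketch with placeholder names (my-service, prod):

      # If the ENDPOINTS column is empty here, the service is selecting zero pods
      kubectl get endpoints my-service -n prod

      # Compare the selector the service uses...
      kubectl get svc my-service -n prod -o jsonpath='{.spec.selector}'

      # ...against the labels on the pods that should back it
      kubectl get pods -n prod --show-labels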

  • Kubernetes Debugging made easy using kubectl plugins ☸

    Plugins extend kubectl with new sub-commands, allowing for new and custom features not included in the main distribution of kubectl. Krew, a plugin manager maintained by the Kubernetes SIG CLI community, helps in adopting plugins that are life-saver debugging and troubleshooting tools. Let's explore the popular ones:

    ✅ kubectl-debug (https://lnkd.in/gj8aJXpr)
    ✍ An 'out-of-tree' solution for connecting to and troubleshooting an existing, running 'target' container in an existing pod in a #kubernetes cluster. The target container may have a shell and busybox utils and hence provide some debug capability.

    ✅ windows-debug (https://lnkd.in/g9RYPDrK)
    ✍ Launches a #Windows host-process pod with #debugging tools that gives access to the node.

    ✅ kubectl-view-allocations (https://lnkd.in/gnZ5pdxQ)
    ✍ Lists allocations for resources (cpu, memory, gpu, ...) as defined in the manifests of nodes and running pods.

    ✅ kubectl-tree (https://lnkd.in/gZJPyqT4)
    ✍ Browse #k8s object hierarchies as a tree.

    ✅ kubetap (https://lnkd.in/gWdwfzDS)
    ✍ Enables an operator to easily deploy intercepting proxies for Kubernetes Services to debug failed #networking connections.

    ✅ kstrace (https://lnkd.in/gpYrJXqT)
    ✍ Collects strace data from pods running in a #k8s cluster. Watches and reviews system calls from processes running inside pods.

    ✅ kubespy (https://lnkd.in/gZ65QuQj)
    ✍ Debug a running pod by creating a short-lived spy container, using a specified image containing all the required debugging tools, to "spy" on the target container by joining its OS namespaces.

    ✅ kubectl-sick-pods (https://lnkd.in/gQ3fPttG)
    ✍ Diagnoses running pods that are sick!

    ✅ ktop (https://lnkd.in/g_BXeBCS)
    ✍ A 'top'-like tool that displays useful metrics about nodes, pods, and other workload resources running in a Kubernetes cluster.

    ✅ kubectl-windumps (https://lnkd.in/gJDcDwv8)
    ✍ Network traffic capture analyzer for #AKS and #EKS Windows nodes.

    ✅ kubectl-graph (https://lnkd.in/gpQeTqwm)
    ✍ Visualizes Kubernetes resources and relationships!

    What's your go-to tool for #troubleshooting 🙂 #devops #sre #platform #engineering #opensource #linux #sysadmin #developer #microservices
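    Assuming Krew is already installed, trying one of these plugins is a two-liner. kubectl-tree is shown here as an example; the deployment name (my-app) and namespace (prod) are placeholders.

      kubectl krew install tree
      kubectl tree deployment my-app -n prod   # shows the ReplicaSets and pods the deployment owns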

  • Poojitha A S

    DevOps | SRE | Kubernetes | AWS | Azure | MLOps 🔗 Visit my website: poojithaas.com

    #DAY112 I’m back from vacation and will resume regular posting!

    Advanced Kubernetes Commands Every Expert Should Know: Debugging, Dry Runs, and More!

    1. Dry Run for Resource Validation
    A dry run is essential for validating YAML files and configurations before applying them to your cluster.
    kubectl apply -f mydeployment.yaml --dry-run=client
    What it does: Simulates resource creation locally and flags potential issues without making any changes.

    2. Debugging Pods on the Fly
    If a pod or container fails, kubectl debug lets you create a temporary debugging container within your pod for troubleshooting.
    kubectl debug -it mypod --image=busybox --target=containerName
    What it does: Starts a debugging session inside the pod using a different image, without interrupting the application. It's great for inspecting logs, files, or running isolated commands.

    3. Quickly Editing Resources with kubectl edit
    Rather than editing YAML files manually, kubectl edit allows you to make live changes directly from the command line.
    kubectl edit deployment mydeployment
    What it does: Opens the resource configuration in your default editor (e.g., vim or nano) for quick editing.

    4. Rolling Back Deployments
    To quickly revert to a previous version after a failed deployment:
    kubectl rollout undo deployment/mydeployment
    What it does: Rolls back to the last successful deployment, minimizing downtime.

    5. Tail Logs from a Specific Pod Container
    When debugging, it’s crucial to view logs from the container causing issues. Instead of filtering through multiple containers, target the specific one.
    kubectl logs -f mypod -c mycontainer
    What it does: Streams logs from a specific container inside the pod for easier debugging.

    6. Setting Resource Limits on the Fly
    Use kubectl set resources to adjust resource limits for a running deployment, helpful for debugging resource constraints.
    kubectl set resources deployment mydeployment --limits=cpu=500m,memory=256Mi
    What it does: Sets CPU and memory limits for your deployment to see how it performs under different resource conditions.

    7. Get Pod Events to Track Down Issues
    Events give you insights into what’s happening behind the scenes. Use kubectl get events to track issues like scheduling problems or failed probes.
    kubectl get events --field-selector involvedObject.name=mypod
    What it does: Filters events related to a specific pod, helping to identify problems.

    Pro Tip: Combine kubectl describe with kubectl get events for more thorough troubleshooting insights (see the sketch after this post).

    TL;DR: These Kubernetes commands are essential for experts who want to:
    • Simulate and validate changes
    • Debug containers quickly
    • Edit live resources
    • Roll back deployments
    • Gather logs and events for precise troubleshooting

    Master these commands to optimize your Kubernetes operations! 🌟 #Kubernetes #DevOps #Cloud #K8S #AdvancedCommands #CloudNative #Debugging #DryRun #DevOpsTools #KubernetesTips
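    A minimal sketch of that Pro Tip combination; mypod and prod are placeholder names.

      # Status, restart reasons, and recent events for one pod in a single pass
      kubectl describe pod mypod -n prod | grep -A 15 'Events:'
      kubectl get events -n prod --field-selector involvedObject.name=mypod \
        --sort-by=.metadata.creationTimestamp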

  • In a recent interview for a DevOps engineer position, I was presented with a scenario that highlighted the importance of troubleshooting skills in Kubernetes environments. The question posed was: "There are two pods on a k8s node with the same configuration; one pod is crashing, but the other is working fine. What could be the issue, and how would you troubleshoot it?"

    This scenario is a common challenge faced by DevOps professionals working with Kubernetes. Identifying the root cause of pod crashes requires a systematic approach and a deep understanding of Kubernetes architecture. Here's a brief overview of how I would approach this issue:

    Potential issues:
    • Resource allocation: The crashing pod might be exceeding resource limits.
    • Network interference: Issues with network connectivity can lead to pod failures.
    • Application bugs: Software issues within the crashing pod may cause failures.
    • Environmental factors: External dependencies or misconfigurations can impact pod stability.

    Troubleshooting steps (see the command sketch after this post):
    • Check pod logs: Reviewing pod logs can provide insights into the crash reasons.
    • Resource monitoring: Analyze resource utilization to identify any anomalies.
    • Network diagnostics: Verify network settings and connectivity for both pods.
    • Debugging tools: Utilize Kubernetes diagnostic tools for further investigation.

    Effective troubleshooting in Kubernetes requires a combination of technical expertise, attention to detail, and a methodical approach. By addressing these potential issues systematically, DevOps engineers can efficiently resolve pod crashes and ensure the stability of Kubernetes deployments.
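    A hedged command sketch of the first two troubleshooting steps for that scenario, comparing the crashing pod against its healthy twin. Pod and namespace names are placeholders, and kubectl top requires metrics-server.

      # Why did the crashing pod die last time? Exit code and reason appear under Last State
      kubectl describe pod crashing-pod -n prod | grep -A 6 'Last State'
      kubectl logs crashing-pod -n prod --previous   # logs from the crashed container instance

      # Compare live CPU/memory across the pods to spot limit-related differences
      kubectl top pod -n prod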

  • Vishakha Sadhwani

    Sr. Solutions Architect at Nvidia | Ex-Google, AWS | 150k+ Linkedin | EB1-A Recipient || Opinions, my own ||

    10 Cloud DevOps troubleshooting scenarios you can't skip (and their resolution strategies)

    1. Diagnosing High Latency in a Cloud-Native Application (Performance)
    → Check your cloud provider's monitoring dashboard or Grafana metrics
    → Analyze API Gateway latency (if it's part of your app)
    → Inspect database queries and response times
    Note: Begin with metric analysis before log investigation

    2. Kubernetes Pod in CrashLoopBackOff
    → Run kubectl logs <pod> for error messages
    → Use kubectl describe pod to check events
    → Validate environment variables, image version, and resource limits
    Note: Misconfigurations and missing dependencies are common causes (see the sketch after this post)

    3. Broken CI/CD Pipeline
    → Review pipeline logs (GitHub Actions, Jenkins, etc.)
    → Validate secrets, tokens, and environment variables
    → Check for failed dependencies or syntax errors
    Note: Testing workflows locally helps catch silent failures

    4. Publicly Exposed Storage Bucket (e.g., S3, GCS)
    → Audit bucket permissions and IAM policies
    → Block public access and review access control lists
    → Enable encryption and logging for monitoring
    Note: Always follow least-privilege access principles

    5. Terraform Apply Failure
    → Review error messages for plan/apply mismatches
    → Check state file locks, syntax errors, or version conflicts
    → Validate changes before applying
    Note: Always run terraform plan to preview updates

    6. Failed Kubernetes Deployment (e.g., EKS, AKS, or GKE)
    → Validate Helm chart values and image tags
    → Check node availability, taints, and resource limits
    → Use kubectl get events for insights
    Note: Misconfigured YAML is a frequent root cause

    7. Unexpected Cloud Cost Spike
    → Use the billing dashboard and cost explorer
    → Identify idle or over-provisioned resources (compute, volumes, load balancers)
    → Review autoscaling settings and storage tiers
    Note: Set alerts and budgets to catch anomalies early

    8. Broken Blue-Green Deployment
    → Verify routing in the load balancer or DNS
    → Check application health in the green environment
    → Ensure environment variables and secrets match
    Note: Always test green thoroughly before rerouting traffic

    There are way more real-world scenarios than what I’ve shared here (plus, I’ve hit the character limit on LinkedIn 😅) — so I’m putting together a list of Cloud DevOps troubleshooting cases I’ve come across in today’s newsletter. Subscribe here to get it in your inbox when it’s live: https://lnkd.in/dBNJPv9U

    If you found this helpful, follow me (Vishakha Sadhwani) for more Cloud & DevOps insights through my newsletter — and feel free to share it so others can learn too!
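    For the two Kubernetes scenarios above (2 and 6), a minimal command sketch with placeholder pod and namespace names:

      # Scenario 2: CrashLoopBackOff triage
      kubectl logs crashing-pod -n prod --previous
      kubectl describe pod crashing-pod -n prod

      # Scenario 6: failed deployment. Look for FailedScheduling, taints, and image errors
      kubectl get events -n prod --sort-by=.metadata.creationTimestamp
      kubectl describe nodes | grep -A 3 'Taints'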

  • Nagarjuna Reddy

    Sr SRE & Platform Engineer | AIOps •GenAI/LLM Infra | DevSecOps | AWS •Azure •Kubernetes •Terraform •GitOps | IEEE Senior Member •Sigma Xi •Forbes Tech Council | Tech Author/Researcher | High-Scale Production Systems

    This is one of the most frustrating Kubernetes moments. You check everything: Pods are running. Deployments are successful. Logs look clean. Still… users cannot access the application.

    After spending hours debugging, you realize:
    • The problem was never the app.
    • It was how you exposed it.

    Here is the clarity most people wish they had earlier:

    📦 ClusterIP:
    - Your app works perfectly inside the cluster.
    - But from outside, it simply does not exist.

    🌐 NodePort:
    - You can access it using node IP and port.
    - Works for testing, but feels messy and limited.

    ⚖️ LoadBalancer:
    - Now your app is reachable with a public IP.
    - This is what most production setups rely on.

    🚪 Ingress:
    - Not just exposure, but control.
    - Routing with domains, paths, and HTTPS, all in one place.

    The real problem: Most developers focus on making the app run. But Kubernetes requires you to also design how it is accessed. Running ≠ Reachable

    A simple way to think about it:
    Internal traffic → ClusterIP
    Quick access → NodePort
    Production access → LoadBalancer
    Controlled routing → Ingress

    Once you understand this, you stop guessing and start solving. #Kubernetes #DevOps #CloudComputing #CloudNative #Networking #SRE #PlatformEngineering
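    A minimal sketch of checking and changing how an app is exposed; my-app, prod, and the port numbers are placeholders.

      # What type is the service, and does it have an external address?
      kubectl get svc my-app -n prod   # TYPE column: ClusterIP, NodePort, or LoadBalancer

      # Expose the deployment with a public IP (the cloud provider provisions the load balancer)
      kubectl expose deployment my-app -n prod --type=LoadBalancer \
        --port=80 --target-port=8080 --name=my-app-public

      # Quick local check that the app itself answers, without changing any service
      kubectl port-forward deployment/my-app 8080:8080 -n prod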
