Site Reliability Engineer

Remote (Manhattan, NY, US) · Remote (region-locked) · Individual contributor · $145k–$200k · via jobspy_indeed

aws · gcp · azure · kubernetes · terraform · pulumi · cloudformation


About the role

**Job Type:** Full time

**Location:** 33 Irving Pl, Manhattan, New York, United States

**Post Date:** 2026\-06\-13T00:00

**Closing Date:** 2026\-12\-31T12:00

Job Available

**Job Title:** Site Reliability Engineer (SRE)

**Employment Type:** Full\-Time

**Work Model:** Remote (US\-Based) · Optional Hybrid in New York, NY

**Compensation:** $145,000 – $200,000 USD per year

**Reports To:** Co\-Founder / CEO

About LockedIn AI

LockedIn AI is a fast\-growing AI career technology company trusted by over 1 million users worldwide. We build a real\-time AI interview and meeting copilot that supports users during high\-stakes professional moments such as interviews, coding assessments, and live meetings.

Our systems operate in real time at scale, where reliability, latency, and uptime directly impact user success. This makes infrastructure stability a core part of the product experience.

Role Overview

We are hiring a proactive, systems\-minded Site Reliability Engineer to ensure that LockedIn AI’s production systems remain highly reliable, scalable, and performant.

This role sits at the intersection of infrastructure, software engineering, and operations. You will be responsible for maintaining the reliability of real\-time AI systems that serve over 1 million users globally.

You will own production stability across cloud infrastructure, Kubernetes environments, AI inference systems, APIs, and real\-time services.

Key Responsibilities

System Reliability \& Performance

- Own reliability, availability, and performance of production systems
- Define and manage SLIs, SLOs, and error budgets aligned with product goals
- Design fault\-tolerant and self\-healing system architectures
- Continuously optimize system latency, throughput, and resource usage
- Identify bottlenecks and improve system performance at scale

Infrastructure as Code \& Cloud Systems

- Design and manage cloud infrastructure across AWS, GCP, or Azure
- Implement Infrastructure as Code using Terraform, Pulumi, or CloudFormation
- Manage Kubernetes clusters supporting microservices and AI workloads
- Ensure infrastructure is versioned, reproducible, and scalable
- Optimize cloud costs while maintaining performance and reliability

Observability \& Monitoring

- Build observability systems using metrics, logs, and traces (Prometheus, Grafana, Datadog, ELK, etc.)
- Design alerting systems that reduce noise and improve incident response
- Implement distributed tracing for microservices and AI pipelines
- Monitor AI\-specific metrics including latency, GPU usage, and token throughput
- Provide real\-time visibility into system health and performance

Incident Response \& Reliability Engineering

- Lead incident response for production outages and system failures
- Participate in on\-call rotations and coordinate cross\-team resolution efforts
- Conduct blameless postmortems with actionable improvements
- Build runbooks, playbooks, and escalation procedures
- Track MTTR and reliability trends to drive continuous improvement

CI/CD \& Release Engineering

- Design and maintain CI/CD pipelines for safe and automated deployments
- Implement canary releases, blue\-green deployments, and rollback systems
- Ensure AI model deployments follow the same reliability standards as application code
- Add testing and validation gates to prevent production issues
- Improve deployment velocity while maintaining system stability

Security \& Compliance

- Implement secure infrastructure practices including IAM, encryption, and secrets management
- Maintain network segmentation and audit logging
- Ensure privacy\-first infrastructure design across all systems
- Manage vulnerability scanning, patching, and system hardening
- Collaborate with security teams to enforce DevSecOps practices

Required Qualifications

Experience

- 3\+ years in Site Reliability Engineering, DevOps, or infrastructure roles
- Experience owning production systems at scale
- Strong background in incident response and postmortem processes
- Experience in fast\-paced startup or high\-growth environments

Technical Skills

- Strong programming ability in Python, Go, or similar
- Deep experience with AWS, GCP, or Azure infrastructure
- Strong Kubernetes and Docker expertise in production environments
- Experience with Infrastructure as Code (Terraform, Pulumi, CloudFormation)
- Hands\-on experience with CI/CD systems (GitHub Actions, ArgoCD, Jenkins, etc.)
- Strong knowledge of observability tools (Prometheus, Grafana, Datadog, ELK)
- Understanding of distributed systems and system design principles

Soft Skills

- Strong reliability\-first mindset with focus on failure scenarios
- Calm and structured thinking during production incidents
- Clear communication across technical and non\-technical teams
- Strong documentation and runbook writing ability
- High ownership and autonomy in system management

Preferred Qualifications

- Experience with AI/ML infrastructure or GPU\-based workloads
- Background in real\-time systems or low\-latency architectures
- Experience with chaos engineering practices
- Knowledge of AI observability (model latency, drift, throughput metrics)
- Multi\-cloud or hybrid cloud experience
- Contributions to open\-source infrastructure or SRE tooling
- Early\-stage startup experience (Seed to Series A)

What We Offer

**Equity:** Meaningful early\-stage equity and ownership in the company’s growth

**Impact:** Direct responsibility for systems used by over 1 million users worldwide

**Team:** Lean, high\-ownership engineering culture with strong autonomy

**Flexibility:** Remote\-first setup with optional collaboration in New York City

**Growth:** High\-speed learning environment with deep technical ownership

**Culture:** User\-first, fast execution, and engineering excellence focused on real\-world impact

Why Join LockedIn AI

- Work on a category\-defining real\-time AI copilot platform
- Own reliability for systems critical to users during live interviews
- Solve complex distributed systems and AI infrastructure challenges
- Operate at scale with high\-performance, latency\-sensitive systems
- Join an AI\-native company building the future of career tools

How to Apply

**Please submit:**

- Resume or CV
- **Short note including:**
  - Why you want to join LockedIn AI
  - Whether you have used the product
  - Suggested improvements for reliability or performance
- **Optional:** GitHub, SRE projects, or technical writing samples

Equal Opportunity

LockedIn AI is committed to building a diverse and inclusive workplace. All hiring decisions are based on merit, skills, and business needs.
