Site Reliability Engineer

Knotch
Job Location: US Remote
Job Tag: Engineering

Description:

About Knotch

As the global leader in Content Intelligence, Knotch’s mission is to empower brands to unlock the true value of their content by using data-driven strategies. With the Knotch Content Intelligence Platform, companies conduct competitive research and measure the performance of their content in real time. Through our unique ability to provide a 360-degree view of all their content, including paid and owned, Knotch allows companies to connect content to business outcomes to strengthen their brand, increase ROI, and build audiences.

Responsibilities:

The Site Reliability Engineer

As Knotch’s Site Reliability Engineer, you will play a critical role in maturing the reliability of our platform and infrastructure as well as our information security posture. You will minimize the risk of failures in durability, availability, performance, and correctness. You will also ensure we have enough capacity to scale the Knotch infrastructure and help our Engineering team mitigate existing and potential performance and scalability risks. Most importantly, you’ll ensure that our efforts are measured and that we tackle the most impactful improvements first to meet our internal and external SLAs.

Qualifications:

You will add tremendous value at Knotch if you have:

  • Experience implementing advanced monitoring for platforms and applications.
  • Proficiency in infrastructure as code (Terraform or equivalent alternative) in the cloud (preferably AWS).
  • The ability to work closely with the leadership of various teams to assess existing and potential problems and to design solutions for both process and automation/tooling.
  • In-depth knowledge and experience with containerization (Docker) and scaling with Kubernetes (preferably EKS).
  • The ability to integrate into an existing process and effectively contribute to its improvement.

You will be successful here if you:

  • Have 2-3 years of experience on an SRE team and awareness of SRE workflows and processes.
  • Have 3-5 years of on-call experience on a team that owns infrastructure and application performance/reliability, as well as experience contributing to and maintaining infrastructure tools/services.
  • Are a problem forecaster rather than simply a problem solver.
  • Are a growth-oriented, self-driven, independent thinker and a collaborative team player.
  • Have a deep understanding of networking protocols (network security experience a plus) and a solid understanding of security best practices.
  • Have impeccable communication and collaboration skills.
  • Have experience applying an SRE framework and establishing, maintaining, and improving KPIs to meet criteria set by SLAs.
  • Have strong programming experience in Python (Golang acceptable).
  • Have experience with databases, data streaming, messaging systems, and data processing/transformation systems.
  • Are proficient in diagnosing technical problems, debugging code and automating remediation.

We also appreciate (but don’t require):

  • Experience with data pipelines and data warehouse/lake operations, design and optimization.
  • Proficiency in designing, contributing to, and maintaining self-service tools/services that support the Engineering organization in day-to-day operations and infrastructure orchestration.
  • Experience writing, maintaining, and supporting microservices, and an understanding of common design patterns and orchestration of microservices within an internal platform.

Salary Range: $135,000 – $150,000