Senior Site Reliability Engineer

Job description

What makes Ably special?

Ably helps power next-generation digital experiences through the only truly distributed, global, cloud-based, WebSocket and protocol-agnostic messaging platform. Read a recent blog post on the distributed systems problems we think about and work on each day.


What we can offer you

Working at Ably means you are working on a cutting-edge, distributed, internet-scale platform that spans 20+ data centres, will soon support multiple clouds, and delivers potentially trillions of messages for developers. You will learn with the best. You will have autonomy and freedom to experiment and improve. You will be part of a dynamic team and a business that is growing rapidly.

 

Job description

If you don't know what a Site Reliability Engineer is, we recommend you first read Google's definition of a Site Reliability Engineer, which we agree with.


As a Senior Engineer in our Site Reliability Engineering team, you'll build solutions to enhance the availability, performance and stability of the Ably platform, develop new network services and automate away repetitive work. You'll also respond to pings, pages and alerts, investigating issues in our products that you can really sink your teeth into. You'll be working on non-production and production environments, monitoring, data collection and configuration management, as well as disaster recovery planning, capacity engineering, reliability improvement initiatives and platform automation. The team needs someone who can ask questions, learn from others and turn chaos into order.


This role would be a great fit for someone with creative and innovative problem-solving skills and a willingness to take responsibility for the code they write, all the way to production. You will develop and implement solutions that operate at scale, seeing your own technology efforts directly improve the reliability of our products. Our teams are empowered and expected to improve our products to truly deliver a reliable experience to customers.


If you're excited by working on truly complex problems at internet-scale with smart engineers, you'll enjoy working at Ably.

 

Our infrastructure stack currently consists mostly of:

  • Infrastructure languages: Ruby, Go, Bash.
  • Service languages: Go, Elixir, Node.js and some C.
  • Mostly AWS based, but we are experimenting with supporting other clouds.
  • Architecture: exclusively Docker containers for all services; servers are immutable, ephemeral and disposed of frequently; code is packaged as slugs; data centres (circa 20) are isolated and autonomous; critical shared services always have redundancy baked in; manual configuration of any infrastructure is a smell.
  • Data services: Cassandra (our realtime datastore, 3 regions, 6 data centres), Influx, Elastic, Kibana, Grafana, etc.
  • Web site: We use Rails & Heroku for simplicity. The web service is not part of our "core product" and thus has lower uptime requirements.

See https://goo.gl/cDUirr and https://goo.gl/XDpmBi for a taste of the lengths we go to at each layer in the stack to ensure 100% service uptime.

 

Day to day you can expect to be working on:

  • Writing Ruby code for the automation, orchestration, configuration and continuous integration testing of our infrastructure.
  • Writing Go code for our core routing, workers and infrastructure services.
  • Making extensive use of a wide range of AWS services. Whilst we primarily use AWS for our infrastructure, in time we expect that to change as we span other cloud services.
  • Managing and developing our continuous integration services that test every aspect of the service, from infrastructure tools to our health servers, routers, realtime services, protocol adaptors and client libraries. Our CI environment is mature, but we want to keep evolving it to improve the robustness of the platform and reduce the risk of regressions.
  • Being exposed to our other development environments such as Node.js and Elixir, both used extensively in our realtime services.
  • Working with the realtime engineering team to ensure our infrastructure supports their ever-changing networking, security and processing requirements.
  • Collaborating with the team to design, discuss and implement new features and services.
  • Diagnosing and fixing bugs in all areas of our platform. You will often be working at very low levels in the network stack to help diagnose hard-to-identify distributed problems.
  • Working with the engineering team to enable them to take responsibility for the complete lifecycle of the features and code they deliver, i.e. pull requests, reviews, testing, deployment to staging and sandbox environments, then into production environments. We are strong believers in all developers being responsible for deploying their own code.
  • Contributing to open source projects that we support or use in our products.  All of our client libraries are open source as well and may require your support at times.
  • Helping customers solve problems they are experiencing that may help us find bugs in the platform.
  • Supporting the wider team with documentation and customer support.
  • Suggesting new features or improvements to our protocol and API specifications.

 

Benefits

  • Salary range: €40k to €90k.
  • Employee options: Yes, negotiable.
  • Holidays: 25+ days excluding national holidays.
  • This role can be remote or on-site in our London office. However, if you are working remotely, you will need to be in a European timezone so that we can communicate effectively during business hours, and you will need to be close enough to visit our office in London occasionally.  Our preference is to have a team member near enough to commute to our London office when necessary. You will benefit from a flexible working environment in which working from home and managing your own working hours sensibly is the norm. 
  • Work in an environment where code quality, technical challenges and delivery are what we all care about.
  • Skills development is intrinsic to the job. We're largely working on unsolved problems each day and, as such, there is plenty of scope to widen your knowledge and skillset.
  • Work with genuinely nice and smart people who care about code quality and enjoying their jobs.


**** NO AGENCIES PLEASE ****

Requirements

  • Experience: A minimum of two years of professional experience with Go, which is used in all our routing and infrastructure services. Our infrastructure automation and orchestration layer requires you to be very proficient in Ruby too. You should have experience using both statically and dynamically typed languages. Experience with Node.js and Elixir/Erlang would be nice, but is not necessary. You must have solid experience managing infrastructure and CI environments, and experience managing distributed or large-scale infrastructure is preferred. An understanding of distributed systems is beneficial.
  • Pragmatic: A problem solver excited by the prospect of automating your job away and working autonomously to solve problems and bring solutions to the team.
  • Fast Learner: We're looking for software engineers who thrive on applying their knowledge and learning new technologies. Our stack is diverse, and we expect it to continue to grow.
  • Testing: Experience using testing frameworks and adopting test-driven development where applicable.
  • Communication: We use tools such as Slack throughout the day to communicate; however, we believe in voice conversations to discuss and solve problems. You must be proficient in spoken and written English, be eager to collaborate with the engineering team and constructively welcome code reviews.
  • Customers: Comfortable talking to customers and assisting them with their technical issues and integration.
  • Open source: We prefer developers who have contributed back to the open source community, even if those contributions are small. 

