Microservices

Scaling Uber

  • Background
    • 200 to 2000 engineers in a year
    • Hundreds of services in production, number changing all the time
  • General point: everything is a tradeoff. Try to make tradeoffs intentionally, not by accident
  • Upsides
    • Allows teams to release independently
    • Teams own their own uptime
    • Lets you pick the best tool for the job
  • Downsides
    • Managing a distributed system
    • Everything is an RPC
      • If using HTTP, need standards for how it’s used, or else a lot of incidental complexity in understanding how another team’s API works. Servers are not browsers: ideally calling another service should feel like calling a function
      • Calls are slow
      • Static types would be nice
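The "RPC should feel like a function call" point can be sketched with typed request/response objects and one shared encoding convention. The dataclass names and the JSON-over-HTTP convention here are illustrative assumptions, not from the talk:

```python
from dataclasses import dataclass, asdict
import json

# Hypothetical typed request/response pair for one RPC method.
@dataclass
class GetRiderRequest:
    rider_id: str

@dataclass
class GetRiderResponse:
    rider_id: str
    name: str

def encode(msg) -> bytes:
    """One org-wide convention: every RPC body is a flat JSON object."""
    return json.dumps(asdict(msg)).encode()

def decode(cls, raw: bytes):
    """Decoding into a dataclass fails loudly on unknown or missing
    fields, a cheap stand-in for static types across the wire."""
    return cls(**json.loads(raw))
```

With one convention like this, callers never have to guess verbs, paths, or payload shapes per service.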
    • What if it breaks?
      • How do you make sure the right people are paged for the root of the failure?
    • If you own your uptime, you can block other teams if they need a fix in your service
      • Can they release your service?
    • Temptation to build around problems instead
      • Specifically, political/organizational problems
    • You get to keep your biases instead of learning
    • Using separate languages means
      • Hard to share code
      • Hard to contribute fixes to other codebases
      • Hard to move between teams
      • Fragments the org culture into tribes
    • More difficult to understand a service in the larger context; you can’t see everything in one place
      • Teams set up separate dashboards for status
      • Tracking performance
        • Each programming language has its own different tools
        • Work needed to get them into a consistent format (is that OSS now?)
          • Fanout: the latency of the slowest call sets the speed of the whole request (services calling many other services)
            • A service can be fast on its own, but if another service needs to make many individual calls to it, the slowest of those calls sets the overall speed
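The fanout effect above can be sketched with a little probability: if each backend call independently has a small chance p of being slow, a request that fans out into n calls is slow whenever any one of them is. The numbers are made up for illustration:

```python
def prob_request_is_slow(p: float, n: int) -> float:
    """P(at least one of n independent backend calls is slow)."""
    return 1 - (1 - p) ** n

# Even a 1% tail per call dominates at high fanout:
# prob_request_is_slow(0.01, 1)   -> 0.01
# prob_request_is_slow(0.01, 100) -> ~0.63
```

This is why tail latency, not average latency, is what matters in a deep service graph.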
    • Have to be able to trace through all services
      • There are tools
      • Cross-language context propagation. At least pass a common ID and log it. Ideally other fields are passed along transparently
      • The overhead slows things down, so do sampling of a portion of the requests
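A minimal sketch of cross-language context propagation with sampling, assuming made-up header names and a 1% sample rate (not Uber's actual scheme):

```python
import random
import uuid

# Illustrative header names and sample rate.
TRACE_HEADER = "x-trace-id"
SAMPLED_HEADER = "x-trace-sampled"
SAMPLE_RATE = 0.01  # trace only 1% of requests to limit overhead

def incoming(headers: dict) -> dict:
    """Adopt the caller's trace context, or start a new trace at the edge."""
    if TRACE_HEADER in headers:
        return {"trace_id": headers[TRACE_HEADER],
                "sampled": headers.get(SAMPLED_HEADER) == "1"}
    return {"trace_id": uuid.uuid4().hex,
            "sampled": random.random() < SAMPLE_RATE}

def outgoing(ctx: dict) -> dict:
    """Always pass the ID so logs can be joined even for unsampled requests."""
    return {TRACE_HEADER: ctx["trace_id"],
            SAMPLED_HEADER: "1" if ctx["sampled"] else "0"}
```

The key property is that the sampling decision is made once at the edge and then carried along, so every hop in a sampled request traces, and none of an unsampled one does.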
    • Need consistent logging
      • Need to give them tools to do it consistently
      • Logging floods can amplify problems
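A sketch of what "tools for consistent logging" might look like: one shared helper that emits a fixed set of JSON fields, plus a crude per-second cap so a logging flood can't amplify the incident it's describing. Field names and the limiter design are illustrative:

```python
import json
import time

def log_line(service: str, level: str, msg: str,
             trace_id: str = "", **fields) -> str:
    """Shared helper so every service emits the same field names."""
    record = {"ts": time.time(), "service": service, "level": level,
              "msg": msg, "trace_id": trace_id, **fields}
    return json.dumps(record, sort_keys=True)

class LogLimiter:
    """Per-second cap: drop excess lines instead of flooding downstream."""
    def __init__(self, max_per_sec: int):
        self.max = max_per_sec
        self.window = -1
        self.count = 0

    def allow(self, now: float) -> bool:
        sec = int(now)
        if sec != self.window:
            self.window, self.count = sec, 0
        self.count += 1
        return self.count <= self.max
```

Consistent field names are what make logs from dozens of services joinable by trace ID later.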
    • Load testing
      • Have to test against production, without breaking metrics, all the time
      • Have to have context that tells services that it’s a test request
      • Keep systems near their peaks all the time
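The test-request context above can be sketched as a marker header: synthetic load-test traffic is served normally, but kept out of business metrics. The header name is made up:

```python
# Illustrative marker header for synthetic load-test traffic.
TEST_HEADER = "x-load-test"

def is_test_request(headers: dict) -> bool:
    return headers.get(TEST_HEADER) == "1"

class Metrics:
    """Serve test traffic normally, but keep it out of business metrics."""
    def __init__(self):
        self.real = 0
        self.test = 0

    def record(self, headers: dict) -> None:
        if is_test_request(headers):
            self.test += 1
        else:
            self.real += 1
```

Like the trace ID, this flag has to propagate on every downstream call, or a test request would pollute the metrics of every service below the first hop.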
    • Have to do failure testing (chaos monkey etc)
    • Migrations: old stuff still has to work
      • Things can never be fully immutable: at some point you’ll have to make a cross-cutting change. A security patch if nothing else
      • Mandates are bad. Use carrots, not sticks. Unless it’s for security or compliance
    • Services allow people to put their own interests or team’s above the company’s
  • Considerations
    • Have to decide how many repos: one vs many
    • Performance doesn’t matter until it does
    • The build/buy tradeoff is hard: you built it but now it’s a commodity