Microservices

Scaling Uber

  • Background
    • 200 to 2000 engineers in a year
    • Hundreds of services in production, number changing all the time
  • General point: everything is a tradeoff. Try to make tradeoffs intentionally, not by accident
  • Upsides
    • Allows teams to release independently
    • Teams own their own uptime
    • Lets you pick the best tool for the job
  • Downsides
    • Managing a distributed system
    • Everything is an RPC
      • If using HTTP, need standards for how it’s used, or else a lot of incidental complexity in understanding how another team’s API works. Servers are not browsers: ideally calling another service should feel like calling a function
      • Calls are slow
      • Static types would be nice
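The "RPC should feel like a function call" point can be sketched with typed request/response objects and one shared encoding convention. The dataclass names and the JSON-over-HTTP convention here are illustrative assumptions, not from the talk:

```python
from dataclasses import dataclass, asdict
import json

# Hypothetical typed request/response pair for one RPC method.
@dataclass
class GetRiderRequest:
    rider_id: str

@dataclass
class GetRiderResponse:
    rider_id: str
    name: str

def encode(msg) -> bytes:
    """One org-wide convention: every RPC body is a flat JSON object."""
    return json.dumps(asdict(msg)).encode()

def decode(cls, raw: bytes):
    """Decoding into a dataclass fails loudly on unknown or missing
    fields, a cheap stand-in for static types across the wire."""
    return cls(**json.loads(raw))
```

With one convention like this, callers never have to guess verbs, paths, or payload shapes per service.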
    • What if it breaks?
      • How do you make sure the right people are paged for the root of the failure?
    • If you own your uptime, you can block other teams if they need a fix in your service
      • Can they release your service?
    • Temptation to build around problems instead
      • Specifically, political/organizational problems
    • You get to keep your biases instead of learning
    • Using separate languages means
      • Hard to share code
      • Hard to contribute fixes to other codebases
      • Hard to move between teams
      • Fragments the org culture into tribes
    • More difficult to understand a service in the larger context; you can’t see everything in one place
      • Teams set up separate dashboards for status
      • Tracking performance
        • Each programming language has its own different tools
        • Work needed to get them into a consistent format (is that OSS now?)
          • Fanout: the latency of the slowest call sets the speed of the whole request (services calling many other services)
            • A service can be fast on its own, but if another service needs to make many individual calls to it, the slowest of those calls sets the overall speed
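The fanout effect above can be sketched with a little probability: if each backend call independently has a small chance p of being slow, a request that fans out into n calls is slow whenever any one of them is. The numbers are made up for illustration:

```python
def prob_request_is_slow(p: float, n: int) -> float:
    """P(at least one of n independent backend calls is slow)."""
    return 1 - (1 - p) ** n

# Even a 1% tail per call dominates at high fanout:
# prob_request_is_slow(0.01, 1)   -> 0.01
# prob_request_is_slow(0.01, 100) -> ~0.63
```

This is why tail latency, not average latency, is what matters in a deep service graph.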
    • Have to be able to trace through all services
      • There are tools
      • Cross-language context propagation. At least pass a common ID and log it. Ideally other fields are passed along transparently
      • The overhead slows things down, so do sampling of a portion of the requests
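A minimal sketch of cross-language context propagation with sampling, assuming made-up header names and a 1% sample rate (not Uber's actual scheme):

```python
import random
import uuid

# Illustrative header names and sample rate.
TRACE_HEADER = "x-trace-id"
SAMPLED_HEADER = "x-trace-sampled"
SAMPLE_RATE = 0.01  # trace only 1% of requests to limit overhead

def incoming(headers: dict) -> dict:
    """Adopt the caller's trace context, or start a new trace at the edge."""
    if TRACE_HEADER in headers:
        return {"trace_id": headers[TRACE_HEADER],
                "sampled": headers.get(SAMPLED_HEADER) == "1"}
    return {"trace_id": uuid.uuid4().hex,
            "sampled": random.random() < SAMPLE_RATE}

def outgoing(ctx: dict) -> dict:
    """Always pass the ID so logs can be joined even for unsampled requests."""
    return {TRACE_HEADER: ctx["trace_id"],
            SAMPLED_HEADER: "1" if ctx["sampled"] else "0"}
```

The key property is that the sampling decision is made once at the edge and then carried along, so every hop in a sampled request traces, and none of an unsampled one does.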
    • Need consistent logging
      • Need to give them tools to do it consistently
      • Logging floods can amplify problems
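A sketch of what "tools for consistent logging" might look like: one shared helper that emits a fixed set of JSON fields, plus a crude per-second cap so a logging flood can't amplify the incident it's describing. Field names and the limiter design are illustrative:

```python
import json
import time

def log_line(service: str, level: str, msg: str,
             trace_id: str = "", **fields) -> str:
    """Shared helper so every service emits the same field names."""
    record = {"ts": time.time(), "service": service, "level": level,
              "msg": msg, "trace_id": trace_id, **fields}
    return json.dumps(record, sort_keys=True)

class LogLimiter:
    """Per-second cap: drop excess lines instead of flooding downstream."""
    def __init__(self, max_per_sec: int):
        self.max = max_per_sec
        self.window = -1
        self.count = 0

    def allow(self, now: float) -> bool:
        sec = int(now)
        if sec != self.window:
            self.window, self.count = sec, 0
        self.count += 1
        return self.count <= self.max
```

Consistent field names are what make logs from dozens of services joinable by trace ID later.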
    • Load testing
      • Have to test against production, without breaking metrics, all the time
      • Have to have context that tells services that it’s a test request
      • Keep systems near their peaks all the time
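The test-request context above can be sketched as a marker header: synthetic load-test traffic is served normally, but kept out of business metrics. The header name is made up:

```python
# Illustrative marker header for synthetic load-test traffic.
TEST_HEADER = "x-load-test"

def is_test_request(headers: dict) -> bool:
    return headers.get(TEST_HEADER) == "1"

class Metrics:
    """Serve test traffic normally, but keep it out of business metrics."""
    def __init__(self):
        self.real = 0
        self.test = 0

    def record(self, headers: dict) -> None:
        if is_test_request(headers):
            self.test += 1
        else:
            self.real += 1
```

Like the trace ID, this flag has to propagate on every downstream call, or a test request would pollute the metrics of every service below the first hop.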
    • Have to do failure testing (chaos monkey etc)
    • Migrations: old stuff still has to work
      • Things can never be fully immutable: at some point you’ll have to make a cross-cutting change. A security patch if nothing else
      • Mandates are bad. Use carrots, not sticks. Unless it’s for security or compliance
    • Services allow people to put their own interests or team’s above the company’s
  • Considerations
    • Have to decide how many repos: one vs many
    • Performance doesn’t matter until it does
    • The build/buy tradeoff is hard: you built it but now it’s a commodity