Friday, September 24, 2021

Lessons learned from running GraphQL at scale

Finding Out What’s Going Wrong Under The Hood? 🧐

To find out what was causing the performance issues in GraphQL, we used the following approaches.

Profiling

We started by profiling one of the production servers during peak traffic using the PM2 Enterprise profiling tool. It lets you download CPU and memory profiles for a particular process, which can then be analysed with tools like the Chrome DevTools Performance panel, VS Code, etc.

CPU Profile

On analysing the CPU profile, we found that:

  • On average, 20% of total CPU time was being spent on garbage collection.
  • GraphQL's validate operation was taking 3% of CPU time.
  • Some of the resolver code was long-running.
CPU profile of GraphQL server during load test

GC profile

The CPU profile showed that GC was contributing a major share of CPU time, so we wanted to analyse GC further. Using GC-specific tooling, we found that scavenge GC was happening too frequently.
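As a rough illustration of the kind of signal we were looking at (not the exact tooling we used), Node's built-in perf_hooks module can count scavenge vs mark-sweep collections in-process:

```javascript
// Minimal sketch (not the exact tooling we used): counting scavenge (minor)
// vs mark-sweep (major) collections with Node's built-in perf_hooks.
const { PerformanceObserver, constants } = require('perf_hooks');

let minor = 0;
let major = 0;

const obs = new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    // Node >= 16 exposes the GC kind on entry.detail; older versions on entry.kind.
    const kind = entry.detail ? entry.detail.kind : entry.kind;
    if (kind === constants.NODE_PERFORMANCE_GC_MINOR) minor += 1;
    if (kind === constants.NODE_PERFORMANCE_GC_MAJOR) major += 1;
  }
});
obs.observe({ entryTypes: ['gc'] });

// Print a summary every 10 seconds while the load test runs.
setInterval(() => {
  console.log(`scavenge (minor): ${minor}, mark-sweep (major): ${major}`);
}, 10000);
```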

Inline Caching Heuristic

The V8 engine performs a lot of runtime optimisation based on certain heuristics. Inline caching is a crucial element in making JavaScript run fast.

We used Deoptigate to identify hot code that was not being optimised by the V8 engine, and found that a few of the util functions were megamorphic and hence were not being optimised.

Analyse APM (Datadog) data

dd-trace (the Datadog library for Node.js) automatically instruments the GraphQL module and provides per-query APM data. Based on this, we were able to spot frequently invoked queries.

In order to gauge the complexity of a query, we track the time taken by graphql’s execute operation and the average number of HTTP calls made inside a query.

Datadog UI: query-wise APM data
Internal Metric of a Query

Micro benchmarking

We benchmarked each widely used library function and pattern against its alternatives using Benchmark.js. Some of the comparisons are described in the next section.
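For reference, each comparison followed roughly the same Benchmark.js shape; here is an illustrative harness (the R.map vs native map pairing is only an example of "library function vs alternative"):

```javascript
// Illustrative Benchmark.js harness in the shape we used for each comparison.
// The R.map vs native Array#map pairing is just an example pairing.
const Benchmark = require('benchmark');
const R = require('ramda');

const xs = Array.from({ length: 1000 }, (_, i) => i);
const double = (x) => x * 2;

new Benchmark.Suite()
  .add('R.map', () => R.map(double, xs))
  .add('Array.prototype.map', () => xs.map(double))
  .on('cycle', (event) => console.log(String(event.target)))
  .on('complete', function () {
    console.log('Fastest is ' + this.filter('fastest').map('name'));
  })
  .run();
```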

Learnings

  • With dataloader and multi-level caching in place, the network was already very optimised.
  • Based on our load test results we could see that CPU was the bottleneck for us.
  • There were no low-hanging fruits.

Approaches for optimising performance

With all the profiling and benchmarking tools in hand, we started making small changes and measuring each one for performance gains.

1. GC Tuning

In our GC profile analysis, we found that over 3 minutes of load testing:

  • the server spent 30.6 seconds in GC.
  • during this period, 556 scavenge collections were performed.

Based on this, we could see that the GC problem was concentrated in the scavenge collection stage, i.e. the collection of the young generation. The GraphQL server was generating a large number of small, short-lived objects in young generation space, which kept triggering scavenge collections. So the problem boiled down to optimising young generation collection to improve application performance.

In Node.js, the size of the young generation space is controlled by the flag `--max-semi-space-size`, which defaults to 16MB. We tried raising it to 128MB, 256MB, and 512MB, load tested each value, and found that the system performed best at 256MB. After deploying this optimisation to production, CPU utilisation went down by 12%.

Load test results at different max-semi-space-size
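For reference, a minimal sketch of how the flag can be applied: either directly (`node --max-semi-space-size=256 server.js`) or through a process manager config such as PM2's node_args. The app name and script path below are placeholders.

```javascript
// ecosystem.config.js -- illustrative PM2 config; app name and script path
// are placeholders. Only the node_args flag is the point here.
module.exports = {
  apps: [
    {
      name: 'graphql-server',
      script: './server.js',
      instances: 'max',        // one worker per CPU core
      exec_mode: 'cluster',
      // Raise the young-generation semi-space from the 16MB default to 256MB
      // so that short-lived objects trigger scavenge GC far less often.
      node_args: '--max-semi-space-size=256',
    },
  ],
};
```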

2. Fastify Web Server

Fastify is a fast, low-overhead web framework for Node.js. In public benchmarks, it promises to be 5 times faster than the Express framework. Since all the plugins/middleware we used were portable, we experimented with Fastify. In production, we got a 10% reduction in CPU utilisation.
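As a rough sketch (not our exact production setup), the swap boils down to serving the GraphQL endpoint from a Fastify route, with graphql-js doing the rest; the schema module and port below are assumptions.

```javascript
// Rough sketch of a GraphQL endpoint on Fastify (not our exact production setup).
// Assumes graphql-js v15+ and a module that exports a GraphQLSchema.
const Fastify = require('fastify');
const { graphql } = require('graphql');
const schema = require('./schema'); // hypothetical schema module

const app = Fastify();

app.post('/graphql', async (request, reply) => {
  // Fastify parses application/json bodies out of the box.
  const { query, variables } = request.body;
  const result = await graphql({
    schema,
    source: query,
    variableValues: variables,
  });
  reply.send(result);
});

// Fastify v4 signature; on v3 this would be app.listen(3000).
app.listen({ port: 3000 }).catch((err) => {
  console.error(err);
  process.exit(1);
});
```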

3. R.curry vs Uncurried

To improve the composability of our code, we had defined a ton of curried functions using Ramda. But every abstraction has a cost. Profiling showed that the `curry` function was taking up a lot of CPU time.

Two versions of the add function
Benchmark results: add vs addCurried
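The comparison was roughly of this shape (an illustrative sketch, not the exact production code):

```javascript
// Illustrative sketch: a plain add vs a Ramda-curried one, compared with Benchmark.js.
const Benchmark = require('benchmark');
const R = require('ramda');

const add = (a, b) => a + b;
const addCurried = R.curry((a, b) => a + b);

new Benchmark.Suite()
  .add('add', () => add(1, 2))
  .add('addCurried', () => addCurried(1)(2))
  .on('cycle', (event) => console.log(String(event.target)))
  .run();
```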

Benchmarks showed that removing curry from the code made it up to 100 times faster. Currying is at the core of Ramda; almost every Ramda function is curried. From this, we concluded that Ramda had become a performance hog for us.

4. Immutability vs Mutability

Two ways of adding a key to an object
Benchmark mutable vs immutable
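The comparison was along these lines (illustrative sketch): adding a key by spreading into a fresh object versus assigning onto the existing one.

```javascript
// Illustrative sketch: two ways of adding a key to an object.
const Benchmark = require('benchmark');

const base = { id: 1, title: 'match' };

new Benchmark.Suite()
  // Immutable: allocates a brand-new object on every call; these short-lived
  // objects land in the young generation and feed the scavenger.
  .add('immutable (spread)', () => ({ ...base, status: 'LIVE' }))
  // Mutable: writes the key onto the existing object, no new allocation.
  .add('mutable (assignment)', () => {
    base.status = 'LIVE';
    return base;
  })
  .on('cycle', (event) => console.log(String(event.target)))
  .run();
```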

We had written immutable code everywhere. Looking at this benchmark, we decided to remove immutability from the hot code paths.

5. Monomorphic vs Polymorphic vs Megamorphic Functions

The V8 engine optimises monomorphic and polymorphic functions at runtime, making them much faster than megamorphic functions. We converted a few frequently invoked megamorphic functions into multiple monomorphic functions and observed a performance gain.

Megamorphic to monomorphic conversion
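As an illustration of the kind of conversion (the actual util functions differ): a generic helper that reads a property off many differently shaped objects keeps its access site megamorphic, while one small accessor per shape stays monomorphic.

```javascript
// Illustrative sketch of the conversion; the real util functions differ.

// Megamorphic ("before" version, shown for comparison): this single property
// access sees many object shapes (player, team, contest, ...), so V8's inline
// cache eventually gives up on it.
const getDisplayName = (entity) => entity.name;

// Monomorphic: one small accessor per shape, so each access site only ever
// sees objects with the same hidden class and stays optimised.
const getPlayerName = (player) => player.name;
const getTeamName = (team) => team.name;

// Call sites pick the shape-specific accessor instead of the generic one.
console.log(getPlayerName({ name: 'R. Sharma', role: 'BAT', credits: 10.5 }));
console.log(getTeamName({ name: 'MI', shortCode: 'MI', squadSize: 25 }));
```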

6. Lazy Evaluation

In GraphQL, a resolver can either be an expression evaluating to a value or a function that returns a value (or a promise of a value). Resolver expressions are evaluated eagerly, irrespective of which fields are requested in a query.

GraphQL schema
Eager resolver
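A sketch of what the schema and the eager version look like; the Match type, matchService and the player shape are assumptions for illustration, while groupPlayerByType and the id/title/groupedPlayers fields follow the example discussed below.

```javascript
// Sketch only: the Match type, matchService and player shape are assumptions.
const typeDefs = `
  type Query {
    match(id: ID!): Match
  }

  type Match {
    id: ID!
    title: String!
    groupedPlayers: GroupedPlayers  # type definition omitted for brevity
  }
`;

// Hypothetical helper that buckets a match's players by their role.
const groupPlayerByType = (players) =>
  players.reduce((groups, player) => {
    (groups[player.role] = groups[player.role] || []).push(player);
    return groups;
  }, {});

const resolvers = {
  Query: {
    match: async (_, { id }, { matchService }) => {
      const match = await matchService.fetch(id);
      return {
        id: match.id,
        title: match.title,
        // Eager: this runs on every request, whether or not the client
        // asked for groupedPlayers.
        groupedPlayers: groupPlayerByType(match.players),
      };
    },
  },
};
```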

In the above example, even if the client queries only id and title, groupPlayerByType will still be executed. To prevent such unnecessary invocations, we can wrap these operations inside a function.

Lazy Resolver
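Continuing the same sketch, the lazy version simply wraps the computation in a function; graphql-js's default field resolver calls it only if the field is actually selected.

```javascript
// Same sketch as above, with the computation wrapped in a function.
const resolvers = {
  Query: {
    match: async (_, { id }, { matchService }) => {
      const match = await matchService.fetch(id);
      return {
        id: match.id,
        title: match.title,
        // Lazy: graphql-js's default field resolver calls this function only
        // when the query actually selects groupedPlayers.
        groupedPlayers: () => groupPlayerByType(match.players),
      };
    },
  },
};
```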

This will ensure that groupPlayerByType will be called only when groupedPlayers is queried.

7. Caching query parsing and validations

GraphQL performs three steps to evaluate the result of each query:

  1. Parse (creates an AST from the query)
  2. Validate (validates the AST against the schema)
  3. Execute (recursively executes the resolvers)

From the CPU profile, we found that 26% of CPU time was spent in the validation phase. The server was parsing and validating the query for every request.

CPU profile of GraphQL server during the load test

But in production we receive requests for a limited set of queries, so we could parse and validate each distinct query once and cache the result. This way, we were able to skip the redundant parsing and validation steps for subsequent requests.
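A minimal sketch of the idea with graphql-js (the schema module and cache policy are assumptions): parse and validate each distinct query text once, then reuse the cached document for execution.

```javascript
// Minimal sketch of caching parse + validation with graphql-js;
// the schema module is an assumption.
const { parse, validate, execute } = require('graphql');
const schema = require('./schema'); // hypothetical schema module

const documentCache = new Map(); // query string -> validated AST

function getDocument(query) {
  let document = documentCache.get(query);
  if (!document) {
    document = parse(query);                   // step 1: parse
    const errors = validate(schema, document); // step 2: validate
    if (errors.length > 0) throw errors[0];
    documentCache.set(query, document);        // cache the validated AST
  }
  return document;
}

async function runQuery(query, variableValues, contextValue) {
  // Steps 1 and 2 are skipped for any query text we have already seen;
  // only step 3 (execute) runs per request.
  return execute({
    schema,
    document: getDocument(query),
    variableValues,
    contextValue,
  });
}
```

In practice the cache should be bounded (e.g. an LRU keyed by query text), since clients can send arbitrary query strings.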

8. Infrastructure Tuning

All requests to `https://www.dream11.com/graphql` are routed to multiple load balancers using weighted DNS. Weighted DNS doesn't guarantee an exact distribution of requests because of DNS caching on the client side, i.e. requests from a given client go to the same load balancer for a period of time (the DNS TTL). So even if we assign equal weights to all load balancers, some of them get extra requests, which puts extra load on the instances behind them.

This variance in distribution is directly proportional to the number of load balancers and DNS TTL.

  • After all the optimisations, we were able to reduce the number of servers, which in turn allowed us to reduce the number of load balancers.
  • We tuned the TTL value to reduce the variance in traffic distribution.

Results

We started the optimisation project in November 2020 and had only 5 months to identify performance hogs, optimise, test, deploy, and scale down before IPL 2021. We did massive refactors and multiple deployments during this period. Here are some of the results:

Latency

In IPL 2021, overall p95 latency was reduced by 50% compared to IPL 2020, which resulted in a better user experience and allowed us to reduce our infrastructure footprint further.

P95 latency graph of GraphQL

GQL time

Average GraphQL execution time was reduced by 70%.

GraphQL Execute time

Average Performance

We tracked average performance week on week, and after all the optimisations and infrastructure tuning it improved by more than 50%.

Week-on-week MIS data plot

Relative Cost

From being the most expensive service by a very high margin, GraphQL is now comparable to other services in terms of cost.

AWS Infrastructure cost of services at Dream11

Instance Count

GraphQL production instance count for serving 5M concurrent users

After a lot of micro-optimisation and some infrastructure tuning, we were able to serve 5M concurrent users with 80% fewer instances in the first half of IPL 2021 compared to IPL 2020. Our daily average cost was reduced by 60%, and projecting a similar trend, this will help us save more than 1 million dollars during IPL 2021.

Key Takeaways

Aggregation of Marginal Gains

When we started, we couldn't find any low-hanging fruit. We approached our target of reducing infrastructure from first principles. We questioned every decision, right from choosing Node.js as the stack to the AWS instance type. We load tested the slightest improvement we could think of. Some of the key takeaways from this project:

  • Instrumenting efficient ways to measure performance is key for optimising performance.
  • Many small optimisations can add up to a large aggregate performance improvement.
  • At the edge layer, SLA for latency can be relaxed.

If you are interested in solving complex engineering problems at scale, join the Dream11 team by applying here.


