Finding Out What’s Going Wrong Under the Hood 🧐
To find out what was causing the performance issues in our GraphQL service, we used the following approaches.
Profiling
We started by profiling one of the production servers during peak traffic with the help of the PM2 Enterprise profiling tool. It lets you download CPU and memory profiles for a particular process, which can then be analysed with tools like the Chrome DevTools performance panel, VS Code, etc.
CPU Profile
On analysing the CPU profile, we found that:
- On average, 20% of total CPU time was spent in garbage collection.
- GraphQL’s validate operation was taking 3% of CPU time.
- Some of the resolver code was long-running.
GC Profile
The CPU profile showed that GC was a major contributor to CPU time, so we analysed it further and found that scavenge GC was happening far too frequently.
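For illustration, Node’s built-in perf_hooks module can surface the same signal; the sketch below (a minimal example, not the exact tooling we used) counts scavenge (minor) collections and the total time spent in GC:

```js
// Minimal sketch: observe GC activity via perf_hooks and count scavenges.
const { PerformanceObserver, constants } = require('perf_hooks');

let scavenges = 0;
let gcTimeMs = 0;

const obs = new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    gcTimeMs += entry.duration;
    // Newer Node versions expose the GC kind on entry.detail, older on entry.
    const kind = (entry.detail && entry.detail.kind) || entry.kind;
    if (kind === constants.NODE_PERFORMANCE_GC_MINOR) scavenges += 1;
  }
});
obs.observe({ entryTypes: ['gc'] });

// Report GC pressure periodically while the server is under load.
setInterval(() => {
  console.log(`scavenges: ${scavenges}, total GC time: ${gcTimeMs.toFixed(1)} ms`);
}, 10000).unref();
```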
Inline Caching Heuristic
The V8 engine performs a lot of runtime optimisation based on certain heuristics, and inline caching is a crucial element in making JavaScript run fast.
We used Deoptigate to identify hot code that was not being optimised by V8, and found that a few of our util functions were megamorphic and hence left unoptimised.
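As a hypothetical illustration (not our actual code) of how a util goes megamorphic, consider a generic accessor called with many differently shaped objects:

```js
// A generic helper: the property access inside it sees many object "shapes".
function pick(obj, key) {
  return obj[key];
}

pick({ id: 1, name: 'a' }, 'id');          // shape 1
pick({ id: 2, score: 10 }, 'id');          // shape 2
pick({ id: 3, team: 'x', rank: 4 }, 'id'); // shape 3
pick({ id: 4, city: 'y' }, 'id');          // shape 4
pick({ id: 5, meta: {} }, 'id');           // shape 5
// Once a call site has seen more than four shapes, V8's inline cache goes
// megamorphic and falls back to a much slower generic property lookup.
```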
Analysing APM (Datadog) Data
dd-trace (Datadog’s tracing library for Node.js) automatically instruments the graphql module and provides APM data per query. Based on this, we were able to spot frequently invoked queries.
To gauge the complexity of a query, we tracked the time taken by GraphQL’s execute operation and the average number of HTTP calls made inside the query.
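A minimal sketch of the dd-trace setup (the service name is hypothetical); the tracer must be initialised before anything else is required so that module patching can happen:

```js
// Initialise the tracer first so dd-trace can patch the graphql module.
const tracer = require('dd-trace').init({
  service: 'graphql-gateway', // hypothetical service name
});

// From here on, dd-trace automatically emits spans for GraphQL operations
// (parse/validate/execute), giving per-query APM data in Datadog.
const { graphql } = require('graphql');
```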
Micro-benchmarking
We benchmarked each widely used library function or pattern against its alternatives using Benchmark.js. Some of the comparisons are described in the next section.
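A representative harness (the compared idiom here is just a stand-in): each suite pits one pattern against an alternative and prints ops/sec:

```js
// Benchmark.js suite comparing two equivalent membership checks.
const Benchmark = require('benchmark');

const arr = Array.from({ length: 1000 }, (_, i) => i);

new Benchmark.Suite()
  .add('Array#includes', () => arr.includes(999))
  .add('Array#indexOf', () => arr.indexOf(999) !== -1)
  .on('cycle', (event) => console.log(String(event.target)))
  .on('complete', function () {
    console.log('Fastest is ' + this.filter('fastest').map('name'));
  })
  .run();
```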
Learnings
- With DataLoader and multi-level caching in place, network usage was already well optimised.
- Our load-test results showed that CPU was the bottleneck.
- There was no low-hanging fruit.
Approaches for optimising performance
With all the profiling and benchmarking tools at hand, we started making small changes and measuring each one for performance gains.
1. GC Tuning
In our GC profile analysis, we found that during a 3-minute load test:
- the server spent 30.6 seconds in GC, and
- 556 scavenge collections were performed in that period.
This showed that the GC overhead was concentrated in the scavenge stage, i.e. the collection of the young generation. The GraphQL server was generating a large number of small, short-lived objects in the young-generation space, triggering frequent scavenge collections. The problem therefore boiled down to tuning the young generation to improve application performance.
In Node.js, the size of the young-generation space is controlled by the `--max-semi-space-size` flag, which defaults to 16 MB. We load-tested with the value raised to 128 MB, 256 MB, and 512 MB (e.g. `node --max-semi-space-size=256 server.js`) and found that the system peaked at 256 MB. After deploying this optimisation to production, CPU utilisation went down by 12%.
2. Fastify Web Server
Fastify is a fast, low-overhead web framework for Node.js. On public benchmarks, it promises to be up to 5 times faster than the Express framework. Since all the plugins/middleware we used were portable, we experimented with Fastify, and on production we got a 10% reduction in CPU utilisation.
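A minimal sketch of the Fastify setup (the route body and port are placeholders), mainly to show how small the surface change is when the middleware is portable:

```js
// Minimal Fastify server exposing a GraphQL endpoint.
const fastify = require('fastify')();

fastify.post('/graphql', async (request) => {
  // Placeholder: hand request.body.query off to the GraphQL executor here.
  return { data: null };
});

fastify.listen({ port: 3000 }, (err) => {
  if (err) throw err;
  console.log('listening on :3000');
});
```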
3. R.curry vs Uncurried
To improve the compositionality of our code, we had defined a ton of curried functions using Ramda. But every abstraction has a cost: on profiling, we found that the `curry` function was taking up a lot of CPU.
Our benchmarks showed that removing curry from hot code made it up to 100 times faster. The curry function is at the core of Ramda; almost every Ramda function is curried. From this we concluded that Ramda had become a performance hog for us.
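A sketch of the kind of refactor this implied (function names are hypothetical): the curried version routes every call through Ramda’s curry machinery, while the plain version is a direct call:

```js
const R = require('ramda');

// Before: curried for composability; every call pays R.curry's
// argument-counting and partial-application overhead.
const applyDiscountCurried = R.curry((rate, price) => price * (1 - rate));
applyDiscountCurried(0.1)(100); // 90

// After: plain uncurried function, a direct call with no wrapper.
const applyDiscount = (rate, price) => price * (1 - rate);
applyDiscount(0.1, 100); // 90
```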
4. Immutability vs Mutability
Our code was written immutably everywhere. Benchmarking immutable updates against in-place mutation showed a large enough gap that we decided to remove immutability from the hot code.
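A sketch of the kind of hot-path change this meant (the data shapes are assumed): the immutable version allocates a fresh object and array per element, feeding the scavenger, while the mutable version appends in place:

```js
// Immutable: every element copies the accumulator, creating short-lived garbage.
const groupImmutable = (players) =>
  players.reduce(
    (acc, p) => ({ ...acc, [p.type]: [...(acc[p.type] || []), p] }),
    {}
  );

// Mutable: one accumulator object, arrays grown in place.
function groupMutable(players) {
  const acc = {};
  for (const p of players) {
    (acc[p.type] = acc[p.type] || []).push(p);
  }
  return acc;
}
```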
5. Monomorphic vs Polymorphic vs Megamorphic Functions
The V8 engine optimises monomorphic and polymorphic functions at runtime, making them far faster than megamorphic ones. We converted a few frequently invoked megamorphic functions into multiple monomorphic functions and observed a performance gain.
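A hypothetical before/after of such a conversion: the generic function sees every entity shape and goes megamorphic, while the per-type functions each see a single shape and stay monomorphic:

```js
// Before: one function called with matches, players, teams, ... (megamorphic).
function getDisplayName(entity) {
  return entity.displayName;
}

// After: one function per hot entity type; each call site sees one shape only.
function getMatchDisplayName(match) {
  return match.displayName;
}
function getPlayerDisplayName(player) {
  return player.displayName;
}
```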
6. Lazy Evaluation
In GraphQL, a resolver can be either an expression that evaluates to a value or a function that returns a value (or a promise of one). Resolver expressions are eagerly evaluated, irrespective of which fields a query requests.
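A sketch of such a resolver (the structure is assumed; the field and function names come from the example discussed next, and groupPlayerByType is stubbed for illustration):

```js
// Stub for illustration; imagine this does real grouping work.
const groupPlayerByType = (players) => ({ batsmen: players });

// Eager: `groupedPlayers` is an expression, evaluated while building the object.
const matchResolver = (match) => ({
  id: match.id,
  title: match.title,
  groupedPlayers: groupPlayerByType(match.players), // runs for every query
});
```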
In the above example, even if the client queries only id and title, groupPlayerByType will still be executed. To prevent such unnecessary invocations, we can wrap these operations inside a function:
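The wrapped version (continuing the sketch above); graphql-js’s default field resolver invokes a property that is a function only when that field is actually selected:

```js
// Lazy: `groupedPlayers` is now a function, called only if the field is queried.
const matchResolver = (match) => ({
  id: match.id,
  title: match.title,
  groupedPlayers: () => groupPlayerByType(match.players),
});
```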
This ensures that groupPlayerByType is called only when groupedPlayers is actually queried.
7. Caching Query Parsing and Validation
GraphQL performs three steps to evaluate each query:
- Parse (creates an AST from the query)
- Validate (validates the AST against the schema)
- Execute (recursively executes the resolvers)
From the CPU profile, we found that 26% of CPU time was spent in the validation phase: the server was parsing and validating the query for every single request. But in production we receive requests for a limited set of queries, so we could parse and validate each distinct query once and cache the result. This way, we were able to skip the redundant parsing and validation steps for subsequent requests.
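A minimal sketch of such a cache (the cache shape is assumed; parse, validate, and execute are the real graphql-js functions): parse and validate each distinct query text once, then reuse the document on later requests:

```js
const { parse, validate, execute } = require('graphql');

const documentCache = new Map(); // query string -> parsed & validated AST

function getDocument(schema, query) {
  let document = documentCache.get(query);
  if (!document) {
    document = parse(query);
    const errors = validate(schema, document);
    if (errors.length > 0) {
      throw new Error(errors.map(String).join('\n'));
    }
    documentCache.set(query, document);
  }
  return document;
}

// Per request: cached queries skip straight to the execute phase.
function run(schema, query, variableValues) {
  return execute({ schema, document: getDocument(schema, query), variableValues });
}
```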
8. Infrastructure Tuning
All requests coming to `https://www.dream11.com/graphql` get routed to multiple load balancers using weighted DNS. Weighted DNS doesn’t guarantee an exact distribution of requests because of DNS caching on the client side, i.e. requests from a given client go to the same load balancer for a period of time (the DNS TTL). So even if we assign equal weights to all load balancers, some load balancers receive extra requests, which puts extra load on the instances behind them.
This variance in distribution is directly proportional to the number of load balancers and DNS TTL.
- After all the optimisations, we were able to reduce the number of servers, which in turn allowed us to reduce the number of load balancers.
- We tuned the TTL value to reduce the variance in traffic distribution.
Results
We started the optimisation project in November 2020 and had only 5 months to find the performance hogs, optimise, test, deploy, and scale down before IPL 2021. We did massive refactors and multiple deployments throughout this period. Following are some of the results:
Latency
In IPL 2021, overall p95 latency was reduced by 50% compared to IPL 2020, which resulted in a better user experience and allowed us to reduce our infrastructure footprint further.
GQL Time
Average GraphQL execution time was reduced by 70%.
Average Performance
We tracked average performance week on week; after all the optimisations and infrastructure tuning, it improved by more than 50%.
Relative Cost
From being the most expensive service by a very high margin, GraphQL is now comparable to other services in terms of cost.
Instance Count
After lots of micro-optimisations and some infrastructure tuning, we were able to serve 5M concurrent users with 80% fewer instances in the first half of IPL 2021 as compared to IPL 2020. Our daily average cost is down by 60%, and projecting a similar trend, we expect to save more than a million dollars during IPL 2021.
Key Takeaways
“Aggregation of Marginal Gains”
When we started, we couldn’t find any low-hanging fruit. We approached our target of reducing infrastructure from first principles, questioning every decision that had been taken, right from choosing Node.js as the stack to the AWS instance type. We ran load tests for the slightest improvement we could think of. Some of the key takeaways from this project:
- Instrumenting efficient ways to measure performance is key for optimising performance.
- Multiple small optimisations can add up to a much larger aggregated performance improvement.
- At the edge layer, SLA for latency can be relaxed.
If you are interested in solving complex engineering problems at scale, join the Dream11 team by applying here.