Friday, January 13, 2023

API Mismatch: Why bolting SQL onto noSQL is a bad idea

TLDR; Use abstractions that are designed to hide the complexity of the underlying technology, not those that expose its limitations and do not match its semantic model.

I recently started experimenting with Remix for a new side-project idea using a tutorial with a starter template, that happened to include Prisma (a SQL data access layer, self-described TypeScript ORM) with sqlite as its database.

I've not had that much to do with Prisma, but its API seems self-explanatory. The automatic schema management, and the schema migration feature, seem like an excellent way to manage new tables and indexes you add to your code through the deployment cycle (something many ORMs don't give you).

This got me thinking about DynamoDB (a database I use day to day, and for which I maintain dynaglue, a single-table mapping layer for TypeScript/JavaScript), and made me wonder if a DynamoDB adapter existed. I stumbled across this GitHub issue, and arrogantly shot off this tweet not expecting any replies.

But DynamoDB supports SQL

As one commenter pointed out, DynamoDB supports PartiQL, which helps provide something of a SQL like interface to DynamoDB, which is completely true, and on the face of it, would be beneficial for creating a DynamoDB adapter for ORM layers like Prisma.

What that conceals from the discussion, however, is that the supported subset of PartiQL is only that which maps directly onto DynamoDB design. That subset is no-where near the set of features that are considered standard in a SQL database.

For starters, APIs like groupBy, count and aggregate, update and updateMany have no DynamoDB equivalent (and hence no PartiQL equivalent). They can be simulated by an adapter, but that means Prisma is no longer behaving as an ORM, but is inheriting responsibilities of the database itself.

Much worse, is the fact that PartiQL will supports parts of findUnique, findMany, etc., but unless the developer specifies a primary key field in their WHERE clause, it will fallback to using a scan on the table. A scan can be reasonably fast for a small number of items, but becomes linearly slower on larger tables.

Developers working with relational databases will be well aware of this problem - a query that performs adequately in development (because there are few items in your table) will perform atrociously in production, the fix being simply to add an index. This is the case for DynamoDB as well, but there are limits on how you can index existing documents (you will need a data migration if you need to index a nested value).

If pay-per-request pricing is used with DynamoDB, this scan can also be expensive, as users are charged per items scanned in the partition, not what is returned in the response. (When provisioned capacity is used, the developer is more likely to saturate the available read units, especially if parallel scanning is deployed). A relational database will just degrade in overall performance until it is scaled upwards.

Do you have a plan?

Relational databases are smart enough to work out what indexes they have available to them and "plan" how to execute a query based on what is available. This means they may be able to satisfy the entire WHERE clause using an index, fall back to scanning and index and then filtering the results, combining and filtering the results of looking up or scanning multiple indexes, or scanning and filtering the entire table.

The decisions a query planner makes about what indexes to use or not, comes down to both its knowledge of those indexes, and the statistics collected on the use of those indexes. This allows for a relatively sophisticated query execution based on the database engine's initimate knowledge of the underlying data storage and its ongoing usage.

The statistics a SQL engine can store includes how long it takes to access an index based on certain query parameters, how much data is populated in that index, etc., which are all used to inform the query planner's decision making process.

On the other hand, a DynamoDB database provides none of this to PartiQL (its not even available to client applications). At best, it can identify indexed fields in the WHERE clause, and select indexes that may help to speed up the query by reducing the set of results to scan.

However it won't be able to make any sophisticated decisions on how to use those indexes. It does not have any information about how sparsely those indexes are populated, relative lookup times, etc. that a relational engine would normally use to speed up execution. In most cases, its going to fall back to a full database scan.

Why a DynamoDB scan is not the same as a relational SQL scan

In my mind, this isn't even the worst part, which boils down to a fundamental difference is how data is stored and retrieved in DynamoDB vs relational databases.

In relational engines, the query is executed very close to the where the data is stored (typically on the same node). This means that a scan can be relatively fast, even over a large number of items, because the retrieval and filtering is performed on the database node before a response is streamed back to the application.

On the other hand, DynamoDB will perform filtering on the node, but it does it in fixed size chunks. The application must wait an entire round-trip over the network to request a scan on the next part of the index. This problem is made much worse when most of the index's elements are filtered out.

The relational database only has to wait between the memory and the storage drives (or if splitting over a cluster, for batches of results to come back from each node). Each round-trip from the application and the database can be optimised to return as much as possible, which is not possible with DynamoDB.

(The limits on scanning can be ameleorated using a parallel scan, but this is best done as a deliberate implementation detail where the potential costs and resource usage are considered beforehand, not as an off-hand consequence of an untested query).

Abstractions and depth of understanding

The value in a layer like Prisma is that it abstracts away the hairier details of a database engine from the developer, leaving the indexing and query writing to the underlying adapter. Developers can focus on business logic instead of technical details.

When the database layer underlying the Prisma API cannot satisfy the majority of the API, or runs with unexpected performance and cost issues when the supported APIs are found, the value underlying the abstraction is lost. The developer now has to understand the limits of DynamoDB and be careful what they query to make sure it is optimally using indexes.

This is something where a relational database is more forgiving (just add an index when performance degrades). For developers that does understand the limits of DynamoDB, they will then be needing to migrate data to support new indexes and "drop-down" to the DynamoDB interface itself when they cannot perform what they need in the database adapter.

(I did mention before that it would theoretically be possible to build the missing functionality into the Prisma adapter itself. There may be value in building a SQL like interface over DynamoDB, but you're sacrificing one of the reasons for using it in the first place - consistency of performance under load - while absorbing the cost of maintaining what would normally be a battle-tested component of the database itself)

Use the platform, not the abstraction

Does this mean you should just use a relational database?

This is where I think technology selection is important. NoSQL databases like DynamoDB have the advantage of consistent performance and automatic scaling under increasing load, which is not something that can normally be said of relational databases.

The trade-off, of course, is learning how to use the technology, ensuring that your access patterns are well suited for NoSQL databases, and designing your data storage to optimise queries with those access patterns. You need to understand the limits of the platform and use other systems to fill those requirements not fulfilled by DynamoDB.

That said, if your team's experience is with relational databases, or your application is not suited for NoSQL workloads, SQL databases offer an excellent solution backed by decades of refinement, flexible storage options and well understood limits and scaling requirements. You're in safe hands backed by broad institutional knowledge of the technology (you'll never have a problem finding a developer who has worked with SQL).

Throw Prisma in the mix, and you have a well-worn abstraction (with the additional benefits of schema management that it brings) that will serve your needs well and let your developers focus on what matters.



from Hacker News https://ift.tt/ewLMGUr

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.