Projected graphs, or subgraphs in mathematical terms, are graphs whose vertex set and edge set are subsets of the vertex set and edge set of another graph. Until now, if you wanted to run a query module or any algorithm from the MAGE library on a subgraph, you had to implement a custom procedure that accepts a list of nodes and relationships and then runs the algorithm on that subset of entities. This workaround was necessary because in Memgraph you could only retrieve the whole graph, not parts of the database.
In recent months, users have asked more and more often for the ability to run algorithms on a subgraph, and we at Memgraph listen to what our users need. That’s why we set out to extend the implementation of the C API functions and bring you the graph projection feature, which enables running algorithms on a specific subset of a graph stored in the database.
The feature consists of the project() function, which, in the following example, creates a subgraph from nodes connected with the CONNECTS_TO relationship type. The query then runs PageRank on that subgraph:
MATCH p=(n:Node)-[:CONNECTS_TO]->(m:Node)
WITH project(p) as graph
CALL pagerank.get(graph, …) YIELD node, rank
RETURN node, rank;
How cool is that? Do you want to know how we did it? Read on!
Designing a feature: enabling query modules to run on a subgraph
The main question we faced was how to design the creation of a subgraph from a query standpoint. The answer lies in the project() aggregation function. The function accepts a path as an argument and stores all the vertices and edges of the paths found by the MATCH clause inside a graph structure:
MATCH p=(n:Node)-[:CONNECTS_TO]->(m:SpecificNode)
WITH project(p) as graph
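To make the aggregation step more concrete, here is a minimal C++ sketch of how many matched paths could be folded into a single graph. The Path, ProjectedGraph, and Project names are hypothetical and work on plain integer IDs; this illustrates the idea and is not Memgraph's implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical stand-in: a path is just the IDs of its vertices and edges.
struct Path {
  std::vector<std::int64_t> vertex_ids;
  std::vector<std::int64_t> edge_ids;
};

// Hypothetical projected graph: sets deduplicate entities shared by several paths.
struct ProjectedGraph {
  std::set<std::int64_t> vertex_ids;
  std::set<std::int64_t> edge_ids;
};

// Aggregates every matched path into one graph, the way an aggregation
// function folds many input rows into a single value.
ProjectedGraph Project(const std::vector<Path> &paths) {
  ProjectedGraph graph;
  for (const auto &path : paths) {
    // A path with k edges visits k + 1 vertices.
    assert(path.vertex_ids.size() == path.edge_ids.size() + 1);
    graph.vertex_ids.insert(path.vertex_ids.begin(), path.vertex_ids.end());
    graph.edge_ids.insert(path.edge_ids.begin(), path.edge_ids.end());
  }
  return graph;
}
```

Two single-hop paths that share a vertex, for example 1 to 2 and 2 to 3, would produce a graph with three vertices and two edges, because the shared vertex is stored only once.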
But how can we pass this result to all the other procedures in Memgraph, since they all accept a different number of required and optional parameters? We agreed that the best course of action was to make the result of the project() aggregation function the first parameter of a procedure if that procedure needs to operate on a projected subgraph. All the other parameters defined by the procedure specification then follow the subgraph parameter:
MATCH p=(n:Node)-[:CONNECTS_TO]->(m:SpecificNode)
WITH project(p) as graph
CALL pagerank.get(graph, …) YIELD node, rank
RETURN node, rank;
Extending the query module implementation
Memgraph works with a lot of different data types, from containers like List and Map, through primitives like int and float, all the way to Vertex, Edge, and Path. To handle all these data types, Memgraph uses a tagged union, which is a pretty common pattern in computer science: each instance holds a value of one type from a list of acceptable types, and a tag field indicates which type is currently in use. We extended that list with a Graph type. When a user calls a procedure and parameters are passed to the procedure call, the procedure implementation makes one more check on the first parameter. If that first parameter is of the Graph type, the procedure will operate on a subgraph.
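The tagged-union pattern and the extra first-parameter check can be sketched roughly as follows. TypedValue, its members, and OperatesOnSubgraph are hypothetical, heavily simplified names chosen for illustration; Memgraph's real value type supports many more alternatives.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Placeholder for the projected graph; only a pointer to it is stored below.
struct Graph {};

// Hypothetical, simplified tagged union.
struct TypedValue {
  // The tag records which union member is currently valid.
  enum class Type { Null, Int, Double, Graph };

  TypedValue() : type(Type::Null), int_v(0) {}
  explicit TypedValue(std::int64_t v) : type(Type::Int), int_v(v) {}
  explicit TypedValue(double v) : type(Type::Double), double_v(v) {}
  explicit TypedValue(const Graph *g) : type(Type::Graph), graph_v(g) {}

  Type type;
  union {
    std::int64_t int_v;
    double double_v;
    const Graph *graph_v;
  };
};

// The extra check described above: a procedure works on a subgraph only
// when its first argument carries the Graph tag.
bool OperatesOnSubgraph(const TypedValue *args, std::size_t count) {
  assert(args != nullptr || count == 0);
  return count > 0 && args[0].type == TypedValue::Type::Graph;
}
```

Because the check only looks at the tag of the first argument, procedures that are called without a projected graph keep behaving exactly as before.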
But then things started to get complicated. When a procedure is called, a struct called mgp_graph is initialized, and it holds a pointer to DbAccessor, the database accessor:
struct mgp_graph {
  memgraph::query::DbAccessor *impl;
  memgraph::storage::View view;
  …
};
DbAccessor is a class that serves as a wrapper around the storage accessor. Basically, DbAccessor enables access to all the vertices in the database, as well as operations such as inserting a new vertex, inserting a new edge, removing a vertex, removing an edge, and so on. But DbAccessor can give access only to the whole graph stored in the database. At this point, there is no way to retrieve only a subset of edges or a subset of vertices. Furthermore, mgp_graph is not the only structure that has access to the whole underlying graph in the database. If you look at the struct mgp_vertex, you will see that it holds a VertexAccessor, which is used to return all the outbound and inbound edges of that vertex:
struct mgp_vertex {
  …
  memgraph::query::VertexAccessor impl;
  mgp_graph *graph;
};
Since most C API functions work with either mgp_vertex or mgp_graph, we had to be careful to update every function that uses these structs so that it operates properly on either the whole graph or the projected graph, depending on the user's request. It was up to us how that update would happen, but we knew we wanted to preserve the existing performance and make operating on a subgraph as fast as possible.
The first idea was to extend mgp_graph with a pointer to the Graph object holding the nodes and edges projected in the MATCH clause:
struct mgp_graph {
  memgraph::query::DbAccessor *impl;
  memgraph::query::Graph *graph;
  memgraph::storage::View view;
  …
};
But then each function would require a check to confirm that the Vertex or Edge objects it returns are part of the projected graph. So instead of the following function:
mgp_error mgp_vertex_get_id(mgp_vertex *v, mgp_vertex_id *result) {
  return WrapExceptions([v] { return mgp_vertex_id{.as_int = v->impl.Gid().AsInt()}; }, result);
}
we would get a function that looks like this:
mgp_error mgp_vertex_get_id(mgp_vertex *v, mgp_vertex_id *result) {
  return WrapExceptions([v] {
    auto vertex_id = v->impl.Gid().AsInt();
    if (!v->graph->projected_graph->IsVertexIdInGraph(vertex_id)){
      return nullptr;
    }
    return mgp_vertex_id{.as_int = vertex_id};
  }, result);
}
Obviously, this idea was a bust. It would increase repetition, and new problems would arise with each new function. Also, we wanted to have more control over the code.
Another idea was to create a class that could serve as DbAccessor, VertexAccessor, and EdgeAccessor together. Such a class would also hold a reference to a projected graph if one existed. If some information were needed from a VertexAccessor or EdgeAccessor, a reference to that object would be passed from the C API function to the appropriate method of this new class. Then and there we would have control over return values. But the problem with such a “cluster class” is that it would group responsibilities that should be clearly separated. That is why we discarded that idea as well.
It seemed like the best idea was the classic use of dynamic polymorphism: create a class that extends DbAccessor and make the functions that the C API calls virtual. Although it was a good idea, it didn’t come without its own set of problems.
Only pay for what you use: static polymorphism and std::variant
As is well known in the C++ community, dynamic polymorphism is meant for cases where the concrete types are not known at compile time. In such cases, each call to a virtual method has to go through the v-table at run time to resolve the address of the derived function. Compilers try hard to devirtualize calls as much as possible, but more often than not, they fail. In our case, the types are also unknown at compile time, but we didn’t want to pay the cost of a v-table lookup every time a function is called. That’s why we decided to modify the approach a bit by using std::variant. This C++17 feature might not only improve performance and enrich value semantics but also enable interesting design patterns. std::variant lets you define a list of types that can be stored in the same object. For example, std::variant<int, float, std::string> stores either an int, a float, or a std::string. This is how mgp_graph looks in our case:
struct mgp_graph {
  std::variant<memgraph::query::DbAccessor *, memgraph::query::SubgraphDbAccessor *> impl;
  memgraph::storage::View view;
  …
};
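As a quick refresher on how a variant such as the impl member behaves, the following small sketch uses the std::variant<int, float, std::string> example from the text. The Value alias and the Describe and AsInt helpers are made up for illustration.

```cpp
#include <cassert>
#include <string>
#include <variant>

// Exactly one of the listed alternatives is stored at any time.
using Value = std::variant<int, float, std::string>;

// Returns a label for whichever alternative is currently stored.
std::string Describe(const Value &value) {
  if (std::holds_alternative<int>(value)) return "int";
  if (std::holds_alternative<float>(value)) return "float";
  return "std::string";
}

// Reads the int alternative; the caller is expected to have checked it first.
int AsInt(const Value &value) {
  assert(std::holds_alternative<int>(value));
  return std::get<int>(value);
}
```

The variant stores its value inline and remembers which alternative is active, so no heap allocation or base class is needed to switch between the types.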
The SubgraphDbAccessor class references DbAccessor and Graph, and gives us control over which vertices will be returned and what data will be inserted into and removed from the projected graph and, consequently, the database. We have also updated mgp_vertex to hold a std::variant of the possible implementations:
struct mgp_vertex {
  std::variant<memgraph::query::VertexAccessor, memgraph::query::SubgraphVertexAccessor> impl;
  mgp_graph *graph;
};
SubgraphVertexAccessor also references the projected graph and the VertexAccessor.
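The wrapping idea can be sketched as an accessor that holds a pointer to the whole-graph accessor plus the projected vertex set and only exposes what is in both. WholeGraphAccessor and SubgraphAccessor below are hypothetical stand-ins that work on integer IDs; the real SubgraphDbAccessor interface is not shown in this post, so treat this as an assumption about its general shape.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <utility>
#include <vector>

// Hypothetical stand-in for an accessor over the whole database.
class WholeGraphAccessor {
 public:
  explicit WholeGraphAccessor(std::vector<std::int64_t> vertex_ids) : vertex_ids_(std::move(vertex_ids)) {}
  const std::vector<std::int64_t> &Vertices() const { return vertex_ids_; }

 private:
  std::vector<std::int64_t> vertex_ids_;
};

// Hypothetical subgraph accessor: references the whole-graph accessor and the
// projected vertex set, and returns only vertices that are present in both.
class SubgraphAccessor {
 public:
  SubgraphAccessor(const WholeGraphAccessor *db, const std::set<std::int64_t> *projected)
      : db_(db), projected_(projected) {
    assert(db_ != nullptr && projected_ != nullptr);
  }

  std::vector<std::int64_t> Vertices() const {
    std::vector<std::int64_t> visible;
    for (auto id : db_->Vertices()) {
      if (projected_->count(id) > 0) visible.push_back(id);
    }
    return visible;
  }

 private:
  const WholeGraphAccessor *db_;
  const std::set<std::int64_t> *projected_;
};
```

Keeping the filtering inside the wrapper means the calling code does not need a membership check at every call site, which was the drawback of the first idea.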
We also used std::visit, which takes the variant object and calls the correct overload:
mgp_error mgp_vertex_get_id(mgp_vertex *v, mgp_vertex_id *result) {
  return WrapExceptions(
      [v] {
        return mgp_vertex_id{
            .as_int = std::visit(memgraph::utils::Overloaded{[](auto &impl) { return impl.Gid().AsInt(); }}, v->impl)};
      },
      result);
}
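For readers who have not seen the pattern before, here is a self-contained sketch of std::visit with an Overloaded helper. The helper below is the commonly used C++17 idiom and is only assumed to resemble memgraph::utils::Overloaded; the two accessor structs, VertexImpl, GetId, and IsProjected are simplified, hypothetical stand-ins.

```cpp
#include <cassert>
#include <cstdint>
#include <variant>

// Common C++17 idiom: inherit operator() from every lambda passed in.
template <class... Ts>
struct Overloaded : Ts... {
  using Ts::operator()...;
};
template <class... Ts>
Overloaded(Ts...) -> Overloaded<Ts...>;

// Hypothetical accessors that expose the same Gid() interface.
struct VertexAccessor {
  std::int64_t gid;
  std::int64_t Gid() const { return gid; }
};
struct SubgraphVertexAccessor {
  VertexAccessor wrapped;
  std::int64_t Gid() const { return wrapped.Gid(); }
};

using VertexImpl = std::variant<VertexAccessor, SubgraphVertexAccessor>;

// One generic lambda covers both alternatives; no virtual call is involved.
std::int64_t GetId(const VertexImpl &impl) {
  assert(!impl.valueless_by_exception());
  return std::visit(Overloaded{[](const auto &accessor) { return accessor.Gid(); }}, impl);
}

// Separate overloads let each alternative be handled differently.
bool IsProjected(const VertexImpl &impl) {
  return std::visit(Overloaded{[](const VertexAccessor &) { return false; },
                               [](const SubgraphVertexAccessor &) { return true; }},
                    impl);
}
```

A generic lambda is enough when both alternatives share an interface, as in GetId, while separate overloads are useful when the subgraph case needs different handling, as in IsProjected.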
In the end, std::variant and std::visit solved all of our problems.
Serialization
Finally, a little something about serialization, or the output.
Since the Bolt protocol doesn’t support graph serialization, we decided to output the projected graph as a map containing two keys: nodes, containing the list of nodes, and edges, containing the list of relationships.
So this query:
MATCH p=(n:Node)-[:CONNECTS_TO]->(m:SpecificNode)
WITH project(p) as graph
RETURN graph;
will return a result that looks like this: {"nodes": [node1, node2, …], "edges": [edge1, edge2, …]}.
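The conversion into a two-key map can be sketched as follows. GraphValue, SerializedGraph, and ToMap are hypothetical names, and nodes and edges are reduced to integer IDs; the actual Bolt encoding of nodes and relationships is richer than this.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical projected graph, reduced to vertex and edge IDs.
struct GraphValue {
  std::vector<std::int64_t> vertex_ids;
  std::vector<std::int64_t> edge_ids;
};

// A map is a value kind the protocol already supports, so the graph is
// flattened into one with exactly two keys.
using SerializedGraph = std::map<std::string, std::vector<std::int64_t>>;

SerializedGraph ToMap(const GraphValue &graph) {
  SerializedGraph result{{"nodes", graph.vertex_ids}, {"edges", graph.edge_ids}};
  assert(result.size() == 2);
  return result;
}
```

Reusing an existing value kind means clients need no protocol changes to receive a projected graph.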
Using the graph projection feature with query modules
And there you have it! A whole new world of possibilities has opened up. You can now do graph analysis with PageRank, degree centrality, betweenness centrality, or any other algorithm on subgraphs without any additional adjustments. You can fire up a graph machine learning algorithm, such as Temporal Graph Networks, and split the dataset inside the query to do training and validation without splitting the dataset programmatically. And last but not least, you can fire up your graphics card and use cuGraph to run graph analysis in seconds. You can find all of these algorithms, and many more, in the Memgraph MAGE library. If you have any questions, feel free to join our Discord community, and if you find that something is missing, open an issue on the GitHub page of either MAGE or Memgraph.
This feature was really fun and challenging to implement, and we at Memgraph are really glad it saw the light of day because we believe it will help a lot of people with their graph analysis.
If you like what you’ve read, feel free to give a star to our main GitHub repositories: the Memgraph database and the MAGE repo. We also appreciate contributors!
Best of luck in your further endeavors!