Friday, October 28, 2022

We Designed and Implemented Graph Projection Feature

Projected graphs, or subgraphs in mathematical terms, are graphs with an edge set and a vertex set belonging to a subset of an edge set and a vertex set of another graph. If you wanted to run a query module or any algorithms from the MAGE library on a subgraph, you needed to implement a custom procedure to accept a list of nodes and relationships, and then run the algorithm on the subset of those entities. This fine-tuning was necessary because in Memgraph you could only retrieve the whole graph and not parts of the database.

In recent months, users started to ask more frequently about the ability to run algorithms on a subgraph, and we in Memgraph listen to what our users need. That’s why we were set on extending the implementation of C API functions and brought to you the graph project feature that enables running algorithms on a specific subset of a graph stored in the database.

The feature consists of the project() function, which in the following example, creates a subgraph from nodes connected with a CONNECTS TO relationship type. The query then runs a PageRank on that subgraph:

MATCH p=(n:Node)-[:CONNECTS_TO]->(m:Node)
WITH project(p) as graph
CALL pagerank.get(graph, …) YIELD node, rank
RETURN node, rank;

How cool is that? Do you want to know how we did it? Read on!

Designing a feature: enabling the possibility to run query module on subgraph

The main question we were facing was how to design the creation of a subgraph from a query standpoint. The answer lies in the project() aggregation function. The function accepts path as an argument, and stores all vertices and edges of the path inside the graph structure found by the MATCH clause:

MATCH p=(n:Node)-[:CONNECTS_TO]->(m:SpecificNode)
WITH project(p) as graph

But how can we pass this result to all the other procedures in Memgraph, since they all accept a different number of required and optional parameters? We agreed that the best course of action was to make the result of the aggregation function project() the first parameter of a procedure if that procedure needs to operate on a projected subgraph. All the other parameters defined by the procedure specification then follow the subgraph parameter.

MATCH p=(n:Node)-[:CONNECTS_TO]->(m:SpecificNode)
WITH project(p) as graph
CALL pagerank.get(graph, …) YIELD node, rank
RETURN node, rank;

Extending the query module implementation

Memgraph database works with a lot of different data types, from containers like List and Map across primitives like int and float, all the way to the Vertex, Edge, and Path. To handle all these data types, Memgraph uses tagged unions. Each instance of a class represents a specific type. This is a pretty common pattern in computer science. The data structure holds a type from a list of acceptable types and the tag field indicates which type is used. We extended the list with a Graph type. When a user calls a procedure and parameters are passed to the procedure call, the procedure implementation makes one more check of the first parameter. If that first parameter is of the Graph type the procedure will operate on a subgraph.

But then, things started to get complicated. When a procedure is called, a struct called mgp_graph is initialized and holds the reference to DbAccessor representing the database accessor:

struct mgp_graph {
 memgraph::query::DbAccessor *impl;
 memgraph::storage::View view;
 …
};

DbAccessor is a class that serves as a wrapper around the storage accessor. Basically, DbAccessor enables access to all the vertices in the database, as well as operations such as, inserting a new vertex, inserting a new edge, removing a vertex or removing an edge, and so on. But, DbAccessor can give access only to the whole graph stored in the database. At this point, there is no way to retrieve only a subset of edges or a subset of vertices. Furthermore, mgp_graph is not the only structure that has access to the whole underlying graph in the database. If you look at the struct mgp_vertex, you will see that it holds the reference to VertexAccessor, which is used to return all the outbound and inbound edges of that vertex.

struct mgp_vertex {
 …
 memgraph::query::VertexAccessor impl;
 mgp_graph *graph;
};

Since most methods that C API calls work with either mgp_vertex or mgp_graph, we had to be careful to update all methods using the mgp_vertex and mgp_graph struct to properly operate either on the whole graph or the projected graph, depending on the user request. It was up to us how that update would happen, but we knew we wanted to keep performance and make operating on the subgraph as performant as possible.

The first idea was to extend the mgp_graph to have reference to the Graph object holding the nodes and edges projected in the MATCH clause.

struct mgp_graph {
 memgraph::query::DbAccessor *impl;
 memgraph::query::Graph *graph;
 memgraph::storage::View view;
 …
};

But then, each function would require a check to confirm that the Vertex or Edge objects it’s returning are a part of the projected graph. So instead of the following function

mgp_error mgp_vertex_get_id(mgp_vertex *v, mgp_vertex_id *result) {
 return WrapExceptions([v] { return mgp_vertex_id{.as_int = v->impl.Gid().AsInt()}; }, result);
}

we would get a function that looks like this:

mgp_error mgp_vertex_get_id(mgp_vertex *v, mgp_vertex_id *result) {
 return WrapExceptions([v] {
   auto vertex_id =  v->impl.Gid().AsInt();
   if (!v->graph->projected_graph->IsVertexIdInGraph(vertex_id)){
     return nullptr;
   }
   return mgp_vertex_id{.as_int = vertex_id};
 }, result);
}

Obviously, this idea was a bust. It would increase repetition and new problems would arise with each new function. Also, we would like to have more control over the code.

Another idea was to create a class that could serve as DbAccessor, VertexAccessor, and EdgeAccessor together. Such a class would also hold a reference to a projected graph if one existed. If some information is needed from the VertexAccessor or EdgeAccessor, a reference to the object is passed from the functions C API calls to the appropriate method of this new class. Then and there we would have control over return values. But the problem with such a “cluster class" is that it would group responsibilities that should be clearly separated. That is why we discarded that idea as well.

It seemed like the best idea was the classic use of dynamic polymorphism: create a class that extends DbAccessor and make virtual functions from the functions that the C API functions will call. In spite of being a good idea, it didn’t come without its own set of problems.

Only pay for what you use: static polymorphism and std::variant

As it is known in the C++ dev community, dynamic polymorphism should be used when the data types are not known at compile time. In such cases, before each call to a virtual method, the compiler needs to look at the v-table and resolve the address of a derived function. Compilers try hard to devirtualize calls as much as possible, but more often than not, they fail. In our case, types are also unknown at compile time, but on the other hand, we don’t need the cost of checking the v-table every time the function is called. That’s why we decided to modify it a bit by using std::variant. This C++17 technique might not only improve performance and enrich value semantics but also enable interesting design patterns. The std::variant enables defining a list of types that can be stored in the same object. For example, std::variant<int, float, std::string> stores either an int, float or std::string. This is how mgp_graph looks in our case:

struct mgp_graph {
 std::variant<memgraph::query::DbAccessor *, memgraph::query::SubgraphDbAccessor *> impl;
 memgraph::storage::View view;
 …
};

The SubgraphDbAccessor class references DbAccessor and Graph, and gives us control over which vertices will be returned and what data will be inserted and removed from the projected graph and consequently database. We have also updated the mgp_vertex to hold the std::variant of possible implementations:

struct mgp_vertex {
std::variant<memgraph::query::VertexAccessor,memgraph::query::SubgraphVertexAccessor> impl;
 mgp_graph *graph;
};

SubgraphVertexAccessor also references the projected graph and the VertexAccessor.

We also utilized std::variant visit, which takes the variant object and calls the correct overload:

mgp_error mgp_vertex_get_id(mgp_vertex *v, mgp_vertex_id *result) {
 return WrapExceptions(
     [v] {
       return mgp_vertex_id{
           .as_int = std::visit(memgraph::utils::Overloaded{[](auto &impl) { return impl.Gid().AsInt(); }}, v->impl)};
     },
     result);
}

In the end, std::varaint and std::variant visit solved all of our problems.

Serialization

In the end, a little something about the serialization, or the output.

Since the Bolt protocol doesn’t support graph serialization, we decided to output the projected graph as a map containing two keys: nodes, containing the list of nodes, and edges, containing the list of relationships.

So this query:

MATCH p=(n:Node)-[:CONNECTS_TO]->(m:SpecificNode)
WITH project(p) as graph
RETURN graph;

will return a result looking like this {"nodes:":[node2, node2, …], "edges":[edge2, edge2, …]}.

Using graph projection feature on query modules

And here you have it! A whole new world of possibilities opened up. You can now do graph analysis with PageRank, degree centrality, betweenness centrality, or any other algorithm on subgraphs without any additional adjustments. You can fire up a graph machine learning algorithm, such as Temporal graph networks and split the dataset inside the query to do training and validation without splitting the dataset programmatically. And last but not least, you can fire up your graphics card and use cuGraph to run graph analysis in seconds. All of these and even more algorithms you can find in Memgraph MAGE library. If you have any questions feel free to join our Discord community, and if you find something’s missing, open an issue on the GitHub page for both MAGE and Memgraph.

This feature was really fun and challenging to implement, but we at Memgraph are really glad it saw the light of the day because we believe it will help a lot of people with their graph analysis.

If you like what you’ve read, feel free to give a star to our main GitHub repositories - Memgraph database and the MAGE repo. We also appreciate contributors!

Best of luck in your further endeavors!



from Hacker News https://ift.tt/dwgPBF0

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.