Monday, 25 July 2011

Neo4j: Super-Nodes and Indexed Relationships, Part I

As part of my recent work for Open Credo, I have been involved in the project that was using Neo4J Graph database for application storage.
Neo4J is one of the first graph databases to appear on the global market. Being open source, in addition to its power and simplicity in supporting graph data model it represents good choice for production-ready graph database.
However, there has been one area I have struggled to get good-enough performance from Neo4j recently – super nodes.
Super nodes represent nodes with dense relationships (100K or more), which are quickly becoming bottlenecks in graph traversal algorithms when using Neo4J. I have tried many different approaches to get around this problem, but introduction of auto indexing in Neo4j 1.4 gave me an idea that I had success with. The approach I took is to try to fetch relationships of the super nodes using Lucene indexes, instead of using standard Neo APIs. In this entry I’ll share what I managed to achieve and how.
To start, let’s generate a sample graph with super-nodes. Here is the amount of generated data:
  • 5 nodes representing cities/towns (for example “London”,  “Manchester”, “Edinburgh”, “Cambridge” and “Oxford”)
  • Each city node has “IS_IN” relationship to a country node UK, (every city is related to a country node with IS_IN relationship, and we have only one country node - UK)
  • We have 100K users, and each of the users have a “HAS_VISITED” relationship to each of the five cities. I know this is not a likely scenario, but in a large data set (say among all Facebook users) you will find more then 5 cities that more then 100K users have visited
  • That’s in total 100,006 nodes (100K users+5 cities + 1 UK country node) and 500,005 relationships (100K users-HAS)_VISITED-5 cities + 5 cities IS_IN UK)
That’s not a huge amount of data, but it’s large enough to start with.
Let’s say that we want to load “IS_IN” relationships for a given city node. It’s quite simple thing to do using Neo API:


Listing 1: Iterating through node relationships using standard Neo4j API


Iterable relationships =
city.getRelationships(DynamicRelationshipType.withName("IS_IN"));
for (Relationship r : relationships) {
cnt++;
}

How long does this take for a single city node? I have run this code snipped repeatedly 10 times (to account for the cache hits in subsequent runs), and here are the results:

Table 1: Performance of fetching relationships from super-node with 100K relationships, using standard Neo4J API

Execution
1
2
3
4
5
6
7
8
9
10
Time elapsed (ms)
1664
<1
<1
1
<1
<1
<1
1
<1
<1


The first time the relationship is fetched, it takes 1.6 seconds. Each subsequent request remarkably takes less then 1 ms! We can thank Neo4j’s caching and mapped memory mechanism for this increase in speed (or the developers who implemented it to be more precise).
But I was still worried about the performance of the  first run. For a single relationship fetched (there is only one “IS_IN” relationship for each city node), 1.6 seconds seems a lot. The reason for the poor performance is that a city node has 100,001 relationships in total, and only one of those is of type IS_IN. In order to fetch the right one, Neo4j engine will have to look at all the relationships, then compare their types, and return only those that match the requested type.
While the system would perform well when the cache warms up, this can still cause a problem if you often access “cold” parts of the graph. While Neo4j tries to keep most of the data in caches or mapped-memory, this cannot always be achieved. When the graph is to large to fit in entirely If the super-nodes cannot be kept in memory all the time (either because there is too many of them, or the data more frequently accessed takes all cache/mapped memory available), then the performance will be adversely affected.
In terms of large data systems 100K relationships per node is not an awful lot. What would happen if you have super nodes with 1M relationships? Let’s see for ourselves!
I’m going to keep 5 city nodes in the graph, with each one having one IS_IN relationship with the UK country node. But I’ll increase number of users to 1,000,000. And assume that each user visitied each of 5 cities. So we’ll have 1,000,001 relationships per city node. I repeated the code from listing 1 to fetch the relationships of type IS_IN (there will be only one), and repeat it 10 times to see the results when cache kicks in. Table 2 shows the results.

Table 2: Performance of fetching relationships from super-node with 1M relationships, using standard Neo4J API

Execution:
1
2
3
4
5
6
7
8
9
10
Time elapsed (millis):
4805
<1
<1
1
<1
<1
<1
1
<1
<1


The first time run is now close to 5 seconds, with subsequent calls which hit the cache still taking less then 1ms!
The performance degradation is not linear (with input increased 10 times, the time was only 3 times longer) – and that’s good. When its warm, the system works well with sub-ms responses.
But sometimes our requirements do not allow for the warming-up time. If you have a traversal that needs to pass through 10 super nodes, you may be getting close to 1 min a traversal. If you have a batch job that has to process 100K such traversals, and the data set is such that it cannot be stored in memory all the time, it may take 100K minutes, or more then a month to complete! What can we do to improve this?
One option can be to deploy HA Neo4j cluster, and partition calculation across the cluster. I haven’t tried this, and depending on your data set size and requirements it may be a viable solution.
But we’re going to concentrate on the simpler solution here: how can we fetch only relationships of the required type, ignoring the huge number of relationship of other types that we’re not interested in. We’re going to use Neo4j’s built-in Lucene indexing support to fetch the required relationships from the index directly. In order to do that, we’ll need to add the relationship type to the index whenever new relationship is created.
Firstly, we are going to configure Neo4J ‘s relationship auto-indexing capability by adding following properties to the neo4j.properties file, as listing 2 illustrates.


Listing 2: Configuring relationship auto indexing in neo4j.properties file

relationship_auto_indexing=true
relationship_keys_indexable=type

We are enabling the relationship auto-indexing using relationship_auto_indexing, property, and we are going to tell Neo to track and index property name type of every relationship created.
We have to make a slight change to our data generator and set the type property for every relationship we create:


Listing 3: Creating relationship with the type name as property, to enable relationship auto-indexing


DynamicRelationshipType relType = DynamicRelationshipType.withName("IS_IN");
Relationship isA = city.createRelationshipTo(ukCountry, relType);
isA.setProperty("type", relType.name());
DynamicRelationshipType hasVisited = DynamicRelationshipType.withName("HAS_VISITED");
Relationship rel = user.createRelationshipTo(city, hasVisited);
rel.setProperty("type", hasVisited.name());

Having inserted the data for all 1M users, we are going to fetch IS_IN relationships for a single city node using auto index:


Listing 4: Creating relationship with the type name as property, to enable relationship auto-indexing


Iterable relationships = autoIndex.query("type", "IS_IN", city, null);

for (Relationship r : relationships) {
cnt++;
}

Now let’s measure how long loading the relationships from the index takes – as in the previous runs, we will fetch the relationships 10 times to allow the cache to wam up. Table 3 shows the results:

Table 3: Performance of fetching relationships from super-node with 1M relationships, using relationship auto-indexing 

Execution: 1 2 34 5 6 7 8 9 10
Time elapsed (millis):12 <1 <1 1 <1 <1 <1 1 <1 <1


Thanks to the Lucene index, the “cold” fetch has taken only 12ms on this run. The subsequent calls are still very quick, although slightly slower than before, as now we see a few 1ms times among majority of sub ms calls.
In the previous example we saw how we can ensure a more efficient lookup of relationships of particular type.  Where we are interested in traversal operations ensuring we retrieve only the relationships we are interested in traversing requires the use of a custom RelationshipExpander . The code in listing listing 5 shows an implementation of the relationship expander which ensures we use the Lucene index to retrieve relationships of a specified type.


Listing 5: Implementation of the IndexedRelationshipExpander that uses Lucene index to fetch relationships for node


public class IndexedRelationshipExpander implements RelationshipExpander{
private final GraphDatabaseService graphDatabaseService;
private final RelationshipType relationshipType;
private final Direction direction;


private IndexedRelationshipExpander(GraphDatabaseService graphDatabaseService, RelationshipType relationshipType, Direction direction) {
this.relationshipType = relationshipType;
this.graphDatabaseService = graphDatabaseService;
this.direction = direction;
}


public static IndexedRelationshipExpander create(GraphDatabaseService graphDatabaseService, RelationshipType relationshipType, Direction direction){
return new IndexedRelationshipExpander(graphDatabaseService, relationshipType, direction); #1
}

@Override
public Iterable expand(Node node) {
Iterable hits = null;

ReadableRelationshipIndex autoIndex = this.graphDatabaseService.index().getRelationshipAutoIndexer().getAutoIndex();
switch (this.direction){
case OUTGOING:
hits = autoIndex.query("type", this.relationshipType.name(), node, null); #2
break;
case INCOMING:
hits = autoIndex.query("type", this.relationshipType.name(), null, node); #3
break;
case BOTH:
IndexHits out = autoIndex.query("type", this.relationshipType.name(), node, null);
IndexHits in = autoIndex.query("type", this.relationshipType.name(), null, node);
hits = Iterables.concat(out, in); #4
break;
}
return hits;
}

@Override
public RelationshipExpander reversed() {
return new IndexedRelationshipExpander(graphDatabaseService, relationshipType, direction.reverse());
}
}

In order to access the index, the IndexedRelationshipExpander requires a reference to the GraphDatabaseService. In addition, we need to provide the relationship type and direction we are interested in to the IndexedRelationshipExpander (#1). The implementation of the  expand() method simply checks the direction of the relationship, and queries the relationship auto-index (#2, #3 – the difference is only in the startNode/endNode arguments). Where we need relationships in both directions, we have to perform two separate searches one for incoming and one for outgoing relationships. Then we need to concatenate the two results,  in the example this is done using the Google Guava library (#4)
And that’s all. We can now traverse the relationships of the specified type  using the new relationship expander:


Listing 6: Using custom IndexedRelationshipExpander in Neo4J Traversal API


RelationshipType relationshiptype = DynamicRelationshipType.withName("IS_IN");
Direction direction = Direction.OUTGOING;
TraversalDescription traversal = Traversal.description().
evaluator(Evaluators.atDepth(1)).expand(IndexedRelationhipExpande.create(this.embeddedGraphDatabase, relationshipType, direction));
traversal.traverse(city);

NOTE that you have to have indexed type property on every relationship, which contain the name of the relationship type so the IndexedRelationshipExpander works correctlyGiven this basic implementation of a RelationshipExpander, it’s easy to adapt it for you specific requirements. For example, you can index the creation_date property for every relationship and then traverse only relationships created after a given date. You can include any indexed relationship property when creating a Lucene query, and implement it in the RelationshipExpander.expand() method.
We have now seen how we can use Lucene indexing to quickly fetch and traverse  a small subset of the relationships on the super node.  The effectiveness of this approach is however dependent on you being interested in only a small proportion of the relationships.  In the next blog entry we will explore how we can efficiently load a larger number of relationships from a supernode.
n.b. I used Neo4j 1.4 release to run samples described in this blog. The indexed relationship expander described wont work with earlier version of Neo4J as it depends on auto indexing capabilities on Neo4j.All tests are performed on 2.66GHz i7 MacBook Pro with 8GB RAM memory. Java heap size was set to 2GB and default Neo4J mapped memory settings were used. 
n.b.b.  There has been some development around super-node traversal improvements from the Neo4J guys, it can be found here https://github.com/peterneubauer/graph-collections. Once I try the new features, I'll let you know here in one of the next blog entries.
NOTE: The second part of this blog entry, which covers fetching large number of relationships from a super node, is now available here












16 comments:

  1. Hi,
    First of all thanks for the good post ;)
    I came across the same problem in Bio4j project (http://www.bio4j.com) and was thinking of indexing relationships by hand.
    I guess with this new auto-indexing feature will be easier; still I don't like much the thing with indexing relationships... I'd rather have a more powerful mechanism implemented at core level that allowed distinguishing between different relationship types at a decent speed.
    Cheers,

    Pablo

    ReplyDelete
  2. Hi,

    nice post.
    I adapted this post and realized it using the sones GraphDB. This has the advantage that the edges will not be indexed. So you can address the edges of the cities directly. That is why it doesn't matter if there are 100k or 500M edges on those supernodes. Here are my results:
    http://yfrog.com/gy4bkip

    If you have any questions contact me via henning.rauch@sones.com or via twitter -> cosh23

    Greetings, Henning.

    ReplyDelete
  3. @Henning: Your results bury that you are comparing a persistent graph database with an in-memory solution. You did not say anything about the deserialization of the edges or hyperedges and as far as I know your database has to deserialze the entire vertex to access any data stored within.

    Cheers,
    Achim

    ReplyDelete
  4. @AchimFriedland: well, that's not the point. My post was about architecture and not the speed of InMemory graphs. I wanted to make clear that there is a significant speedup when using PropertyIDs for certain edges. So that you are able to request the "IS_IN" edge in exactly ONE step and not to search all edges for it.
    Yes, you are right... an InMemory-solution is faster than a persistent one. But even a InMemory-version of the scenario described above would have to search that 100001 edges for the desired one. So it would also be slower than the version I brought forward.

    If you have any questions concerning the architecture: http://forum.sones.de

    Greetings, Henning.

    ReplyDelete
  5. Because the admin of this web page is working, no question
    very rapidly it will be well-known, due to its feature contents.


    my homepage ... book of ra online spielen

    ReplyDelete
  6. Hi! Someone in my Myspace group shared this site
    with us so I came to give it a look. I'm definitely enjoying the information. I'm book-marking and
    will be tweeting this to my followers! Superb blog and great style and design.


    My homepage; book of ra kostenlos f�roid

    ReplyDelete
  7. I every time spent my half an hour to read this webpage's articles or reviews all the time along with a cup of coffee.

    my web site :: book of ra download chip online

    ReplyDelete
  8. Tebrikler siteniz çok güzel olmuş

    ReplyDelete
  9. I take pleasure in, result in I discovered exactly what
    I used to be having a look for. You've ended my 4 day long
    hunt! God Bless you man. Have a great day. Bye

    my weblog Nike Free 5.0 V4

    ReplyDelete
  10. Wonderful article! That is the type of information that
    should be shared around the net. Shame on Google for no longer positioning this submit higher!
    Come on over and visit my website . Thank you =)

    Have a look at my page liftderma (krank.kr)

    ReplyDelete
  11. Hi there to all, how is the whole thing, I think every
    one is getting more from this site, and your views are nice designed for new users.


    My blog post ... buy a house in the uk

    ReplyDelete
  12. Hey! This post could not be writteen any better! Reading this post reminds me of my good old room mate!
    He always kept chatging about this. I will forward
    this page tto him. Fairly certain he willl have a good
    read. Thank you for sharing!

    Look into my website :: throne rush hack ios (tinyurl.com)

    ReplyDelete
  13. Link exchange is nothing else but it is just placing the other person's
    webpage link on your page at suitable place and other person will
    also do same in support of you.

    Feel free to surf to my web site ... best only

    ReplyDelete
  14. Therefore, taking too many times in the bloodstream.
    Too much protein to turn to be willing to release amino acids are the side-effects which result
    from many strength training as a one pound of the title.

    And no, you can save money and shed regulate of the adult exercises are working out.


    Look into my webpage ... best protein powder

    ReplyDelete
  15. Hi! Do you know if they make any plugins to help with SEO?
    I'm trying to get my blog to rank for some targeted
    keywords but I'm not seeing very good gains. If you know of any please share.
    Cheers!

    Here is my web-site; protein shakes for weight loss

    ReplyDelete
  16. Artificial baits include spoons, jigs, ribbon baits, rattler baits & streamer baits.
    For instance at a time when anglers where sticking rigidly to the mantra of 3 to 5 milliliters of flavour, plus 1 to 2 milliliters
    of intense sweetener to a pound of base mix or more, as measured
    by 4 large eggs or 6 medium eggs as a basis of
    measurement, I tried other approaches and reaped big rewards against much more
    experienced anglers. If you are getting ready to go deep-sea fishing,
    there are a few things you might want to know about sea fishing bait.



    Feel free to visit my web site dunnage how to do box braids

    ReplyDelete