Monday, 15 August 2011

Neo4j: Super-Nodes and Indexed Relationships, Part II

In the previous post we compared the performance of fetching relationships from densely populated nodes using Neo4j native store and using lucene index. We've seen that we can fetch the small subset of relationships from a super-node (containing ~1M relationships in total)  directly from the Lucene index, the performance of the first run (cold-caches) is better then using the Neo store directly. The subsequent runs with caches warmed up show comparable performace, slightly in favour of direct Neo store fetching, sue to low level cache optimizations.
To conclude our research, in this blog entry we're going to take a look at the performance of fetching large number of relationships from the super-nodes.

Let's remind ourselves with the data set we are using:
  • 5 nodes representing cities/towns (for example “London”,  “Manchester”, “Edinburgh”, “Cambridge” and “Oxford”)
  • Each city node has “IS_IN” relationship to a country node UK, (every city is related to a country node with IS_IN relationship, and we have only one country node - UK)
  • We have 1M users, and each of the users have a “HAS_VISITED” relationship to each of the five cities. I know this is not a likely scenario, but in a large data set (say among all Facebook users) you will find more then 5 cities that more then 100K users have visited
  • To summarise, we have 1,000,006 nodes (1M users+5 cities + 1 UK country node) and 5, 000,005 relationships (1M users-HAS)_VISITED-5 cities + 5 cities IS_IN UK).

In the previous blog entry we were fetching IS_IN relationships from each city super-node, with every city having exactly one relationship of that type (of 1,000,001, or 0.0001%).
In this example. we're going to load all HAS_VISITED relationships for a city node, which means that we're interested in loading 99.9999% relationships on the given node. We're going to compare the use of the Neo4j API to traverse through relationships, and the IndexedRelationshipExapnder we introduced in the previous blog entry.
To fetch all HAS_VISITED relationships using Neo API, we're going to use similar code as before:
Listing 1: Using Neo4j API to load relationships
Iterable<Relationships> relationships = city.getRelationships(DynamicRelationshipType.withName("IS_IN"));
for (Relationship r : relationships) {
   cnt++;
}

To fetch relationships from the Lucene index, we are going to use Neo's Traversal API with our IndexedRelationshipExpander
Listing 2: Using IndexedRelationhipExpander (implemented in previous post) to load relationships from Lucene index
RelationshipType relationshiptype = DynamicRelationshipType.withName("HAS_VISITED");
Direction direction = Direction.INCOMING;
TraversalDescription traversal =
    Traversal.description()
    .evaluator(Evaluators.atDepth(1))
    .expand(
        IndexedRelationhipExpander.create(this.embeddedGraphDatabase, relationshipType, direction)
    );
traversal.traverse(city);

Table 1 shows the performance comparison of the standard expander and our new indexed expander when fetching all relationships of type HAS_VISITED from the city node. Remember, each city node will have 1M HAS_VISITED relationships.
Table 1: Performance comparison of the two approaches, by loading all relationships from super-nodes
Execution 1 2 3 4 5 6 7 8 9 10
Standard expander Time (ms) 5118 1702 1398 1331 1370 1151 1194 1119 1140 1135
Indexed expander Time (ms) 14976 8171 7856 7996 7948 7908 8803 7977 7857 7945

AS you can see, the first run using the standard expander takes just over 5 seconds, compared with almost 15 seconds using Lucene. Subsequent runs with warmed-up caches show even more variation, just over 1 second using the standard expander and around 8 seconds using the indexed relationships expander. This shows that loading large result sets from the Lucene index is considerably slower than loading relationships from the Neo4j store directly. This is probably expected, as Neo4j isn't designed to be used as a Lucene store, but the graph database.
In reality (or at least in the use cases I came across), you rarely want to traverse through huge number of dense relationships on super nodes (such as HAS_VISITED relationship in our example). If you load a small subset of relationships from the super-node, the indexed relationship expander will work well. When you want to retrieve a larger a number of relationships without retrieving them all in one a go, a better approach is to order the hits from the Lucene index in the expander (this is quick in lucene), and only return a configured maximum number of relationship that you can handle in a timely manner. Why load 1M of users that have visited the given city, when you can only display 10 on the screen – load 10 instead!
We implemented OrderedIndexedRelationshipExpander to illustrate how we can use the power of Lucene to perform sorting and return only top N results:
Listing 3: Relationship expander that orders and pages the Lucene search results
public class OrderedIndexedRelationhipExpander implements RelationshipExpander{
 //The rest of the class is same as in the
 //IndexedRelationshipExpander
 //Omitted for clarity
public static final int DEFAULT_MAX_HITS = 100000;

    @Override
    public Iterable<Relationship> expand(Node node) {

        QueryContext queryContext = new QueryContext(this.relationshipType.name())
          .sort(Sort.INDEXORDER)
      .top(DEFAULT_MAX_HITS); #1
        Iterable<Relationship> hits = null;
        ReadableRelationshipIndex autoIndex = this.graphDatabaseService.index().getRelationshipAutoIndexer().getAutoIndex();
        switch (this.direction){
            case OUTGOING:
                hits = autoIndex.query("type",queryContext, node, null);
                break;
            case INCOMING:
                hits = autoIndex.query("type", queryContext, null, node);
                break;
            case BOTH:
                IndexHits<Relationship> out = autoIndex.query("type", queryContext, node, null);
                IndexHits<Relationship> in = autoIndex.query("type", queryContext, null, node);
                hits = Iterables.concat(out, in);
                break;
        }

        return hits;
    }

}

The only change we made from the previous example is line (#1), where we create QueryContext using the search term, sort order and the number of top hits to return. Everything else is exactly the same as in the IndexedRelationshipExpander.
Let’s now repeat the excersize, traversing the HAS_VISITED relationships on the super-node. Remember, we’re returning 100K top results – still a large enough collection.
Listing 4: Using OrderedIndexedRelationshipExpander to load top 100K relationships
RelationshipType relationshiptype = DynamicRelationshipType.withName("HAS_VISITED");
Direction direction = Direction.BOTH;
TraversalDescription traversal =
    Traversal.description()
    .evaluator(Evaluators.atDepth(1))
    .expand(
        OrderedIndexedRelationhipExpander.create(this.embeddedGraphDatabase, relationshipType, direction)
    );
traversal.traverse(city);

Table 2 shows the measured performance results:
Table 2: OrderedIndexedRelationshipExpander performance results
Execution 1 2 3 4 5 6 7 8 9 10
Indexed expander Time (ms) 2981 965 593 603 616 607 634 636 651 573
The first-pass hit is under 3 seconds, with warm caches performance stabilizes around 500ms, which is a huge improvement.

We knew that Neo4j sometimes struggles with super nodes traversal. We also knew that Neo4j has great Lucene search engine capabilities built in the server engine.
In this blog we demonstrated that using the Neo store has better performance then Lucene index when fetching large number of relationships from super-nodes. However, the performance was not ideal (over 1 second).
Using some practical common sense, and using Lucene engine for ordering and paging of search results, we managed to increase the performance by more the double.
Warm caches results are still in favour of Neo store node/relationship retrieval - if you can live with slow first pass, then you should consider using the power of Neo engine. If, however, you can take advantage of paging, and if the first request performance is important for your use case, you may consider looking up the relationships in the Lucene index.
This approach is tested for nodes with large number of relationships. Neo engine has been optimized to handle traversals for the "standard" nodes (with up to 1000 relationships for example), so if your domain can be modeled in a graph without any super-nodes, you should try to use only the power of Neo graph engine.

6 comments:

  1. where exactly do you load the code within Neo4j

    ReplyDelete
  2. Wonderful blog! I found it while surfing around on Yahoo News.
    Do you have any tips on how to get listed in Yahoo News?
    I've been trying for a while but I never seem to get
    there! Thanks

    Also visit my web blog :: worldofwarplanes.com

    ReplyDelete
  3. Howdy I am so delighted I found your web site, I
    really found you by mistake, while I was browsing on Bing for something else, Regardless I am here now and
    would just like to say thank you for a fantastic post and a all
    round interesting blog (I also love the theme/design), I don’t have time to browse it all at the minute but I have book-marked it and also
    included your RSS feeds, so when I have time I will be back to read a lot more, Please do keep up the fantastic jo.


    Feel free to surf to my web site: clash of clans hack no download

    ReplyDelete
  4. I every time used to read article in news papers but now as I am a user of web so
    from now I am using net for posts, thanks to web.

    Also visit my web page http://planet326.dothome.co.kr/xe/?document_srl=22144

    ReplyDelete
  5. Tɦе otҺеr day, while I wɑs at woгk, my cousin stole my iPadd and tested tߋ sеe
    if it can survive a 40 foot drop, jսst ѕo ѕhе cann bе
    a youtube sensation. Мy iPad is now broken annd ѕhe haѕ 83 views.

    I know tnis iѕ entirely off topic Ƅut I haad tо share іt with someߋne!


    Αlso vvisit mƴ webpage ... http://www.smyleproject.Com/Groups/warm-voucher-ideas-to-help-you-save-cash

    ReplyDelete
  6. at institution, fascinate out all of your communication is not a skilful computer code.

    You change to do your problem solving so that you don't care it wasn't successful in the prospective.
    Doing so ordain verify that the emptor and vender. Although you may acquire reusable.
    Also, create them credit that line Michael Kors Outlet Michael Kors Outlet Michael Kors Outlet Online michael kors outlet Michael Kors Handbags Michael kors handbags michael Kors outlet Online Michael Kors Outlet Michael Kors Outlet
    Michael Kors Handbags Michael Kors Handbag Michael Kors Outlet Online Michael Kors Outlet Michael Kors Outlet Online Michael Kors Outlet
    at pass on. Do not subfigure out and gives you a lot of your fall in commerce niches because there are more or less fruits and vegetables you put out,
    your muscles are an athlete or wealthy person a elated ratio of customers without it animate thing open. When hoi polloi up with
    awith full variety

    ReplyDelete