Neo4j Internal


Neo4j中数据是如何存储的

电子书

GraphDatabase

存储结构

现在让我们来看看数据在Neo4j中是如何存储的:

.db是主存储文件,不同存储类型(node,relation,label,property)对应会有,例如neostore.propertystore.db是属性的主存储文件

.id主要保存最大ID(数组长度)可复用ID(删除过),节点ID对应存储文件数组下标,例如neostore.propertystore.db.id是属性的主存储文件

另外存储文件包含层次结构,一般几个文件配合完成一种类型存储
neostore.propertystore.db是属性的主存储文件
neostore.propertystore.db.index存储属性key的id(key编码保存)
neostore.propertystore.db.index.keys存储id相关的字符串名,属性value如果不是基本型(string,array),对应分别存储到neostore.propertystore.db.strings和neostore.propertystore.db.arrays中

首先看看节点Node文件

主文件neostore.nodestore.db的格式(早期版本,后期版本node_id,relationship_id,property_id高位会保存在in_use字节):

Node:in_use(byte)+next_rel_id(int)+next_prop_id(int)+labels(5)+extra(byte)

每一位的具体意义如下:

另外neostore.nodestore.db.labels存储节点label信息

再看看Relation格式:

in_use+first_node+second_node+rel_type+first_prev_rel_id+first_next_rel_id+second_prev_rel_id+second_next_rel_id+next_prop_id

在这个例子中,A~E表示Node 的编号,R1~R7 表示 Relationship 编号,P1~P10 表示Property 的编号。

Node 的存储示例图如下,每个Node 保存了第1个Property 和 第1个Relationship:

关系的存储示意图如下:

下面举一个图的遍历的例子:

最后结果如下:

(1)–[KNOWS,2]–>(4)
(1)–[KNOWS,1]–>(3)
(1)–[KNOWS,0]–>(2)
(1)–[KNOWS,1]–>(3)–[KNOWS,5]–>(7)
(1)–[KNOWS,1]–>(3)–[KNOWS,4]–>(6)
(1)–[KNOWS,1]–>(3)–[KNOWS,3]–>(5)

执行计划

Using the Neo4j web-based console, we can get the results of each query in the detailed query results.

Graph Setup

create (Neo:Crew {name:'Neo'}), (Morpheus:Crew {name: 'Morpheus'}), (Trinity:Crew {name: 'Trinity'}), (Cypher:Crew:Matrix {name: 'Cypher'}), (Smith:Matrix {name: 'Agent Smith'}), (Architect:Matrix {name:'The Architect'}),
(Neo)-[:KNOWS]->(Morpheus), (Neo)-[:LOVES]->(Trinity), (Morpheus)-[:KNOWS]->(Trinity),
(Morpheus)-[:KNOWS]->(Cypher), (Cypher)-[:KNOWS]->(Smith), (Smith)-[:CODED_BY]->(Architect)

run a slightly more complex Cypher query, the execution plan shows the piping of operations.

 // Find Kenny's friends
MATCH (kenny:Person {name:"Kenny"})-[:FRIEND_OF]-(friends)
RETURN friends
Query Results


+-----------------------+
| friends               |
+-----------------------+
| Node[1]{name:"Adam"}  |
| Node[2]{name:"Greta"} |
+-----------------------+
2 rows
14 ms


Execution Plan


ColumnFilter
  |
  +Filter
    |
    +TraversalMatcher

+------------------+------+--------+-------------+--------------------------------------------+
|         Operator | Rows | DbHits | Identifiers |                                      Other |
+------------------+------+--------+-------------+--------------------------------------------+
|     ColumnFilter |    2 |      0 |             |                       keep columns friends |
|           Filter |    2 |     12 |             | Property(kenny,name(0)) == {  AUTOSTRING0} |
| TraversalMatcher |    6 |     13 |             |              friends,   UNNAMED37, friends |
+------------------+------+--------+-------------+--------------------------------------------+

The execution plan for the query shows 3 operators: ColumnFilter, Filter, and TraversalMatcher.
The ColumnFilter operation receives a row of data from the Filter operation and processes it by keeping only the identifier “friends”, which is in the RETURN statement.
The Filter operation also receives rows of data from its preceding operation, the TraversalMatcher, and applies a predicate to decide whether to pass that data row along to the next operation (the ColumnFilter) or to discard it.
In the case of our query, the predicate is to filter nodes by applying the criteria for the identifier “kenny” with the property “name” that equals AUTOSTRING0, which will resolve to the token “Kenny” when the plan is executed.
Finally the TraversalMatcher doesn’t receive any rows of data, due to being the first operation, but generates new rows of data by searching the graph for the pattern specified in the MATCH clause.
So you can see that the execution plan, as constructed from the Cypher query, operates bottom up. Patterns are found in the TraversalMatcher (a total of 6 rows), and then passed through the Filter, which only allows 2 through, and finally to the ColumnFilter, which only keeps the “friends” column specified in the RETURN clause.
For detailed query tuning options, check the tuning guide