DistribNet
(Draft)

Kevin Atkinson (kevin at atkinson dhs org)

Project Page: http://distribnet.sourceforge.net/
Mailing list: http://lists.sourceforge.net/lists/listinfo/distribnet-devel

Abstract:

DistribNet is a global peer-to-peer Internet file system in which anyone can tap into or add content to. This paper gives an overview of DistribNet.

1 Overview

NOTE

This paper was initially written in response to my dislike of Freenet and similar network that focus on complete anonymity. Thus a large deal of the comments are directed towards that community. My ultimate goal is to design a general purpose p2p network and not just something to replace of Freenet. In fact in some ways DistribNet won't replace Freenet due to the anonymity issue. I plan on eventually modifying this paper accordantly to address the general p2p (and web) community. For now please keep the intended audience in mind when reading this section.

1.1 Main Goal

1.2 (Possibly Impossible) Goals

1.3 Applications

I would like the protocol to be able to effectually support (ie with out any ugly hacks that many of the application for Freenet use)

And maybe:

1.4 Anti-Goals

Also see philosophy for why I don't find these issues that important

1.5 Philosophy

2 DistribNet Architecture

Two types of keys: Map and Data keys. Maps keys are hashed based on there identification and can be updated, Data keys are hashed based on there content and consequently can not be updated.

There will be three type of storage of keys, Permanent, Cache, and Pointers. Permanent keys will be used to ensure the available of content, the cache will be used exactly like a typical cache will be used, and pointers will be used to be be able to find content.

Map keys will be routed based on the SHA-1 hash on the identification using a Pastry[4] like system. Data are not routed and will be stored based on where they are retrieve. Map keys will be used to be able to find data keys.

2.1 Key Types

There will essentially be two types of keys. Map keys and data keys. Map keys will be uniquely identified in a similar manner as freenet SSK keys. Data keys will be identified in a similar manner as freenet's CHK keys.

Map keys will contain the following information:

At any given point in time each map key will only be associated with one index pointer and one data pointer. Map keys can be updated by appending a new index or data pointer to the existing list. By default, when a map key is queried only the most recent pointer will be returned. However, older pointers are still there and may be retrieved by specifying a specific date. Thus, map keys may be updated, but information is never lost or overwritten.

Data keys will be very much like freenet's CHK keys except that they will not be encrypted. Since they are not encrypted delta compression may be used to save space.

There will not be anything like freenet's KSK keys as those proved to be completely insure. Instead Map keys may be requested with out a signature. If there is more than one map key by that name than a list of keys is presented sorted by popularity. To make such a list meaning full every public key in freenet will have a descriptive string associated with it.

2.1.1 Data Key Details

Data keys will be stored in maximum size blocks of just under 32K. If an object is larger than 32K it will be broken down into smaller size chunks and an index block, also with a maximum size of about 32K, will be created so that the final object can be reassembled. If an object is too big to be indexed by one index block the index blocks themselves will be split up. This can be done as many times as necessary therefore providing the ability to store files of arbitrary size. DistribNet will use 64 bit integers to store the file size therefore supporting file sizes up to 2^64-1 bytes.

Data keys will be retrieved by blocks rather than all at once. When a client first requests a data key that is too large to fit in a block an index block will be returned. It is then up the client to figure out how to retrieve the individual blocks.

Please note that even though that blocks are retrieved individually they are not treated as truly independent keys by the nodes. For example a node can be asked which blocks it has based on a given index block rather than having to ask for each and every data block. Also, nodes maintain persistent connections so that blocks can be retrieved one after another without having to re-establish to connection each time.

Data and index blocks will be indexed based on the SHA-1 hash of there contents. The exact numbers of as follows:

\begin{figure*}\begin{center}\begin{tabular}{\vert l\vert l\vert}
\hline
Data B...
...bytes~~~~~~~~~~~~~~~~~~~\end{center}\end{ttfamily} \end{list} \par
\end{figure*}

Why 32640?

A block size of just under 32K was chosen because I wanted a size which will allow most text files to fit in one block, most other files with one level of indexing, and just about anything anybody would think of transferring on a public network in two levels and 32K worked out perfectly. Also, files around 32K are rather rare therefor preventing a lot of of unnecessary splitting of files that don't quite make it. 32640 rather than exactly 32K was chosen to allow some additional information to be transfered with the block without pushing the total size over 32K. 32640 can also be stored nicely in a 16 bit integer without having to worry if its signed or unsigned.

However, the exact block size is not fixed in stone. If, at a latter date, a different block size is deemed to be more appropriate than this number can be changed.

2.2 Storage

Permanent keys will be distributed essentially randomly. However, to insure availability the network will insure at least N nodes contain the data. Nodes which are responsible for maintaining a permanent key will know about all the other nodes on the network with are also responsible for that key. From time to time it will check up on the other nodes to make sure they are still live and if less than N-1 other nodes are live it will pick another node to to ask to maintain a copy of the key. It will first try nodes which already have the key in its cache and if they all refuse or none of them do. It will chose a random node to ask and will keep trying until some node accepts or one the original nodes becomes live again.

Cached keys will be DistribNet based on where it will do the most good performance wise. How cache keys will be managed is still undecided. For the first implementation it will likely be stored on the nodes which have previously requested the key.

Pointer keys will basically be distributed based on the numeric distance of the hash of the key from the hash of the node's identification. Since pointer keys contain very little data they will be an extremely large amount of redundancy. Pointer keys will contain two types of pointers. Pointers to permanent keys and pointers to permanent keys. Pointer keys on different nodes will all contain the same permanent pointers but will only contain pointers to cached keys to nodes which or near by. There will be an upper limit to the number of pointers within an pointer key any one node will have.

3 DistribNet Routing

The routing algorithm for DistribNet was originally based on Pastry[4], but several key modifications were made. It is now probably closer to Kademlia[5] than Pastry, however it still maintains several elements of Pastry. This section will assumes the reader is family with Pastry and will focus on how it differers from such.

Each node on DistribNet is uniquely identifies by the 160-bit SHA-1 hash of the public key. Since SHA-1 hashes are used the nodes will be evenly distributed. (Note: For efficiency reasons I am strongly considering only using the first 64 bits of the key and not allow nodes with duplicate keys to join the network.) Keys in DistribNet are stored based solely the XOR metric in the same manor as is done in Kademlia. In the XOR metric distance is determined by taking the exclusive or of the keys and then treating the resulting value of an integer in big endian format. For example the keys 0xF0 and 0x1F will have a distance of 0xDF. The XOR metric is used from start to finish, unlike with Pastry which switches to using numerical distance for the last hop.

The routing table in DistribNet contains 8 rows with each row containing 2^4 entries each. In general, DistribNet tries to maintain at least two nodes for each entry. The number of rows does not need to be fixed and it can change based on the network size. It may also be possible that the number of entries per row does not necessarily have to be fixed. However, This idea has not been exported in depth. Four was chosen as the base size for several reasons 1) it is a power of two, 2) when keys are thought of as a sequence of digits a base size of 4 means that the digits will be hexadecimal, 3) the Pastry paper hinted that 4 would be a good choice. The number of rows was chosen to be large enough so that there is no possibility that the last row will be used when dealing with a moderate size network during testing.

Unlike Pastry their is no real Leaf set. Instead the ``leaf set'' consists of all rows which are not ``full''. A full row is a row which contains 15 full entries with am extra empty entry being the one which represents the common digit with the node's key, and thus will never be used. Not having a true ``leaf set'' simplifies the implementation since a separate list does not need to be maintained and the routing algorithm remains the same instead of switching to numerical distance as Pastry does. This also means that all the nodes in the leaf set will maintain the same set. I have not determined if this is a good or bad thing. It also has not been determined if maintaining any sort of leaf set at all is necessary, as Kademlia does not. I, however, believe that maintaining a leaf set will lead to a more robust network that is less sensitive to network failures.

A row is considered full in DistribNet if 15 of the 16 entries are full in the current node AND other nodes on that row also have 15 of the 16 entries full. For each full row DistribNet will try to maintain at least two nodes for each entry. This way if one node goes down the other one can be used without effecting performance. When a node is determined to be down (as oppose to being momentary offline) DistribNet will try to replace it with another node that is up. With this arraignment is is extremely likely that at least one on the two nodes will be available. A full row can become a leaf row if the entry count drops below 15.

For each non-full row (ie in the Leaf Set) DistribNet will attempt to maintain as many nodes as are available for that entry so that every other node in the leaf set is accounted for. From time to time DistribNet will contact another node in the leaf set and synchronize its leaf set with it. This is possible because all nodes in the leaf set will have the same set. Down nodes in the leaf set will be removed, but the criteria for when a node is down for a leaf set is stricter than the criteria for a full row. If a leaf row becomes a full row than excess nodes will be removed.

DistribNet also maintains an accurate estimate on the number of nodes that are on the network. This is possible because unlike with network such as freenet, all nodes are accounted for.

4 Cache Consistency

Maintaining cache consistency is a difficult problem for any network. Some networks, such as Freenet, avoid the issue all together by not having mutable keys. Other networks only keep cached copied of data around for a short time span, therefor reducing the chance of the cache data being out of date. Once a cached copy gets out of date the node either throws the copy away, or checks to see if a newer copy is available. Either approach creates unnecessary network traffic. Another approach is to have the server notify other nodes when ever the data changes. This approach will not scale well as the number of nodes a server needs to keep track of will grow with the network and is completely impractical for a large distributed network. A similar, but more scalable, approach is for nodes on the network to notify each other when ever a key changes. This is the approach that DistribNet uses.

The other issue that needs to be dealt with when is conflicts. That is when two different people modify the same key at nearly the same time. DistribNet avoids this problem by only allowing keys to be added to.

5 Retrieval of Data keys

When a node A wants to retrieve a key K either two things will happen. If it has good reason to believe that a nearby node has the key it will attempt to retrieve it from that node, otherwise it will send a request to find other nodes which have the key.

To do this, node A will contact a node, B, whose key is closer to K than node's A key. Node B which will in tern contact C etc, until an answer is found which for the sake of argument will be node E. Node E will then send a list of possible nodes L which contain key K directly to node A. Node E will then send the result to node D, which will send it to C, etc. Node E will also add node A to list L with probability of say 10%, Node D will do the same but with a probability of say 25%, etc. This will avoid the problem having the list L becomes extremely large for popular data but allow nodes close to A to discover that A has the data since nodes close to A will likely contact the same nodes that A tried. Since A requested the location of key K it is assumed that K will will likely download the data. If this assumption is false than node A will simply be removed at the list latter on.

Once A retrieves the list it will pick a node from the list L based on some evaluation function, lets say it picks node X. Node X will then return the key to node A. The evaluation function will take several factors, into account, including distance, download speed, past reputation, and if node A even knows anything about the new node.

If node X does not send the key back to node A for what ever reason it will remove node X from the list and try again. It will also send this information to node B so it can consider removing node X from its list, it will then in term notify node C of the information, etc. If the key is an index block it will also send information of what parts of the complete key node X has. If the key is not an index block than node a is done.

If the key is an index block than node A will start downloading the sub-blocks of key K that node X has. At the same time, if the key is large or node X does not contain all the sub-blocks of K, node X will chose another node from the list to contact, and possible other nodes depending on the size of the file. It will then download other parts of the file from the other nodes. Which blocks are download from which nodes will chance based on the download speed of the nodes so that more blocks are download from faster nodes and less from slower, thus allowing the data to be transfered in the least amount of time. If after contacting a certain number of nodes there are still parts of the key that are not available on any of those nodes, node A will perform a separate query for the individual blocks. However, I image, in practice this will rarely be necessary.

5.1 Distance determination

One very course estimate for node distance would be to use the XOR distance between two nodes ip address since closer nodes are likely to share the same gateways and nodes really close are likely to be on the same subnet.

Another way to estimate node distance releases on the the fact that node distance, for the most part, obeys the triangle inequality. For each node in the list of candidate nodes some information about the estimated distance between that node, node E, in the list and the node storing the list is maintained by some means. For node A to estimate the distance between a node on the list, node X, and itself all it has to do is combine the distance between it and E with the distance between E and X. The combination function will depend on the aspect of distance that is being measured. For the number of hops it will simply add them, for download speed it will take the maximum, etc.

6 Limitations

Because they is no indirection when retrieving data most of the data on any particular node would be data that a local node user requested at some point in time. This means that it is fairly easy to tell what which keys a particular user requested. Although complete anonymity for the browser is one of my anti-goals this is going a bit to far. One solution for this is to do something similar that GNUNet does which is described in [3].

It is also blatantly obvious which nodes have which keys. Although I do not see this as a major problem, especially if a solution for the first problem is found, it is something to consider. I will be more than happy to entertain solutions to this problem, provided that it doesn't effect effectually that much.

7 Implementation Details

An implementation for DistribNet is available at http://distribnet.sourceforge.net/.

7.1 Physical Storage

Blocks are currently stored in one of three ways

  1. block smaller than a fixed threshold (currently 1k) are stored using Berkeley DB (version 3.3 or better).
  2. blocks larger than the threshold are stored as files. The primary reason for doing this is to avoid limiting the size of data store by the maximum size of a file which is often 2 or 4 GB on most 32-bit systems.
  3. blocks are not stored at all, instead they are linked to an external file out side of the data store much like a symbolic link links to file out side of the current directory. However since blocks often only represent part of the file the offset is also stored as part of the link. These links are stored in the same database that small blocks are stored in. Since the external file can easily be changed by the user, the SHA-1 hashes will be recomputed when the file modification data changes. If the SHA-1 hash of the block differs all the links to the file will be thrown out and the file will be relinked. (This part is not implemented yet).
Most of the code for the data keys can be found in data_key.cpp

7.2 Language

DistribNet is/will be written in fairly modern C++. It will use several external libraries however it will not use any C++ specific libraries. In particular I have no plan to use any sort of Abstraction library for POSIX functionally. Instead thin wrapper classes will be used which I have complete control over and will serve mainly to make the process of using POSIX functions less tedious rather than abstract away the details of using them.

Bibliography

1
GNUNet. http://www.ovmj.org/GNUnet/ and http://www.gnu.org/software/GNUnet/

2
Freenet. http://freenet.sourceforge.net/

3
Krista Bennett and Christian Grothoff. ``GNUnet - anonymity for free''. http://gecko.cs.purdue.edu/GNUnet/papers.php3

4
Antony Rowstron and Peter Druschel. ``Pastry: Scalable, decentralized object location and routing for large-scale peer-to-peer systems''. http://research.microsoft.com/~antr/Pastry/pubs.htm

5
Kademlia: XOR metric-based routing. http://kademlia.scs.cs.nyu.edu/



Kevin Atkinson 2003-05-14