\documentclass[11pt]{article} % preamble \usepackage{a4wide} \usepackage{mathptm} \usepackage{epsfig} \usepackage{alltt} \usepackage{helvet} \makeatletter \def\verbatim@font{\normalfont\sffamily\fontencoding{T1}\selectfont \let\do\do@noligs \verbatim@nolig@list} \renewcommand{\ttdefault}{\sfdefault} \usepackage{url} \def\UrlFont{\sf} \setcounter{secnumdepth}{4} \newcommand{\DAN}[1]{[{\it {#1}}] \\ } \bibliographystyle{acm} \begin{document} \begin{tabular}{ll} Document: & {\bf I-Share Design Note} \\ Subject: & {\bf Bloom filter applications in BitTorrent+ protocol} \\ Author: & {\bf Jelle Roozenburg} \\ Date: & {\bf \today} \\ Status: & {\bf\it DRAFT} \\ \end{tabular} \section{Bloom filter optimization in Preference Exchange and Metadata requests} Preference exchange is one of the new features of the BitTorrent+ protocol described in the I-Share Design Note. Currently, the exchange of preference lists is done in raw format (i.e. by sending a list of concatenated 20 byte SHA1 values of .torrents). This is also the case in the metadata request. This document describes a Bloom filter solution and calculates the bandwidth reduction. \subsection{The current preference exchange} The current preference exchange (PREFERENCE\_EXCHANGE) is described in Section 4.1 of the I-Share design note. Note that the \emph{my\_preference} field in the PREFERENCE\_EXCHANGE message consists of concatenated 20 byte SHA1 hash values, which consume most bandwidth. \subsection{Preference exchange using Bloom filters} The most important issue one has to address when using Bloom filters is that the receiving peer can not simply extract elements from the received filter. It can only test the membership of elements \emph{that it already knows} in the Bloom filter and conclude that these elements were included in the filter. The elements that it did not know can not be extracted. \subsubsection{The new protocol} The \emph{my\_preference} field in the original protocol is now replaces by: \begin{itemize} \item my\_preference\_bloom - dictionary storing a Bloom filter of 20-byte SHA1 hash values, one per prefered file. The maximum length is 1000, minimal length is zero. \begin{itemize} \item num\_elements - the number of values in the Bloom filter (integer) \item bloom\_length - the number of bits of the Bloom filter (integer). Is alway a multiple of 8. \item num\_hash - the number of hash values (integer) \item data - string consisting of the bit array of the Bloom filter. \end{itemize} \end{itemize} The flexibility of the Bloom filter solution is that we can now choose a trade-off between the size of the filter and its error probability. If we assume that we want to use 16 bits/hash value (creating a Bloom filter of maximal 2 KByte), the error rate will be $4.59\cdot 10^{-4}$. This means that if a receiving peer will test the membership of his 1000 SHA1 hash values, it will find 0.459 hash values to be in the filter, which actually are not. So, after half of the preference exchanges, a receiving peer will include one of its own SHA1 values in the list of the sender. If this has too much influence on the BuddyCast protocol, the size of the Bloom filter has to be chosen bigger. Note that with this error, the transmitted filter has becomde 10 times smaller than it was in the old implementation! For this reason, more preference exchanges can be done using the same bandwidth usage and the recommendation will become better. \label{pref_exchange} The problem is that the receiving peer can only make out the SHA1 hash values of the files it has itself. The other SHA1 values cannot be retrieved. With the old protocol, these SHA1 values were received, but the peer needed to do a metadata request to get the rest of the file data. We will do the same metadata request to retrieve both the SHA1 values and the rest of the file data. In this situation both peers need to remember which list of SHA1 hash values they transmitted and which they received. So only in a full exchange of preference lists, Bloom filters can easily be used. When only a preference request would be done, the responding peer would have no clue which of the sent SHA1 values the requesting peer has and which it has not. This is however not the case. The privacy of the peers is also improved with this new approach. After receiving the PREFERENCE\_EXCHANGE message, both peers could decide not to send a GET\_METADATA message, when they think the other peer is not similar enough to share information with. Conclusion: What we did is replacing the raw lists of SHA1 hash values for Bloom filters. This will result in enormous bandwidth reduction for both peers, at the price of a small error and the fact that SHA1 values unknown to any of the peers, still have to be sent during the METADATA response. So the SHA1 values that both peers know don't have to be exchanged. Also the privacy is improved. \subsection{The current metadata exchange} Metadata request (GET\_METADATA) and response (METADATA) is described in Section 4.2. Note that the \emph{SHA1} field in GET\_METADATA message consists of concatenated 20 byte SHA1 hash values, which consume all bandwidth. \subsection{Metadata exchange using Bloom filters} \subsubsection{Metadata request} The \emph{SHA1} field in the GET\_METADATA message is replaced by: \begin{itemize} \item use\_known\_content - (boolean) if true, then the requesting peer want metadata information of all files that were in the preference list of the receiver, but not in the preference list of the sender. If false, then the \emph{request\_bloom} field is sent. \item request\_bloom - dictionary consisting of a Bloom filter storing the SHA1 values that the sending peer already has. So these are the SHA1 values that the receiver does not have to send! If the \emph{use\_known\_content} field is true, this field is not sent. \begin{itemize} \item num\_elements - the number of values in the Bloom filter (integer) \item bloom\_length - the number of bits of the Bloom filter (integer). Is alway a multiple of 8. \item num\_hash - the number of hash values (integer) \item data - string consisting of the bit array of the Bloom filter. \end{itemize} \end{itemize} Normally, the requesting peer will request the metadata information of all files that were not in his preference list, but were in the preference list of the receiver. However, if the requesting peer owns more SHA1 values/metadata than could fit in his preference list, it can add these files to the Bloom filter containing his preference list and send it in this request. This will result in bandwidth reduction when the amount of metadata information that will not be transmitted is bigger than the size of this Bloom filter. If we assume that the Bloom filter uses 2 bytes per SHA1 value and the metadata information size of a torrent is 200 bytes, then a peer must only send this \emph{bloom\_request} field when the number of SHA1 values that it does not wish to receive is more that 1\% of his preference-list size. This design of the GET\_METADATA message makes it impossible to request information about random SHA1 values of content. It is specifically designed assuming that it follows a preference exchange phase. If both has to be possible, a good solution is to make both a GET\_METADATA and a GET\_PREFERENCE\_METADATA message. \subsubsection{Metadata response} As said in Section \ref{pref_exchange}, the metadata response has to include (for each requested content element) the SHA1 hash value, because it is not included explicitely in the Bloom metadata request anymore. So we have to add the field: \begin{itemize} \item SHA1 - the 20-byte SHA1 hash value of the torrent. \end{itemize} \subsection{Bandwidth reduction} In this Section we calculate the bandwidth reduction of different preference and metadata exchanges compared to the old protocol. Assume: \begin{itemize} \item Both peers have a preference list of 1,000 SHA1 values. \item The taste\_buddies and random\_peers fields are omitted in the preference exchange. \item The size of the intersection of both preference lists is 400 values (40\% similar). \item Both peers have no more SHA1 values or metadata than the 1000 items in their preference lists. \item The metadata of one .torrent is 200 Bytes (220 Bytes including its SHA1 hash values). \item All Bloom filters use 16 bits/element (error probability of $4.59\cdot 10^{-4}$). \end{itemize} We will compare the total bandwidth of preference exchange and metadata exchange of unknown files. \subsubsection{Current Protocol} \begin{tabular}{ll} The size of the PREFERENCE\_EXCHANGE message of peer1 & 20,000 Bytes \\ The size of the PREFERENCE\_EXCHANGE message of peer2 & 20,000 Bytes \\ The size of the GET\_METADATA message of peer1 & 12,000 Bytes \\ The size of the GET\_METADATA message of peer2 & 12,000 Bytes \\ The size of the METADATA respons of peer2 & 120,000 Bytes \\ The size of the METADATA respons of peer1 & 120,000 Bytes \\ Total bandwidth usage: & 304,000 Bytes \\ \end{tabular} This is quite a lot for one exchange. In the design document, it says that such an exchange is done every 20 seconds. This means a bandwidth usage of 15 KB/s overhead of only the BuddyCast. \subsubsection{Bloom filter protocol} \begin{tabular}{ll} The size of the PREFERENCE\_EXCHANGE message of peer1 & 2,000 Bytes \\ The size of the PREFERENCE\_EXCHANGE message of peer2 & 2,000 Bytes \\ The size of the GET\_METADATA message of peer1 & 1 Bytes \\ The size of the GET\_METADATA message of peer2 & 1 Bytes \\ The size of the METADATA respons of peer2 & 132,000 Bytes \\ The size of the METADATA respons of peer1 & 132,000 Bytes \\ Total bandwidth usage: & 268,002 Bytes \\ \end{tabular} The reduction is only 12\%. This is mostly due to the fact that most bandwidth is used by sending/receiving the metadata. The bandwidth usage of the rest of the messages was reduces with 90\%. \subsection{Future work} Also the SHA1 hash values of the taste\_buddy field could be put in a Bloom filter. \end{document}