Streaming algorithm
In computer science, streaming algorithms are algorithms for processing data streams in which the input is presented as a sequence of items and can be examined in only a few passes (typically just one). In most models, these algorithms have access to limited memory (generally logarithmic in the size of and/or the maximum value in the stream). They may also have limited processing time per item.
These constraints may mean that an algorithm produces an approximate answer based on a summary or "sketch" of the data stream in memory.
History
Though streaming algorithms had already been studied by Munro and
Paterson[1] as early as 1980, as well as Philippe Flajolet and
G. Nigel Martin in 1982/83,[2] the field of streaming
algorithms was first formalized and popularized in a 1996 paper by Noga Alon,
Yossi Matias, and Mario Szegedy.[3]
For this paper, the authors later won the Gödel Prize in 2005 "for their foundational
contribution to streaming algorithms." There has since been a large body
of work centered around data streaming algorithms that spans a diverse
spectrum of computer science fields such as theory, databases, networking,
and natural language processing.
Semi-streaming algorithms, introduced in 2005, extend streaming algorithms by allowing a constant or logarithmic number of passes over the dataset.[1]
Models
Data stream model
In the data stream model, some or all of the input is represented as a finite sequence of integers (from some finite domain) which is generally not available for random access, but instead arrives one at a time in a "stream".[4] If the stream has length n and the domain has size m, algorithms are generally constrained to use space that is logarithmic in m and n. They can generally make only some small constant number of passes over the stream, sometimes just one.[5]
Turnstile and cash register models
Much of the streaming literature is concerned with computing statistics on frequency distributions that are too large to be stored. For this class of problems, there is a vector $\mathbf{a} = (a_1, \dots, a_n)$ (initialized to the zero vector $\mathbf{0}$) that has updates presented to it in a stream. The goal of these algorithms is to compute functions of $\mathbf{a}$ using considerably less space than it would take to represent $\mathbf{a}$ precisely. There are two common models for updating such streams, called the "cash register" and "turnstile" models.[6]
In the cash register model, each update is of the form $\langle i, c \rangle$, so that $a_i$ is incremented by some positive integer $c$. A notable special case is when $c = 1$ (only unit insertions are permitted).
In the turnstile model, each update is of the form $\langle i, c \rangle$, so that $a_i$ is incremented by some (possibly negative) integer $c$. In the "strict turnstile" model, no $a_i$ may ever be less than zero.
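As a minimal illustrative sketch (not part of the original formulation), the two update models can be contrasted as follows; the vector $\mathbf{a}$ is kept explicitly here only for clarity, whereas a real streaming algorithm would maintain a small sketch of it instead:

```python
from collections import defaultdict

def apply_updates(stream, strict_turnstile=False):
    """Apply a stream of <i, c> updates to the implicit vector a.

    Cash register model: every c is positive (unit insertions when c = 1).
    Turnstile model: c may be negative.
    Strict turnstile model: no a_i may ever drop below zero.
    """
    a = defaultdict(int)  # a initialized to the zero vector
    for i, c in stream:
        a[i] += c
        if strict_turnstile and a[i] < 0:
            raise ValueError("strict turnstile model violated at index %d" % i)
    return dict(a)

# Cash register with unit insertions only:
print(apply_updates([(0, 1), (2, 1), (0, 1)]))  # {0: 2, 2: 1}
```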
Sliding window model
Several papers also consider the "sliding window" model.[citation needed] In this model, the function of interest is computed over a fixed-size window in the stream. As the stream progresses, items from the end of the window are removed from consideration while new items from the stream take their place.
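The window semantics can be sketched as follows; this toy version stores the entire window explicitly, whereas sublinear-space sliding-window algorithms (e.g. exponential histograms) only approximate such quantities:

```python
from collections import deque

def sliding_window_sums(stream, w):
    """Exact running sum over the last w items of the stream."""
    window, total, out = deque(), 0, []
    for x in stream:
        window.append(x)
        total += x
        if len(window) > w:          # oldest item leaves the window
            total -= window.popleft()
        out.append(total)
    return out

print(sliding_window_sums([1, 2, 3, 4, 5], w=3))  # [1, 3, 6, 9, 12]
```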
Besides the above frequency-based problems, some other types of problems
have also been studied. Many graph problems are solved in the setting
where the adjacency matrix or the adjacency list of the graph is streamed in
some unknown order. There are also some problems that are very dependent
on the order of the stream (i.e., asymmetric functions), such as counting
the number of inversions in a stream and finding the longest increasing
subsequence.
Evaluation
The performance of an algorithm that operates on data streams is measured by three basic factors:
- The number of passes the algorithm must make over the stream.
- The available memory.
- The running time of the algorithm.
These algorithms have many similarities with online algorithms since they both require decisions to be made before all data are available, but they are not identical. Data stream algorithms only have limited memory available but they may be able to defer action until a group of points arrive, while online algorithms are required to take action as soon as each point arrives.
If the algorithm is an approximation algorithm then the accuracy of the answer is another key factor. The accuracy is often stated as an $(\epsilon, \delta)$-approximation, meaning that the algorithm achieves an error of less than $\epsilon$ with probability $1 - \delta$.
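The $\delta$ parameter is commonly driven down by running independent copies of a basic estimator and taking the median; the noisy estimator below is a hypothetical stand-in used only to illustrate the idea:

```python
import random
import statistics

def median_of_estimates(estimator, runs):
    """Run independent copies of a basic estimator and return the median.

    If each run is within eps-error except with probability < 1/3, the
    median of O(log(1/delta)) runs fails with probability at most delta.
    """
    return statistics.median(estimator() for _ in range(runs))

# Hypothetical noisy estimator of a true value of 100: most runs land
# within 5%, but roughly one run in three is wildly off.
def noisy_estimate():
    if random.random() < 1 / 3:
        return 100.0 * random.choice([0.1, 10.0])  # bad run
    return 100.0 * random.uniform(0.95, 1.05)      # good run

random.seed(0)
print(median_of_estimates(noisy_estimate, runs=31))  # typically close to 100
```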
Applications
Streaming algorithms have several applications in networking such as
monitoring network links for elephant flows, counting the number of
distinct flows, estimating the distribution of flow sizes, and so
on.[7] They also have applications in
databases, such as estimating the size of a join[citation needed].
Some streaming problems
Frequency moments
The $k$th frequency moment of a set of frequencies $\mathbf{a}$ is defined as $F_k(\mathbf{a}) = \sum_{i=1}^{n} a_i^k$.
The first moment $F_1$ is simply the sum of the frequencies (i.e., the total count). The second moment $F_2$ is useful for computing statistical properties of the data, such as the Gini coefficient of variation. $F_\infty$ is defined as the frequency of the most frequent item(s).
The seminal paper of Alon, Matias, and Szegedy dealt with the
problem of estimating the frequency moments.
Calculating Frequency Moments
A direct approach to finding the frequency moments requires maintaining a register $m_i$ for every distinct element $a_i \in \{1, 2, 3, \dots, N\}$, which requires memory of order $\Omega(N)$.[3] Given the space limitations, an algorithm that computes the moments in much less memory is needed. This can be achieved by using approximations instead of exact values: an $(\epsilon, \delta)$-approximation algorithm computes a value $F'_k$ that is within relative error $\epsilon$ of $F_k$ with probability at least $1 - \delta$,[8] where $\epsilon$ is the approximation parameter and $\delta$ is the confidence parameter.[9]
Calculating F0 (Distinct Elements in a Data Stream)
FM-Sketch Algorithm
Flajolet et al.[2] introduced a probabilistic method of counting inspired by a paper of Robert Morris.[10] Morris showed that if the requirement of exact accuracy is dropped, a counter with value n can be replaced by a counter holding log n, which can be stored in log log n bits.[11] Flajolet et al.[2] improved this method by using a hash function h which is assumed to distribute elements uniformly in the hash space (binary strings of length L).
- $h : [m] \rightarrow [0, 2^L - 1]$
Let $\mathrm{bit}(y, k)$ denote the $k$th bit in the binary representation of $y$:
- $y = \sum_{k \geq 0} \mathrm{bit}(y, k) \cdot 2^k$
Let $\rho(y)$ denote the position of the least significant 1-bit in the binary representation of $y$, with a suitable convention for $\rho(0)$:
- $\rho(y) = \begin{cases} \min\{k : \mathrm{bit}(y,k) = 1\} & \text{if } y > 0 \\ L & \text{if } y = 0 \end{cases}$
Let A be a data stream of length M whose cardinality is to be determined. Let BITMAP[0...L − 1] be the hash space where the values $\rho(\text{hashed values})$ are recorded. The algorithm below then determines the approximate cardinality of A.
Procedure FM-Sketch:
for i in 0 to L − 1 do
    BITMAP[i] := 0
end for
for x in A do
    index := ρ(hash(x))
    if BITMAP[index] = 0 then
        BITMAP[index] := 1
    end if
end for
B := position of the leftmost 0-bit of BITMAP
return 2^B
If there are N distinct elements in the data stream, then:
- for $i \gg \log(N)$, BITMAP[i] is certainly 0;
- for $i \ll \log(N)$, BITMAP[i] is certainly 1;
- for $i \approx \log(N)$, BITMAP[i] is a fringe of 0s and 1s.
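A minimal runnable version of the procedure above, with MD5 used only as a stand-in for the uniform hash function h (note that the basic estimate 2^B is rough and uncorrected):

```python
import hashlib

L = 32  # length of the hash values in bits

def h(x):
    # MD5 as a stand-in for a uniform hash into [0, 2^L - 1]
    return int(hashlib.md5(str(x).encode()).hexdigest(), 16) % (1 << L)

def rho(y):
    """Position of the least significant 1-bit of y, with rho(0) = L."""
    return (y & -y).bit_length() - 1 if y > 0 else L

def fm_sketch(stream):
    bitmap = [0] * L
    for x in stream:
        bitmap[rho(h(x))] = 1
    b = bitmap.index(0)  # position of the leftmost 0-bit
    return 2 ** b        # rough estimate of the number of distinct elements

estimate = fm_sketch([1, 2, 3, 2, 1, 4, 5, 6, 7, 8])
```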
K-Minimum Value Algorithm
The previous algorithm describes the first attempt, by Flajolet and Martin, to approximate F0 in a data stream. Their algorithm picks a random hash function which they assume distributes the hash values uniformly in the hash space.
Bar-Yossef et al.[9] introduced the k-minimum value (KMV) algorithm for determining the number of distinct elements in a data stream. They use a similar hash function h, normalized to [0, 1], i.e. $h : [m] \rightarrow [0, 1]$, but fix a limit t on the number of values kept from the hash space. The value of t is assumed to be of the order $O\left(\frac{1}{\varepsilon^2}\right)$ (i.e., a smaller approximation error ε requires a larger t). The KMV algorithm keeps only the t smallest hash values. After all m values of the stream have arrived, the maximum of the kept values, $\upsilon = \mathrm{Max}(\mathrm{KMV})$, is used to calculate $F'_0 = \frac{t}{\upsilon}$. That is, in a close-to-uniform hash space, they expect at least t elements to have hash values less than $O\left(\frac{t}{F_0}\right)$.
Procedure 2 K-Minimum Value
Initialize KMV with the first t distinct hash values
for a in the remaining stream do
    if h(a) < Max(KMV) and h(a) not in KMV then
        Remove Max(KMV) from the KMV set
        Insert h(a) into KMV
    end if
end for
return t/Max(KMV)
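A minimal runnable version of the procedure, again with MD5 standing in for the normalized uniform hash; a plain set with max() is used for clarity even though it makes each update cost O(t):

```python
import hashlib

def h(x):
    # MD5 as a stand-in for a uniform hash normalized into [0, 1)
    return int(hashlib.md5(str(x).encode()).hexdigest(), 16) / 2.0**128

def kmv_estimate(stream, t=64):
    """Keep the t smallest distinct hash values; estimate F0 as t / Max(KMV)."""
    kept = set()
    for x in stream:
        v = h(x)
        if len(kept) < t:
            kept.add(v)                 # still filling the first t values
        elif v < max(kept) and v not in kept:
            kept.discard(max(kept))
            kept.add(v)
    if len(kept) < t:
        return len(kept)                # fewer than t distinct items: exact
    return t / max(kept)
```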
Complexity analysis of KMV
The KMV algorithm can be implemented in $O\left(\frac{1}{\varepsilon^2} \cdot \log(m)\right)$ bits of memory: each hash value requires $O(\log(m))$ bits, and there are $O\left(\frac{1}{\varepsilon^2}\right)$ hash values. The access time can be reduced by storing the t hash values in a binary tree, which reduces the time complexity per update to $O\left(\log\left(\frac{1}{\varepsilon}\right) \cdot \log(m)\right)$.
Calculating Fk
Alon et al.[3] estimate Fk by defining random variables that can be computed within the given space and time; the expected value of each random variable gives the approximate value of Fk. Assume the length of the sequence, m, is known in advance.
Construct a random variable X as follows:
- Select $a_p$, a random member of the sequence A at index $p$, with $a_p = l \in \{1, 2, 3, \ldots, n\}$.
- Let $r = |\{q : q \geq p,\; a_q = l\}|$ represent the number of occurrences of $l$ among the members of A at or after position $p$.
- Define the random variable $X = m(r^k - (r-1)^k)$.
Assume $S_1$ is of the order $O(n^{1-1/k}/\lambda^2)$ and $S_2$ is of the order $O(\log(1/\varepsilon))$. The algorithm takes $S_2$ random variables $Y_1, Y_2, \ldots, Y_{S_2}$ and outputs the median $Y$, where $Y_i$ is the average of $X_{ij}$ for $1 \leq j \leq S_1$.
Now calculate the expectation of the random variable, $E(X)$:
$$E(X) = \sum_{i=1}^{n} \sum_{j=1}^{m_i} \frac{1}{m}\, m\left(j^k-(j-1)^k\right) = \frac{m}{m}\Big[\big(1^k+(2^k-1^k)+\ldots+(m_1^k-(m_1-1)^k)\big) + \big(1^k+(2^k-1^k)+\ldots+(m_2^k-(m_2-1)^k)\big) + \ldots + \big(1^k+(2^k-1^k)+\ldots+(m_n^k-(m_n-1)^k)\big)\Big] = \sum_{i=1}^{n} m_i^k = F_k$$
Complexity of Fk
From the algorithm above, each random variable X stores the values of $a_p$ and $r$; computing X therefore requires only $\log(n)$ bits for $a_p$ and $\log(n)$ bits for $r$. The total number of random variables X maintained is $S_1 \cdot S_2$.
Hence the total space complexity of the algorithm is of the order $O\left(\frac{k\log\frac{1}{\varepsilon}}{\lambda^{2}}\, n^{1-\frac{1}{k}}\left(\log n+\log m\right)\right)$.
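The construction above can be sketched as follows; this is an offline illustration that stores the stream in a list, whereas the true one-pass algorithm samples the position p online (e.g. by reservoir sampling), and the parameters s1 and s2 are chosen small purely for the demo:

```python
import random
import statistics

def ams_fk(stream, k, s1, s2):
    """AMS-style estimator for F_k: median of s2 means of s1 basic estimates.

    Each basic estimate picks a uniformly random position p, counts the
    occurrences r of stream[p] at or after p, and outputs m(r^k - (r-1)^k).
    """
    m = len(stream)

    def basic_estimate():
        p = random.randrange(m)
        r = sum(1 for q in range(p, m) if stream[q] == stream[p])
        return m * (r**k - (r - 1)**k)

    return statistics.median(
        statistics.mean(basic_estimate() for _ in range(s1)) for _ in range(s2)
    )

# F_2 of [1, 2, 1, 3, 1, 2] is 3^2 + 2^2 + 1^2 = 14
random.seed(1)
estimate = ams_fk([1, 2, 1, 3, 1, 2], k=2, s1=200, s2=5)
```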
Simpler approach to calculate F2
The previous algorithm calculates $F_2$ using $O(\sqrt{n}(\log m + \log n))$ bits of memory. Alon et al.[3] simplified this algorithm using a four-wise independent random variable with values mapped to $\{-1, 1\}$.
This further reduces the complexity of calculating $F_2$ to $O\left(\frac{\log\frac{1}{\varepsilon}}{\lambda^{2}}\left(\log n + \log m\right)\right)$.
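A sketch of this F2 approach, using a random cubic polynomial over a prime field as a standard way to obtain (approximately) four-wise independent ±1 signs; the field size and parameter choices are illustrative assumptions:

```python
import random
import statistics

P = 2**31 - 1  # prime field size for the hash; an illustrative choice

def make_sign_hash():
    """Approximately four-wise independent +/-1 signs from a random cubic
    polynomial over GF(P), taking the low bit of the value as the sign."""
    a, b, c, d = (random.randrange(P) for _ in range(4))
    return lambda i: 1 if (a * i**3 + b * i**2 + c * i + d) % P & 1 else -1

def ams_f2(stream, s1=100, s2=5):
    """Estimate F_2 as the median of s2 means of s1 sketches Z^2, where
    Z = sum of s(i) over the stream can be maintained in one pass."""
    def one_sketch():
        s = make_sign_hash()
        z = sum(s(i) for i in stream)
        return z * z
    return statistics.median(
        statistics.mean(one_sketch() for _ in range(s1)) for _ in range(s2)
    )
```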
Frequent elements
In the data stream model, the frequent elements problem is to output a set of elements that constitute more than some fixed fraction of the stream. A special case is the majority problem, which is to determine whether or not any value constitutes a majority of the stream.
More formally, fix some positive constant c > 1, let the length of the stream be m, and let fi denote the frequency of value i in the stream. The frequent elements problem is to output the set { i : fi > m/c }.[12]
Some notable algorithms are:
- Boyer–Moore majority vote algorithm
- Karp-Papadimitriou-Shenker algorithm
- Count-Min sketch
- Sticky sampling
- Lossy counting
- Sample and Hold
- Multi-stage Bloom filters
- Count-sketch
- Sketch-guided sampling
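As one concrete member of this family, a minimal Misra-Gries summary (the generalization of the Boyer-Moore majority vote to c − 1 counters) can be sketched as follows; surviving counters may include false positives, which a second pass can prune:

```python
def misra_gries(stream, c):
    """One-pass frequent-elements summary using at most c - 1 counters.

    Every value with frequency > m/c is guaranteed to survive in the
    returned summary (its counter is a lower bound on its frequency)."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < c - 1:
            counters[x] = 1
        else:  # summary full: decrement every counter, dropping zeros
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

print(misra_gries([1, 1, 2, 1, 3, 1, 2], c=2))  # {1: 1}
```

With c = 2 this is exactly the Boyer-Moore majority vote: a single counter whose surviving key is the only possible majority element.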
Event detection
Detecting events in data streams is often done using a heavy hitters algorithm as listed above: the most frequent items and their frequency are determined using one of these algorithms, then the largest increase over the previous time point is reported as trend. This approach can be refined by using exponentially weighted moving averages and variance for normalization.[13]
Counting distinct elements
Counting the number of distinct elements in a stream (sometimes called the
F0 moment) is another problem that has been well studied.
The first algorithm for it was proposed by Flajolet and Martin. In 2010, D. Kane, J. Nelson and D. Woodruff found an asymptotically optimal algorithm for this problem.[14] It uses $O(\varepsilon^{-2} + \log d)$ space, with $O(1)$ worst-case update and reporting times, as well as universal hash functions and an $r$-wise independent hash family where $r = \Omega(\log(1/\varepsilon) / \log\log(1/\varepsilon))$.
Entropy
The (empirical) entropy of a set of frequencies $\mathbf{a}$ is defined as $H(\mathbf{a}) = -\sum_{i=1}^{n} \frac{a_i}{m} \log\frac{a_i}{m}$, where $m = \sum_{i=1}^{n} a_i$.
Estimation of this quantity in a stream has been done by:
- McGregor et al.[citation needed]
- Do Ba et al.[citation needed]
- Lall et al.[citation needed]
- Chakrabarti et al.[citation needed]
Online learning
The goal here is to learn a model (e.g. a classifier) in a single pass over a training set. Relevant techniques include:
- Feature hashing
- Stochastic gradient descent
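The two listed techniques combine naturally in a small single-pass sketch: logistic regression trained by SGD over hashed features. The hashing scheme (CRC32 buckets) and the learning rate are illustrative assumptions, not a method from the cited literature:

```python
import math
import zlib

def feature_hash(tokens, dim=64):
    """Feature hashing: map a bag of tokens into a fixed-size vector."""
    v = [0.0] * dim
    for tok in tokens:
        v[zlib.crc32(tok.encode()) % dim] += 1.0
    return v

def sgd_single_pass(examples, dim=64, lr=0.5):
    """One pass of logistic-regression SGD over a stream of (tokens, label)
    pairs, with labels in {0, 1}; each example is seen exactly once."""
    w = [0.0] * dim
    for tokens, y in examples:
        x = feature_hash(tokens, dim)
        z = sum(wi * xi for wi, xi in zip(w, x))
        p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of label 1
        g = p - y                       # gradient of the log loss w.r.t. z
        for j in range(dim):
            w[j] -= lr * g * x[j]
    return w
```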
Lower bounds
Lower bounds have been computed for many of the data streaming problems
that have been studied. By far, the most common technique for computing
these lower bounds has been using communication complexity.
See also
- Data stream mining
- Data stream clustering
- Online algorithm
- Stream processing
- Sequential algorithm
Notes
1. Munro & Paterson (1980)
2. Flajolet & Martin (1985)
3. Alon, Matias & Szegedy (1996)
4. Babcock, Brian; Babu, Shivnath; Datar, Mayur; Motwani, Rajeev; Widom, Jennifer (2002). "Models and Issues in Data Stream Systems". Proceedings of the Twenty-first ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. PODS '02. New York, NY, USA: ACM: 1–16. doi:10.1145/543613.543615. ISBN 1581135076.
5. Bar-Yossef, Ziv; Jayram, T. S.; Kumar, Ravi; Sivakumar, D.; Trevisan, Luca (2002). "Counting Distinct Elements in a Data Stream". Randomization and Approximation Techniques in Computer Science. Lecture Notes in Computer Science. Springer, Berlin, Heidelberg: 1–10. doi:10.1007/3-540-45726-7_1. ISBN 3540457267.
6. Gilbert et al. (2001)
7. Xu (2007)
8. Indyk, Piotr; Woodruff, David (2005). "Optimal Approximations of the Frequency Moments of Data Streams". Proceedings of the Thirty-seventh Annual ACM Symposium on Theory of Computing. STOC '05. New York, NY, USA: ACM: 202–208. doi:10.1145/1060590.1060621. ISBN 1-58113-960-8.
9. Bar-Yossef, Ziv; Jayram, T. S.; Kumar, Ravi; Sivakumar, D.; Trevisan, Luca (2002). Rolim, José D. P.; Vadhan, Salil, eds. Counting Distinct Elements in a Data Stream. Lecture Notes in Computer Science. Springer Berlin Heidelberg. pp. 1–10. ISBN 978-3-540-44147-2.
10. Morris (1978)
11. Flajolet, Philippe (1985). "Approximate counting: A detailed analysis". BIT Numerical Mathematics. 25 (1): 113–134. doi:10.1007/BF01934993. ISSN 0006-3835.
12. Cormode, Graham (2014). Kao, Ming-Yang, ed. Encyclopedia of Algorithms. Springer US. pp. 1–5. doi:10.1007/978-3-642-27848-8_572-1. ISBN 9783642278488.
13. Schubert, E.; Weiler, M.; Kriegel, H. P. (2014). SigniTrend: scalable detection of emerging topics in textual streams by hashed significance thresholds. Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '14. pp. 871–880. doi:10.1145/2623330.2623740. ISBN 9781450329569.
14. Kane, Nelson & Woodruff (2010)
References
Alon, Noga; Matias, Yossi; Szegedy, Mario (1999), "The space complexity of approximating the frequency moments", Journal of Computer and System Sciences, 58 (1): 137–147, doi:10.1006/jcss.1997.1545, ISSN 0022-0000 . First published as Alon, Noga; Matias, Yossi; Szegedy, Mario (1996), "The space complexity of approximating the frequency moments", Proceedings of the 28th ACM Symposium on Theory of Computing (STOC 1996), pp. 20–29, doi:10.1145/237814.237823, ISBN 0-89791-785-5 .
Babcock, Brian; Babu, Shivnath; Datar, Mayur; Motwani, Rajeev; Widom, Jennifer (2002), "Models and issues in data stream systems", Proceedings of the 21st ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS 2002) (PDF), pp. 1–16, doi:10.1145/543613.543615 .
Gilbert, A. C.; Kotidis, Y.; Muthukrishnan, S.; Strauss, M. J. (2001), "Surfing Wavelets on Streams: One-Pass Summaries for Approximate Aggregate Queries" (PDF), Proceedings of the International Conference on Very Large Data Bases: 79–88 .
Kane, Daniel M.; Nelson, Jelani; Woodruff, David P. (2010), An optimal algorithm for the distinct elements problem, PODS '10, New York, NY, USA: ACM, pp. 41–52, doi:10.1145/1807085.1807094, ISBN 978-1-4503-0033-9 .
Karp, R. M.; Papadimitriou, C. H.; Shenker, S. (2003), "A simple algorithm for finding frequent elements in streams and bags", ACM Transactions on Database Systems, 28 (1): 51–55, doi:10.1145/762471.762473 .
Lall, Ashwin; Sekar, Vyas; Ogihara, Mitsunori; Xu, Jun; Zhang, Hui (2006), "Data streaming algorithms for estimating entropy of network traffic", Proceedings of the Joint International Conference on Measurement and Modeling of Computer Systems (ACM SIGMETRICS 2006) (PDF), doi:10.1145/1140277.1140295.
Xu, Jun (Jim) (2007), A Tutorial on Network Data Streaming (PDF).
Heath, D.; Kasif, S.; Kosaraju, R.; Salzberg, S.; Sullivan, G. (1991), "Learning Nested Concepts With Limited Storage", Proceedings of the 12th International Joint Conference on Artificial Intelligence (IJCAI '91), Volume 2, pp. 777–782, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.
Morris, Robert (1978), "Counting large numbers of events in small registers", Communications of the ACM, 21 (10): 840–842, doi:10.1145/359619.359627 .
External links
- Princeton Lecture Notes
- Streaming Algorithms for Geometric Problems, by Piotr Indyk, MIT
- Dagstuhl Workshop on Sublinear Algorithms
- List of open problems in streaming (compiled by Andrew McGregor) from discussion at the IITK Workshop on Algorithms for Data Streams, 2006
- StreamIt - programming language and compilation infrastructure by MIT CSAIL
- IBM Spade - Stream Processing Application Declarative Engine
- IBM InfoSphere Streams
- Tutorials and surveys
  - Data Stream Algorithms and Applications by S. Muthu Muthukrishnan
  - Stanford STREAM project survey
  - Network Applications of Bloom filters, by Broder and Mitzenmacher
  - Xu's SIGMETRICS 2007 tutorial
  - Lecture notes from Data Streams course at Barbados in 2009, by Andrew McGregor and S. Muthu Muthukrishnan
- Courses
  - Dartmouth
  - MIT
  - Rice
  - Rutgers
  - University at Buffalo