[Data Structure] Big Data Processing Interview Solution

The general idea for big data processing problems is this: split a file that cannot fit in memory into smaller files by some rule, then see whether a suitable data structure can solve the problem. Sometimes no splitting is needed at all and a bitmap will do; it depends on the specific problem. Let's look at the following big data interview questions. Usually the interviewer just wants an idea.

(1) Given a log file over 100 GB in size that stores IP addresses, design an algorithm to find the IP address with the most occurrences.

Problem analysis:

100 GB will certainly not fit into the memory of an ordinary machine. An IPv4 address is equivalent to a 32-bit integer, so for such a large file we consider splitting it, but naive splitting alone is not enough. Suppose 1 GB of memory is available and we split the file into 100 parts: because the same IP may be scattered across many parts, we would have to traverse all 100 files, count the occurrences of every IP, and then merge the counts to find the IP with the most occurrences. This works, but it is inefficient.

Solution:

Convert each 32-bit IP to an integer with a string hash function and split the 100 GB file into 1000 pieces: read the IPs from disk, take the hashed integer modulo the number of files, and write each IP into the file indicated by the remainder. This guarantees that identical IPs always end up in the same file. The method is called hash bucketing. After splitting, count the most frequent IP within each file; the problem then reduces to finding the most frequent one among these 1000 candidate IPs.
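As a rough sketch of the bucketing step (the bucket file naming scheme and the BKDR-style string hash are my own illustrative assumptions, not from the original log format):

#include <fstream>
#include <string>
#include <vector>

const size_t kBuckets = 1000;

// BKDR string hash: maps an IP string such as "192.168.0.1" to an integer.
size_t HashStr(const char* str)
{
    size_t hash = 0;
    while (size_t ch = (size_t)*str++)
        hash = hash * 131 + ch;
    return hash;
}

// Split a huge IP log into 1000 bucket files so that equal IPs always
// land in the same bucket. (A real system must also respect the OS
// limit on simultaneously open files.)
void SplitIntoBuckets(const char* logPath)
{
    std::ifstream in(logPath);
    std::vector<std::ofstream> out(kBuckets);
    for (size_t i = 0; i < kBuckets; ++i)
        out[i].open("bucket_" + std::to_string(i) + ".txt");
    std::string ip;
    while (in >> ip) // assume one whitespace-separated IP per record
        out[HashStr(ip.c_str()) % kBuckets] << ip << '\n';
}

Each bucket now fits in memory, so a simple hash map per bucket yields the per-file winner.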

(2) With the same conditions as the question above, how do you find the top K IPs?

Problem analysis:

The previous question finds the single most frequent IP; this one asks for the top K. It is easy to think of taking the most frequent IP from each file as before and then picking the top K among them, but that is actually wrong. Why? Because there is no guarantee that the largest count in one file exceeds the smallest count in another, so several of the true top K may sit in the same file. For example, the maximum count in file 1 might be 10 while the minimum count in file 2 is 100.

Solution:

Same as the first question: first split with hash bucketing, find the top K in each file, and finally merge the per-file top K lists into an overall top K. How do we find the top K? Build a min-heap of K elements; after all counts have passed through it, the K numbers remaining in the heap are the K largest. Why a min-heap rather than the max-heap one might intuitively reach for? Because the heap root must be the smallest of the K largest kept so far, so each new element only needs to be compared against the root. You can refer to the blog I wrote about the heap before; I explained this at the end: http://blog.csdn.net/pointer_y/article/details/52791518
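Here is a minimal sketch of the top-K step, assuming the per-IP counts have already been gathered into (count, ip) pairs; std::priority_queue with std::greater gives the min-heap:

#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<size_t, std::string> CountIp;

// Keep a min-heap of K (count, ip) pairs; the heap root is always the
// smallest of the K largest counts seen so far.
std::vector<CountIp> TopK(const std::vector<CountIp>& counts, size_t k)
{
    std::priority_queue<CountIp, std::vector<CountIp>,
                        std::greater<CountIp> > minHeap;
    for (size_t i = 0; i < counts.size(); ++i)
    {
        if (minHeap.size() < k)
            minHeap.push(counts[i]);
        else if (counts[i] > minHeap.top()) // bigger than the smallest kept entry
        {
            minHeap.pop();
            minHeap.push(counts[i]);
        }
    }
    std::vector<CountIp> result; // ascending by count
    while (!minHeap.empty())
    {
        result.push_back(minHeap.top());
        minHeap.pop();
    }
    return result;
}

Only the root is ever compared against a new element, which is exactly why the small heap, not the large one, is the right choice.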

(3) Given 10 billion integers, design an algorithm to find the integers that appear only once.

Problem analysis:

First, 10 billion integers will certainly not fit in memory. Second, the range of 32-bit integers covers only about 4.2 billion distinct values, so among these 10 billion there must be many duplicates. The first idea is usually to split the data into memory-sized chunks and then traverse them, but that is too slow. Since we are looking for numbers that appear exactly once, consider using a bitmap.

Solution:

A bitmap normally uses 1 bit to record whether a number exists; here we care about appearing exactly once, so we use 2 bits per number. Two bits give 4 states, of which we only need 3: absent, appeared once, and appeared more than once. We don't care about the exact count, only about appearing exactly once. With 2 bits per number and about 4.2 billion possible values, the bitmap takes 2^32 × 2 bits = 1 GB, which is affordable.
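A minimal sketch of the two-bit map; the state encoding (00 = absent, 01 = once, 10 = more than once) is my own assumption, consistent with the three states described above:

#include <cstdint>
#include <vector>

class TwoBitMap
{
public:
    // 16 two-bit slots per uint32_t; covering all 2^32 values costs
    // 2^32 * 2 bits = 1 GB.
    TwoBitMap() : _bits((1ULL << 32) / 16, 0) {}
    void Add(uint32_t x)
    {
        size_t index = x / 16;
        size_t shift = (x % 16) * 2;
        uint32_t state = (_bits[index] >> shift) & 0x3;
        if (state < 2) // saturate at "more than once"
        {
            _bits[index] &= ~(0x3u << shift);
            _bits[index] |= ((state + 1) << shift);
        }
    }
    bool AppearedOnce(uint32_t x) const
    {
        return ((_bits[x / 16] >> ((x % 16) * 2)) & 0x3) == 1;
    }
private:
    std::vector<uint32_t> _bits;
};

After one pass calling Add on every integer, a scan over all 2^32 values reporting AppearedOnce gives the answer.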

(4) Given two files, each with 10 billion integers, and only 1 GB of memory, how do you find the intersection of the two files?

Problem analysis:

The data cannot fit in memory. The first idea is to cut file A into 100 pieces and compare each piece against file B, but this is slow and inefficient: every piece of A must be checked against all of B.

Solution:

Use the hash bucketing method: split the 10 billion integers into 1000 files (the number of files used in hash bucketing is generally fairly large), taking each number modulo 1000 and placing it in the file corresponding to the remainder. After both files have been hash-bucketed, only files with the same bucket number need to be compared, because hash bucketing guarantees that equal integers land in files with the same number. This is much more efficient.
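A minimal sketch of comparing one matching pair of buckets (the bucket file names are illustrative): load bucket i of file A into a hash set, then stream bucket i of file B against it.

#include <fstream>
#include <string>
#include <unordered_set>
#include <vector>

// Bucket i of A fits in memory as a hash set; bucket i of B is streamed
// against it. Repeating this for i = 0..999 yields the full intersection.
std::vector<int> IntersectBucket(const std::string& bucketA,
                                 const std::string& bucketB)
{
    std::unordered_set<int> setA;
    std::ifstream inA(bucketA);
    int x;
    while (inA >> x)
        setA.insert(x);

    std::ifstream inB(bucketB);
    std::unordered_set<int> reported; // avoid emitting a value twice
    std::vector<int> result;
    while (inB >> x)
        if (setA.count(x) && reported.insert(x).second)
            result.push_back(x);
    return result;
}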

(5) A file has 10 billion ints and we have 1 GB of memory. Design an algorithm to find all integers that appear no more than 2 times.

Problem analysis:

This question is very similar to the third one, except that instead of numbers appearing exactly once, we want those appearing once or twice.

Solution:

Still use a bitmap with two bits per number; the four states are: absent, appeared once, appeared twice, and appeared more than twice. This time we collect the numbers in the "once" and "twice" states.

(6) Given two files, each with 10 billion queries (strings), and only 1 GB of memory, how do you find the intersection of the two files? Give an exact algorithm and an approximate algorithm.

Problem analysis:

This is again an intersection problem, so the hash bucketing method can solve it exactly. What about the approximate algorithm? Use a Bloom filter: when an element is added to the set, map it through K hash functions to K positions in a bit array and set those bits to 1. When querying, check whether all K positions are 1 to know (approximately) whether the element is in the set:

(Figure: an element mapped through K hash functions to K positions in the bit array.)


Solution:

The exact algorithm uses the hash bucketing method, much like the previous intersection question. The approximate algorithm uses a Bloom filter; its answer may contain false positives, but it is close to the true intersection.
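As a minimal sketch of the approximate side, here is a plain bit-based Bloom filter (only two hash functions for brevity; the full five-hash counting version appears in the source code below). Build it over file A's queries, then probe with file B's queries; a "no" is always correct, a "yes" may be a false positive.

#include <cstdint>
#include <string>
#include <vector>

class SimpleBloom
{
public:
    SimpleBloom(size_t bits) : _bits((bits >> 5) + 1, 0), _range(bits) {}
    void Set(const std::string& s)
    {
        SetBit(Hash1(s) % _range);
        SetBit(Hash2(s) % _range);
    }
    bool MayContain(const std::string& s) const // false positives possible
    {
        return TestBit(Hash1(s) % _range) && TestBit(Hash2(s) % _range);
    }
private:
    static size_t Hash1(const std::string& s) // BKDR
    {
        size_t h = 0;
        for (size_t i = 0; i < s.size(); ++i) h = h * 131 + (size_t)s[i];
        return h;
    }
    static size_t Hash2(const std::string& s) // SDBM
    {
        size_t h = 0;
        for (size_t i = 0; i < s.size(); ++i) h = 65599 * h + (size_t)s[i];
        return h;
    }
    void SetBit(size_t pos) { _bits[pos >> 5] |= (1u << (pos & 31)); }
    bool TestBit(size_t pos) const { return (_bits[pos >> 5] >> (pos & 31)) & 1; }
    std::vector<uint32_t> _bits;
    size_t _range;
};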

(7) How to extend BloomFilter to make it support the operation of deleting elements?

Problem analysis:

Because each string is represented by several bits, other strings may share some of those bits with the string to be deleted (as shown in the figure above). Clearing the bits directly would corrupt the representation of those other strings, so direct deletion is impossible.

Solution:

Use reference counting. Since a string may appear many times, a char counter may overflow; we use size_t instead, which sacrifices the space advantage of the Bloom filter. There is no way around it; if the file is too big, split it first. With reference counting we no longer need a bit to mark existence: a zero count itself means absent. I have implemented a Bloom filter that supports deletion below.

Source code:

#include <cstdio>
#include <iostream>
#include <string>
#include <vector>
using namespace std;

struct HashFunc1
{
    size_t operator()(const char* str)
    {
        size_t hash = 0;
        while (size_t ch = (size_t)*str++)
            hash = hash * 131 + ch; // 31, 131, 1313, 13131, 131313... also work
        return hash;
    }
};
struct HashFunc2
{
    size_t operator()(const char* str)
    {
        size_t hash = 0;
        while (size_t ch = (size_t)*str++)
            hash = 65599 * hash + ch; // equivalently: hash = ch + (hash << 6) + (hash << 16) - hash;
        return hash;
    }
};
struct HashFunc3
{
    size_t operator()(const char* str)
    {
        size_t hash = 0;
        size_t magic = 63689;
        while (size_t ch = (size_t)*str++)
        {
            hash = hash * magic + ch;
            magic *= 378551;
        }
        return hash;
    }
};
struct HashFunc4
{
    size_t operator()(const char* str)
    {
        size_t hash = 0;
        size_t ch;
        for (long i = 0; (ch = (size_t)*str++); i++)
        {
            if ((i & 1) == 0)
                hash ^= ((hash << 7) ^ ch ^ (hash >> 3));
            else
                hash ^= (~((hash << 11) ^ ch ^ (hash >> 5)));
        }
        return hash;
    }
};
struct HashFunc5
{
    size_t operator()(const char* str)
    {
        if (!*str)
            return 0;
        size_t hash = 1315423911;
        while (size_t ch = (size_t)*str++)
            hash ^= ((hash << 5) + ch + (hash >> 2));
        return hash;
    }
};

// A string-type Bloom filter that supports deletion via reference counting
template<class T = string,
         class _HashFunc1 = HashFunc1, class _HashFunc2 = HashFunc2,
         class _HashFunc3 = HashFunc3, class _HashFunc4 = HashFunc4,
         class _HashFunc5 = HashFunc5>
class BloomFilter
{
public:
    BloomFilter(size_t num)
        : _range(num * 5) // 5 hash functions per element, so open 5x counters
        // (the original constructor line was garbled; num * 5 is a reconstruction)
    {
        _RefBloomfilter.resize(_range);
    }
    void Set(const char* str)
    {
        // Get 5 corresponding positions through the 5 string hash functions
        size_t hash1 = _HashFunc1()(str) % _range;
        size_t hash2 = _HashFunc2()(str) % _range;
        size_t hash3 = _HashFunc3()(str) % _range;
        size_t hash4 = _HashFunc4()(str) % _range;
        size_t hash5 = _HashFunc5()(str) % _range;
        _RefBloomfilter[hash1]++;
        _RefBloomfilter[hash2]++;
        _RefBloomfilter[hash3]++;
        _RefBloomfilter[hash4]++;
        _RefBloomfilter[hash5]++;
        printf("%zu,%zu,%zu,%zu,%zu", hash1, hash2, hash3, hash4, hash5); // print the 5 positions to have a look
        cout << endl;
    }
    bool ReSet(const char* str) // delete
    {
        size_t hash1 = _HashFunc1()(str) % _range;
        size_t hash2 = _HashFunc2()(str) % _range;
        size_t hash3 = _HashFunc3()(str) % _range;
        size_t hash4 = _HashFunc4()(str) % _range;
        size_t hash5 = _HashFunc5()(str) % _range;
        if (_RefBloomfilter[hash1] == 0 || _RefBloomfilter[hash2] == 0
            || _RefBloomfilter[hash3] == 0 || _RefBloomfilter[hash4] == 0
            || _RefBloomfilter[hash5] == 0)
            return false;
        _RefBloomfilter[hash1]--;
        _RefBloomfilter[hash2]--;
        _RefBloomfilter[hash3]--;
        _RefBloomfilter[hash4]--;
        _RefBloomfilter[hash5]--;
        return true;
    }
    bool Test(const char* str) // find
    {
        size_t hash1 = _HashFunc1()(str) % _range;
        size_t hash2 = _HashFunc2()(str) % _range;
        size_t hash3 = _HashFunc3()(str) % _range;
        size_t hash4 = _HashFunc4()(str) % _range;
        size_t hash5 = _HashFunc5()(str) % _range;
        if (_RefBloomfilter[hash1] && _RefBloomfilter[hash2] && _RefBloomfilter[hash3]
            && _RefBloomfilter[hash4] && _RefBloomfilter[hash5]) // all non-zero means it (probably) exists
            return true;
        else
            return false;
    }
protected:
    vector<size_t> _RefBloomfilter; // to support deletion, use a vector of size_t for reference counting
    size_t _range;
};

void TestBloomfilter()
{
    BloomFilter<> bf(100);
    string url1("http://write.blog.csdn.net/postedit/53088473");
    string url2("http://blog.csdn.net/pointer_y/article/details/52926334");
    string url3("http://blog.csdn.net/pointer_y/article/details/52776595");
    bf.Set(url1.c_str());
    bf.Set(url2.c_str());
    bf.Set(url3.c_str());
    cout << bf.Test(url1.c_str()) << endl;
    bf.ReSet(url1.c_str());
    cout << bf.Test(url1.c_str()) << endl;
}

Analysis of the results: I tested with only three strings, and the positions they occupied did not collide. Searching for string 1 before deletion finds it; after deletion the result is 0, not found. This is a Bloom filter that supports deletion. An ordinary Bloom filter is implemented with a bitmap: existence is represented by several bits, which saves space, but then deletion does not work.

I have mentioned the concept of a bitmap several times. What is a bitmap? It simply uses one bit to indicate whether the corresponding data exists. STL also provides one (bitset). I implemented one myself, containing only the commonly used Set, ReSet, and Test methods.

#pragma once
#include <iostream>
#include <vector>
using namespace std;

class BitMap
{
public:
    BitMap(size_t size)
    {
        _bitMap.resize((size >> 5) + 1);
    }
    void Set(const int& x)
    {
        int index = x >> 5; // division is slow, so shift right by 5 to divide by 32 and find the int holding x
        int bit = x % 32;   // the bit of x within that int
        _bitMap[index] |= (1 << bit);
    }
    void ReSet(const int& x)
    {
        int index = x >> 5;
        int bit = x % 32;
        _bitMap[index] &= ~(1 << bit);
    }
    bool Test(const int& x)
    {
        int index = x >> 5;
        int bit = x % 32;
        return _bitMap[index] & (1 << bit);
    }
private:
    vector<int> _bitMap;
};

void TestBitMap()
{
    BitMap bm(40);
    bm.Set(10);
    bm.Test(10);
    bm.Test(20);
}

These are some of my views on big data problems. I'm sharing them here; if there are better solutions to the problems above, please feel free to enlighten me~

