I am doing some genome analysis and I am a bit stuck. I have some very sparse data and need to find the places where a moving average exceeds a certain threshold, marking each point as 1 or 0. The data is of a unique type, so I can't use the available programs to do the analysis.
Each point represents a point (base pair) on the human genome. For each data set there are 200,000,000 potential points. The data is essentially a list of ~12,000 index/value pairs, where all other points are assumed to be zero. What I need to do is take a moving average across the entire data set and return the regions where the average is above the threshold.
I am currently reading each point sequentially from the data set and building an array around each point I find, but this is very slow for large window sizes. Is there a more efficient way to do this, maybe with scipy or pandas?
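Roughly, what I am doing now looks like this (a simplified sketch of the sequential approach; the names and loop structure are only illustrative):

import numpy as np

def slow_mov_avg(idx, val, win, threshold):
    lookup = dict(zip(idx, val))  # sparse index -> value mapping
    hits = []
    for i in idx:
        # materialize a dense window of 2*win+1 points around each point
        window = [lookup.get(j, 0.0) for j in range(i - win, i + win + 1)]
        if sum(window) / len(window) > threshold:
            hits.append(i)
    return hits

The inner loop touches 2*win+1 positions for each of the ~12,000 points, which is why the cost blows up with the window size.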
Edit: Jamie’s magic code below is great (but I can’t vote yet)! I am very grateful.
You can vectorize the whole thing with numpy. I have built this (approximately) random data set of 12,000 indices between 0 and 199,999,999, and an equally long list of random floats between 0 and 1:
import numpy as np

indices = np.unique(np.random.randint(int(2e8), size=(12000,)))
values = np.random.rand(len(indices))
Then I build an index array with a total window size of 2*win + 1 around each index, and a corresponding array with each point's contribution to the moving average:
win = 10
avg_idx = np.arange(-win, win+1) + indices[:, None]
avg_val = np.tile(values[:, None]/(2*win+1), (1, 2*win+1))
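To make the broadcasting concrete, here is a tiny worked example of my own (not part of the benchmark below), with two indices and win = 1:

idx = np.array([3, 7])
val = np.array([0.9, 0.3])
w = 1
np.arange(-w, w+1) + idx[:, None]
# array([[2, 3, 4],
#        [6, 7, 8]])
np.tile(val[:, None]/(2*w+1), (1, 2*w+1))
# array([[0.3, 0.3, 0.3],
#        [0.1, 0.1, 0.1]])

Each value is spread evenly over the 2*w+1 positions of its window.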
All that is left is to find the repeated indices and add their contributions to the moving average together:
unique_idx, inv = np.unique(avg_idx, return_inverse=True)
# inv.ravel() is needed on newer NumPy, where the inverse keeps avg_idx's 2-D shape
mov_avg = np.bincount(inv.ravel(), weights=avg_val.ravel())
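To see the duplicate handling in isolation, here is another small example of my own with two overlapping windows; positions 3 and 4 appear in both, so their contributions are summed:

win_idx = np.array([[2, 3, 4],
                    [3, 4, 5]])        # two overlapping windows
win_val = np.array([[0.3, 0.3, 0.3],
                    [0.1, 0.1, 0.1]])  # their per-position contributions
uniq, inv = np.unique(win_idx, return_inverse=True)
uniq
# array([2, 3, 4, 5])
np.bincount(inv.ravel(), weights=win_val.ravel())
# array([0.3, 0.4, 0.4, 0.1])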
You can now get the list of indices where the moving average exceeds, for example, 0.5, as follows:
unique_idx[mov_avg > 0.5]
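Since the question asks for the regions above the threshold, one way to group these consecutive indices into (start, end) runs is the following sketch (my addition, not part of the original answer):

hits = unique_idx[mov_avg > 0.5]
if hits.size:
    # start a new region wherever consecutive hit indices are not adjacent
    breaks = np.where(np.diff(hits) > 1)[0] + 1
    regions = [(run[0], run[-1]) for run in np.split(hits, breaks)]
else:
    regions = []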
As for performance, first convert the above code into a function:
def sparse_mov_avg(idx, val, win):
    # a window of 2*win+1 positions around every nonzero index
    avg_idx = np.arange(-win, win+1) + idx[:, None]
    # each value contributes equally to every position in its window
    avg_val = np.tile(val[:, None]/(2*win+1), (1, 2*win+1))
    # collapse repeated positions, summing the overlapping contributions
    unique_idx, inv = np.unique(avg_idx, return_inverse=True)
    mov_avg = np.bincount(inv.ravel(), weights=avg_val.ravel())
    return unique_idx, mov_avg
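A quick usage example (my addition), applying it to the test data above:

unique_idx, mov_avg = sparse_mov_avg(indices, values, 10)
above = unique_idx[mov_avg > 0.5]  # positions where the moving average exceeds 0.5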
Here are some timings for several window sizes, using the test data described at the beginning:
In [2]: %timeit sparse_mov_avg(indices, values, 10)
10 loops, best of 3: 33.7 ms per loop
In [3]: %timeit sparse_mov_avg(indices, values, 100)
1 loops, best of 3: 378 ms per loop
In [4]: %timeit sparse_mov_avg(indices, values, 1000)
1 loops, best of 3: 4.33 s per loop