Efficiently compute the moving average of sparse data and filter above a threshold in Python

I am doing some genomic analysis and I am a bit stuck. I have some very sparse data and need to find where the moving average exceeds a certain threshold, marking each point as 1 or 0. The data is of a unique type, so I cannot use the available analysis programs.

Each point represents a point (base pair) on the human genome. For each data set, there are 200,000,000 potential points. The data is essentially a list of ~12,000 index/value pairs, where all other points are assumed to be zero. What I need to do is take a moving average across the entire data set and return the regions where the average is above a threshold.

I am currently reading each point sequentially from the data set and building an array around each point I find, but this is very slow for large window sizes. Is there a more efficient way to do this, maybe with scipy or pandas?

Edit: Jamie’s magic code below is great (but I can’t vote yet)! I am very grateful.

You can vectorize the whole thing with numpy. I have built this random data set of (approx.) 12,000 indices between 0 and 199,999,999, and a list of random floats between 0 and 1 of the same length:

import numpy as np

indices = np.unique(np.random.randint(int(2e8), size=(12000,)))
values = np.random.rand(len(indices))

Then I build an index array with a total window size of 2 * win + 1 around each index, and a corresponding array holding each point's contribution to the moving average:

win = 10

avg_idx = np.arange(-win, win+1) + indices[:, None]
avg_val = np.tile(values[:, None] / (2 * win + 1), (1, 2 * win + 1))
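
To make the shapes concrete, here is a minimal sketch of the same two lines on a tiny, made-up data set (the two positions and values are assumptions chosen for illustration, not part of the original data):

```python
import numpy as np

indices = np.array([5, 7])       # hypothetical positions of the non-zero points
values = np.array([3.0, 6.0])    # hypothetical values at those positions
win = 1

# (2*win+1,) offsets broadcast against an (n, 1) column -> (n, 2*win+1) grid
avg_idx = np.arange(-win, win + 1) + indices[:, None]
# Each point contributes value / window length to every position in its row
avg_val = np.tile(values[:, None] / (2 * win + 1), (1, 2 * win + 1))

print(avg_idx)   # [[4 5 6]
                 #  [6 7 8]]
print(avg_val)   # [[1. 1. 1.]
                 #  [2. 2. 2.]]
```

Position 6 appears in both rows, which is exactly the kind of repeated index the next step has to merge.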

All that is left is to find the repeated indices and add their contributions to the moving average together:

# Flatten first: np.bincount needs a 1-D inverse array
unique_idx, inv = np.unique(avg_idx.ravel(), return_inverse=True)
mov_avg = np.bincount(inv, weights=avg_val.ravel())
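
One way to convince yourself that this merge step really yields a centred moving average is to compare it with a dense box filter on a data set small enough to hold in memory. The sketch below assumes a hypothetical 1,000-point "genome" with 50 random points, and keeps the points away from the edges so the dense filter is not truncated:

```python
import numpy as np

rng = np.random.default_rng(0)
n, win = 1000, 10                          # hypothetical small sizes
indices = np.unique(rng.integers(win, n - win, size=50))
values = rng.random(len(indices))

# Sparse computation, following the steps above
avg_idx = np.arange(-win, win + 1) + indices[:, None]
avg_val = np.tile(values[:, None] / (2 * win + 1), (1, 2 * win + 1))
unique_idx, inv = np.unique(avg_idx.ravel(), return_inverse=True)
mov_avg = np.bincount(inv, weights=avg_val.ravel())

# Dense reference: zero-filled array smoothed with a box kernel
dense = np.zeros(n)
dense[indices] = values
kernel = np.ones(2 * win + 1) / (2 * win + 1)
reference = np.convolve(dense, kernel, mode="same")

matches = np.allclose(reference[unique_idx], mov_avg)
# Positions that never appear in unique_idx have a moving average of zero
untouched = np.setdiff1d(np.arange(n), unique_idx)
rest_is_zero = np.allclose(reference[untouched], 0.0)
print(matches, rest_is_zero)
```

The dense version is only practical here because n is small; the point of the sparse version is that it never allocates an array of the full genome length.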

You can now get the list of indices where, for example, the moving average exceeds 0.5, as follows:

unique_idx[mov_avg > 0.5]
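
Since the question asks for regions rather than single points, the selected indices can be grouped into contiguous runs. This is a minimal sketch, not part of the original answer: the helper name above_threshold_regions is hypothetical, it assumes the positions are sorted and unique (as np.unique returns them), and it assumes a positive threshold, so that positions missing from unique_idx (where the average is zero) never qualify.

```python
import numpy as np

def above_threshold_regions(positions):
    """Group sorted, unique integer positions into inclusive (start, end) runs."""
    positions = np.asarray(positions)
    if positions.size == 0:
        return np.empty((0, 2), dtype=positions.dtype)
    # A new run starts wherever consecutive positions are not adjacent
    breaks = np.flatnonzero(np.diff(positions) > 1)
    starts = positions[np.concatenate(([0], breaks + 1))]
    ends = positions[np.concatenate((breaks, [positions.size - 1]))]
    return np.column_stack((starts, ends))

# Stand-in for unique_idx[mov_avg > 0.5]
hits = np.array([6, 7, 8, 20, 21, 30])
regions = above_threshold_regions(hits)
print(regions)   # [[ 6  8]
                 #  [20 21]
                 #  [30 30]]
```

A list of (start, end) pairs is usually more compact than a 1/0 array with 200,000,000 entries, and the dense mask can still be built from it if needed.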

As for performance, first turn the above code into a function:

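def sparse_mov_avg(idx, val, win):
    # Positions covered by the window around each non-zero point
    avg_idx = np.arange(-win, win + 1) + idx[:, None]
    # Contribution of each point to every position in its window
    avg_val = np.tile(val[:, None] / (2 * win + 1), (1, 2 * win + 1))
    # Flatten first so the inverse array is 1-D for np.bincount
    unique_idx, inv = np.unique(avg_idx.ravel(), return_inverse=True)
    # Sum the contributions that land on the same position
    mov_avg = np.bincount(inv, weights=avg_val.ravel())
    return unique_idx, mov_avg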

Here are some timings for several window sizes, using the test data described at the beginning:

In [2]: %timeit sparse_mov_avg(indices, values, 10)
10 loops, best of 3: 33.7 ms per loop

In [3]: %timeit sparse_mov_avg(indices, values, 100)
1 loops, best of 3: 378 ms per loop

In [4]: %timeit sparse_mov_avg(indices, values, 1000)
1 loops, best of 3: 4.33 s per loop
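
The timings grow roughly in proportion to the window size, because the intermediate arrays hold one entry per point per window position. If that becomes a problem, one possible alternative (not part of the answer above, and untimed here) is to exploit the fact that the moving average is piecewise constant: it only changes where a point enters or leaves the window. The sketch below assumes idx is sorted and free of duplicates, as np.unique guarantees; the function name sparse_mov_avg_intervals is hypothetical.

```python
import numpy as np

def sparse_mov_avg_intervals(idx, val, win):
    """Piecewise-constant moving average of sparse data.

    Returns (starts, avgs): the average equals avgs[k] on the half-open
    interval [starts[k], starts[k + 1]); from the last start onwards it is 0.
    Assumes idx is sorted and has no duplicates.
    """
    # Prefix sums, so any window sum is a difference of two entries
    csum = np.concatenate(([0.0], np.cumsum(val)))
    # A point at i enters the window at i - win and leaves at i + win + 1
    starts = np.unique(np.concatenate((idx - win, idx + win + 1)))
    # Points whose index lies in [start - win, start + win]
    lo = np.searchsorted(idx, starts - win, side="left")
    hi = np.searchsorted(idx, starts + win, side="right")
    avgs = (csum[hi] - csum[lo]) / (2 * win + 1)
    return starts, avgs

# Tiny made-up example: two points whose windows overlap at position 6
starts, avgs = sparse_mov_avg_intervals(np.array([5, 7]), np.array([3.0, 6.0]), 1)
print(starts)   # [4 6 7 9]
print(avgs)     # [1. 3. 2. 0.]
```

The arrays here have at most two entries per data point regardless of win, and the intervals where avgs exceeds the threshold are the regions the question asks for. The trade-off is that the result is a list of intervals rather than one value per covered position.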
