Riak performance – unexpected results

Over the last few days I have been experimenting with Riak. The initial setup was easier than I expected, and I now have a 3-node cluster. For testing purposes, all nodes run on the same VM.

I admit that the hardware settings of my virtual machine are very modest (1 CPU, 512 MB RAM), but I am still quite surprised by how slow Riak is.

Map Reduce

I played with MapReduce a bit. I have about 2000 objects in one bucket, each about 1k-2k in size as JSON. I use this map function:

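function(value, keyData, arg) {
    // Parse the stored JSON and take the first value of the object.
    var data = Riak.mapValuesJson(value)[0];

    // Emit the object only if its displayname contains "max".
    if (data.displayname.indexOf("max") !== -1) return [data];
    return [];
}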
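To make the behaviour of the map function concrete, here is a minimal sketch that runs it locally against a few hand-made objects. The `Riak` object below is a simplified stand-in for Riak's built-in helper, and the function name `mapDisplaynameContainsMax`, the bucket name and the sample data are all hypothetical, so treat this as an illustration rather than a benchmark.

```javascript
// Simplified stand-in for Riak's built-in helper (assumption: it parses the
// JSON body of every value stored under the key).
const Riak = {
  mapValuesJson(value) {
    return value.values.map((v) => JSON.parse(v.data));
  },
};

// The map function from the question, given a name so it can be called here.
function mapDisplaynameContainsMax(value, keyData, arg) {
  const data = Riak.mapValuesJson(value)[0];
  if (data.displayname.indexOf("max") !== -1) return [data];
  return [];
}

// Hypothetical sample objects, shaped roughly like what a map phase receives.
const sampleObjects = [
  { bucket: "users", key: "u1", values: [{ data: '{"displayname":"max power"}' }] },
  { bucket: "users", key: "u2", values: [{ data: '{"displayname":"anna"}' }] },
  { bucket: "users", key: "u3", values: [{ data: '{"displayname":"maximilian"}' }] },
];

// Run the map phase locally and concatenate the per-object results.
const matches = sampleObjects.flatMap((obj) => mapDisplaynameContainsMax(obj));
console.log(matches.map((m) => m.displayname));
```

Note that the function is called once per object, so a job over a whole bucket has to fetch and parse every object in it.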

Executing it took about 2 seconds just for the HTTP request to return its result, not counting the time my client code needs to deserialize the JSON result. Removing 2 of the 3 nodes seemed to improve performance slightly, but at roughly two seconds it still seems very slow to me.

Is this to be expected? The objects are not large in bytes, and 2000 objects in one bucket is not that many either.
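For reference, a job like the one described above is submitted as a JSON document to Riak's HTTP MapReduce endpoint (`/mapred`). The sketch below builds that document and shows one way to time the request without including JSON deserialization. The bucket name `users`, the helper names `buildMapredJob` and `runJob`, and the base URL passed to `runJob` are assumptions for illustration.

```javascript
// Build the JSON body for a MapReduce job with a single JavaScript map phase.
function buildMapredJob(bucket, mapSource) {
  return {
    inputs: bucket, // a bare bucket name means "every key in this bucket"
    query: [{ map: { language: "javascript", source: mapSource } }],
  };
}

const mapSource = `function(value, keyData, arg) {
  var data = Riak.mapValuesJson(value)[0];
  if (data.displayname.indexOf("max") !== -1) return [data];
  return [];
}`;

// "users" is a hypothetical bucket name.
const job = buildMapredJob("users", mapSource);
const body = JSON.stringify(job);

// Send the job and measure only the HTTP round trip (assumes a runtime with
// a global fetch, e.g. Node 18+, and a reachable Riak node at baseUrl).
async function runJob(baseUrl) {
  const started = Date.now();
  const res = await fetch(`${baseUrl}/mapred`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body,
  });
  const text = await res.text();
  const elapsedMs = Date.now() - started; // stop the clock before parsing
  return { results: JSON.parse(text), elapsedMs };
}
```

Calling `runJob("http://localhost:8098")` would target a node on what is normally the default HTTP port; adjust the URL to match the actual setup.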

Insert

Batch inserting about 60.000 objects of the same size as above took a long time, and in fact it did not really work.

My script for inserting objects into Riak died at about 40.000 objects and said it could no longer connect to the Riak node. In the Riak log, I found an error message indicating that the node had run out of memory and died.

Question

This is really my first shot at Riak, so there is a good chance that I messed something up.

Are there any settings I could tweak?
Are the hardware settings too constrained?
Maybe the PHP client library I use to interact with Riak is the limiting factor?
Running all nodes on the same physical machine is rather silly, but if that is the problem, how can I test Riak's performance better?
Is MapReduce really that slow? I read on the Riak mailing list about the performance cost of MapReduce, but if MapReduce is slow, how are you supposed to run "queries" for data you need in near real time? I know that Riak is not as fast as Redis.

If someone with more Riak experience could help me with some of these questions, it would help me a lot.

This answer is a bit late, but I want to point out that Riak's mapreduce implementation is designed mainly for working with links rather than entire buckets.

Riak's internal design is actually optimized against working with entire buckets. This is because a bucket is not treated as a sequential table, but as a key space distributed across a cluster of nodes. This means that random access is very fast (maybe O(log n), but don't quote me on that), whereas serial access is very, very slow. Serial access, the way Riak is currently designed, necessarily means asking all nodes for their data.
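The difference between the two access patterns can be sketched with a toy model. This is only an illustration under simplifying assumptions: keys are placed on nodes with a plain string hash, and the class `ToyCluster` and its counters are hypothetical. Riak's real ring is considerably more involved.

```javascript
// Toy model only: keys are assigned to nodes by a simple string hash.
function hashKey(key) {
  let h = 0;
  for (const ch of key) h = (h * 31 + ch.charCodeAt(0)) >>> 0;
  return h;
}

class ToyCluster {
  constructor(nodeCount) {
    this.nodes = Array.from({ length: nodeCount }, () => new Map());
    this.nodesContacted = 0; // counts node lookups, for illustration
  }
  nodeFor(key) {
    return this.nodes[hashKey(key) % this.nodes.length];
  }
  put(key, value) {
    this.nodeFor(key).set(key, value);
  }
  // Random access: the hash identifies the single node to ask.
  get(key) {
    this.nodesContacted += 1;
    return this.nodeFor(key).get(key);
  }
  // Serial access: no node knows the full key set, so every node is asked.
  listKeys() {
    const keys = [];
    for (const node of this.nodes) {
      this.nodesContacted += 1;
      keys.push(...node.keys());
    }
    return keys;
  }
}

const cluster = new ToyCluster(3);
for (let i = 0; i < 2000; i++) cluster.put(`user-${i}`, { id: i });

cluster.nodesContacted = 0;
const fetched = cluster.get("user-42");
const contactedForGet = cluster.nodesContacted;

cluster.nodesContacted = 0;
const allKeys = cluster.listKeys();
const contactedForList = cluster.nodesContacted;

console.log(contactedForGet, contactedForList, allKeys.length);
```

In this model a single `get` touches one node, while `listKeys` touches all of them, which mirrors why a map phase over a whole bucket involves the entire cluster.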

By the way, "buckets" in Riak terminology are, confusingly and disappointingly, not implemented the way you might think. What Riak calls a bucket is really just a namespace. Internally there is only one bucket, and keys are stored with the bucket name as a prefix. This means that no matter how small a bucket is, enumerating the keys in a single bucket of size n takes time proportional to m, where m is the total number of keys in all buckets.
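A small sketch of the "bucket as key prefix" idea shows why listing a tiny bucket still costs as much as scanning everything. The class `FlatStore`, the `/` separator and the bucket names are assumptions made for illustration; this is not Riak's actual storage code.

```javascript
// Toy model of "a bucket is just a key prefix" in one flat key space.
class FlatStore {
  constructor() {
    this.entries = new Map(); // one flat key space shared by all buckets
    this.keysExamined = 0;    // how many keys a listing had to look at
  }
  put(bucket, key, value) {
    this.entries.set(`${bucket}/${key}`, value);
  }
  get(bucket, key) {
    return this.entries.get(`${bucket}/${key}`);
  }
  // Listing one bucket walks every stored key and filters by prefix.
  listKeys(bucket) {
    const prefix = `${bucket}/`;
    const result = [];
    for (const fullKey of this.entries.keys()) {
      this.keysExamined += 1;
      if (fullKey.startsWith(prefix)) result.push(fullKey.slice(prefix.length));
    }
    return result;
  }
}

const store = new FlatStore();
for (let i = 0; i < 10; i++) store.put("small", `k${i}`, i);   // n = 10
for (let i = 0; i < 5000; i++) store.put("large", `k${i}`, i); // m = 5010 in total

const smallKeys = store.listKeys("small");
console.log(smallKeys.length, store.keysExamined); // 10 5010
```

Listing the 10-key bucket examines all 5010 stored keys, while `get` still goes straight to the one entry it needs.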

These limitations are implementation choices on Basho's part, not necessarily design flaws. Cassandra implements the same partitioning model as Riak, but supports efficient sequential range scans and mapreduce across large numbers of keys. Cassandra also implements true buckets.
