Performance - Why is a Cassandra secondary index on 350K rows so slow?

I have a column family with a secondary index. The secondary index is basically a binary field, but I am using a string for it. The field is named is_exported and can be "true" or "false". After each request, all loaded rows are updated with is_exported = 'false'.

I poll this list every ten minutes and export whenever new rows appear.

But here is the problem: the time this query takes grows roughly linearly with the amount of data in the list, and it currently takes 12 to 20 seconds (!!!) to find 5000 rows. As I understand it, an index query should not depend on the number of rows in the CF, but on the number of rows per index value (the cardinality), because the index is just another hidden CF, such as:

"true": rowKey1 rowKey2 rowKey3 ...
"false": rowKey1 rowKey2 rowKey3 ...
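
The mental model above can be made concrete with a toy in-memory sketch. It is not how Cassandra implements secondary indexes internally; `ToyIndexedTable`, `put`, and `lookup` are hypothetical names used only for illustration. Under this model, a lookup walks only the keys filed under one index value, however many rows the table holds.

```python
from collections import defaultdict
from itertools import islice


class ToyIndexedTable:
    """Toy model: a table plus a 'hidden' index mapping value -> row keys."""

    def __init__(self, indexed_column):
        self.indexed_column = indexed_column
        self.rows = {}                   # row key -> dict of columns
        self.index = defaultdict(dict)   # indexed value -> insertion-ordered row keys

    def put(self, key, columns):
        old = self.rows.get(key, {}).get(self.indexed_column)
        if old is not None:
            # Remove the stale index entry before re-indexing the row.
            self.index[old].pop(key, None)
        merged = {**self.rows.get(key, {}), **columns}
        self.rows[key] = merged
        value = merged.get(self.indexed_column)
        if value is not None:
            self.index[value][key] = None

    def lookup(self, value, count):
        # Only the keys filed under `value` are visited, at most `count` of them.
        keys = islice(self.index.get(value, {}), count)
        return [(key, self.rows[key]) for key in keys]


# Illustrative usage with made-up row keys.
table = ToyIndexedTable("is_exported")
for i in range(10):
    table.put("rowKey%d" % i, {"is_exported": "false"})
table.put("rowKey3", {"is_exported": "true"})
pending = table.lookup("false", count=5)
```

If the real index behaved like this, the cost of `lookup` would track `count` and the size of one index row, not the size of the whole column family.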

I am using Pycassa to query the data. Here is the code I am using:

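import pycassa
from pycassa.index import create_index_clause, create_index_expression

column_family = pycassa.ColumnFamily(cassandra_pool, column_family_name, read_consistency_level=2)
is_exported_expr = create_index_expression('is_exported', 'false')
# Ask for at most 5000 rows where is_exported == 'false'.
clause = create_index_clause([is_exported_expr], count=5000)
column_family.get_indexed_slices(clause)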
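
To put a number on how long the query takes, one option is to time the full consumption of the result. The sketch below is a generic helper and not part of pycassa; `time_fetch` is a hypothetical name. It assumes the query returns a lazy iterator, in which case the rows have to be consumed before the elapsed time means anything.

```python
import time


def time_fetch(fetch):
    """Call fetch(), consume everything it yields, return (row_count, seconds)."""
    start = time.perf_counter()
    row_count = sum(1 for _ in fetch())
    elapsed = time.perf_counter() - start
    return row_count, elapsed
```

With the code above it could be called as `time_fetch(lambda: column_family.get_indexed_slices(clause))`.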

Maybe I am doing something wrong, but I would expect this operation to be much faster.

Any ideas or suggestions?

Some configuration information:

>Cassandra 1.1.0
>RandomPartitioner
>2 nodes, replication_factor = 2 (each server has a complete copy of the data)
>AWS EC2, large instances
>software RAID 0 on temporary drives

Thanks in advance!

I don't know the internal structure of indexes in Cassandra, but I assume they behave similarly to those in PostgreSQL/MySQL, where indexing a boolean (true/false) column is redundant in many cases. If the cardinality is low (true and false = 2 unique values) and the data is fairly evenly distributed, for example ~50% true and ~50% false, the database engine may perform a full table scan (without using the index).
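
The low-cardinality argument can be illustrated by computing, for each indexed value, the fraction of rows an index lookup would still have to visit. This is a plain-Python sketch with a hypothetical helper name (`selectivity`); it does not query Cassandra or reproduce any planner's actual decision rule.

```python
from collections import Counter


def selectivity(values):
    """Return {value: fraction of rows holding that value}."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {value: n / total for value, n in counts.items()}
```

For a column split evenly between "true" and "false", each value maps to 0.5, so the index only narrows the search to half the table. An index pays off when the value being queried maps to a small fraction.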

The linear relationship between query execution time and data set size further supports the idea that Cassandra is performing a full table (keyspace) scan.
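
Whether the query time really grows linearly with table size can be checked from a handful of measurements. The sketch below fits a least-squares line and reports the Pearson correlation. `linear_fit` is a hypothetical helper, and the (row count, seconds) pairs have to come from actual measurements, since the question gives only a single data point.

```python
import math


def linear_fit(sizes, seconds):
    """Least-squares line seconds ~ slope * size + intercept, plus Pearson r."""
    n = len(sizes)
    if n < 2 or n != len(seconds):
        raise ValueError("need at least two (size, seconds) pairs")
    mean_x = sum(sizes) / n
    mean_y = sum(seconds) / n
    sxx = sum((x - mean_x) ** 2 for x in sizes)
    syy = sum((y - mean_y) ** 2 for y in seconds)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, seconds))
    if sxx == 0:
        raise ValueError("sizes must not all be equal")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # r is left at 0.0 when the timings do not vary at all.
    r = sxy / math.sqrt(sxx * syy) if syy > 0 else 0.0
    return slope, intercept, r
```

A correlation close to 1 with a clearly positive slope would be consistent with a scan over the whole data set, though it does not prove one on its own.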
