dimanche, 25 avril 2010

Implementing unique constraints with Cassandra

I was talking about my "Doodle clone with Cassandra backend" in a previous post, I want to explain a little bit how I implement 2 kinds of constraints with Cassandra.

Unique name constrain

When subscribing to a poll, a unique constraint on the name of the subscriber is applied. There is a first check in Javascript in the GUI, but there is also a check of the uniqueness of the names on the server side.

Before going deeper, I just need to say a word about the structure used :

To store the subscribers, I use a SuperColumnFamily, with the ID of the poll as key. For a poll having key "1234", it gives me something like :

"1234" => {
"subscriberKey1" => { name = "subscriberName1", options = ... }
"subscriberKey2" => { name = "subscriberName2", options = ... }
...
}
The trick about the subscriberKey* is to generate a unique ID based on a timestamp in order to be sorted by Cassandra.
The key will be something like : currentTimestamp - ClusterNodeId - ThreadId. Adding the ThreadId into the key give me the guarantee that using the System.currentTimeMillis() will be unique, because the processing will unfortunately take more than a millisecond :)

Using this key, I will write the subscriber row with ConsistencyLevel.QUORUM, and then I will read all the rows till the one I have inserted (with ConsistencyLevel.QUORUM again).
At this point I use Cassandra magic (rows are ordered) to be sure that I have selected all the subscribers they subscribe before the current one.
I remains me to check manually the uniqueness of the name. If another row has the same name, I will simply remove the current subscriber.

Limiting the number of subscribers per option

Another constraint I have implemented is the limitation of the number of subscribers per option.

The principle is the same as the unique name constraint : I count the number of previously inserted subscriber.
One point to keep in mind is that we can read a row with a duplicated name, which will be deleted later so it should not be taken into account for the actual constraint. This mean that we need to check if we are not counting duplicated names in this constraint too...

But the good news about this constraint is that it show a way to implement counters in Cassandra. Maybe I will write another post about this topic. The bad news is that we need to check the duplicated names in several places.

Limitation of the model

Complexity : adding more constraints grows the complexity in O(n^2), as every new constraints (duplicated names) should be implemented into every other constraints (limitation of subscribers per option).

Scalability : this model will scale very well for billions of polls with some tens of users. The inverse, some tens of polls with billions of users, needs a lot of resources for checking constraint, which will make the performance drop.

Aucun commentaire: