Friday, August 5, 2011

Cassandra, Hector, Maven and Tomcat

Cassandra (v0.7 and 0.8), Hector (in a version compatible with Cassandra), Maven and Tomcat used together produce a warning at Tomcat startup:

INFO: validateJarFile(/...<webappspath>.../WEB-INF/lib/servlet-api-2.5-20081211.jar) - jar not loaded. See Servlet Spec 2.3, section 9.7.2. Offending class: javax/servlet/Servlet.class

Nothing harmful here, but since we want to do things as cleanly as possible, we should avoid violating Servlet Spec 2.3.

So running mvn dependency:tree helps to find out the guilty dependency:

[INFO] +- me.prettyprint:hector-core:jar:0.8.0-2:compile
[INFO] |  +- org.apache.cassandra:cassandra-all:jar:0.8.1:compile
[INFO] |  |  +- org.apache.cassandra.deps:avro:jar:1.4.0-cassandra-1:compile
[INFO] |  |  |  \- org.mortbay.jetty:jetty:jar:6.1.22:compile
[INFO] |  |  |     \- org.mortbay.jetty:servlet-api:jar:2.5-20081211:compile

Hum hum, org.mortbay.jetty:servlet-api:jar:2.5-20081211. We have found the culprit.

Using <dependencyManagement> in our pom.xml file, we can mark this dependency as provided, and the problem should fly away:
  <dependencyManagement>
    <dependencies>
      <dependency>
        <groupId>org.mortbay.jetty</groupId>
        <artifactId>servlet-api</artifactId>
        <version>2.5-20081211</version>
        <scope>provided</scope>
      </dependency>
    </dependencies>
  </dependencyManagement>
Build, deploy, and that's it!

Saturday, July 16, 2011

More Interview Questions

All interviews are different, all candidates are different, all positions are different. The goal here is to provide some general questions to test the engineering skills of candidates. These questions are common in the recruiting processes of technical companies like Google or Verisign.

How could you detect a loop in a linked list?

It says linked list; to be more precise, you only have a pointer to the head of the list, and you know how to move to the next element of the list.
Of course you are not allowed to modify the list at all. And obviously you should try to propose an algorithm that does it in the most efficient way; let's say you should target O(n).

In more programmer-oriented terms: write a function A that takes as parameter the first element of a List and returns a boolean, true if the given List has a loop, false otherwise. You know how to move to the next element of the List: just call element.next.

For those who want to try to answer by themselves, stop reading here.

One good answer is to have two iterators that walk through the elements of the List at different speeds. One iterator moves one element per iteration; the second moves two elements per iteration and checks whether it meets the first iterator during its move. How fast the second iterator goes can be made configurable.
If at some point the second (fast) iterator meets the first one, there is a loop in the List. If the second iterator reaches the end of the List, there is obviously no loop.

In Java, a direct translation of this idea (Floyd's cycle-finding algorithm) could be:

// Element is the list node type with a next field, as described above
public static boolean hasLoop(Element head) {
    Element slow = head; // moves 1 element per iteration
    Element fast = head; // moves 2 elements per iteration
    while (fast != null && fast.next != null) {
        slow = slow.next;
        fast = fast.next.next;
        if (fast == slow) {
            return true; // the fast iterator met the slow one: there is a loop!
        }
    }
    return false; // the fast iterator reached the end: there is no loop
}
As targeted, this algorithm has a complexity of O(n), since the worst case is the last element looping to itself. When the last element loops back to the first one, the algorithm will detect the loop in about n / (speed of the second iterator) iterations.
And finally, the memory footprint is very low: just two references, i.e. O(1).

What are the advantages of a LinkedList over an ArrayList?

This question is more often asked in a different way: what are the differences between a LinkedList and an ArrayList? But turning the question upside down like this is a bit more interesting. So, what are the advantages?

The basic advantage of a LinkedList is its O(1) insertion and deletion for any given element (once you hold a reference to its position), while an ArrayList needs to shift all the elements located after the given element. Another advantage is the linearly increasing memory footprint of a LinkedList, whereas ArrayList memory (still linear) increases by steps and then sometimes performs costly memory copies. With such memory copies an ArrayList has an amortized complexity of O(1), but this leads to non-deterministic operation costs, which are unwanted when dealing with tight SLAs.
One more hidden advantage of the LinkedList is still on the topic of memory usage: it does not require allocating blocks of contiguous memory, since elements stored in a LinkedList do not need to live next to each other. Just note that fragmentation of the memory is not always seen as a good point.
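
As an illustrative sketch (my own example, not a benchmark), the difference shows up when inserting in the middle of a large list:
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.ListIterator;

public class MiddleInsertionDemo {
    public static void main(String[] args) {
        List<Integer> linked = new LinkedList<Integer>();
        List<Integer> array = new ArrayList<Integer>();
        for (int i = 0; i < 1000000; i++) {
            linked.add(i);
            array.add(i);
        }
        // LinkedList: positioning the iterator is O(n), but each add() then
        // only relinks two nodes, with no element shifting and no array copy
        ListIterator<Integer> it = linked.listIterator(500000);
        it.add(-1);
        // ArrayList: the same insertion shifts ~500000 elements to the right,
        // and may trigger a full copy into a bigger backing array
        array.add(500000, -1);
    }
}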

Monday, March 7, 2011

NoSQL Overview @Webmardi

On Tuesday, March 1st, 2011, I gave an overview of NoSQL at the Webmardi monthly event. The presentation was hosted by Webdoc (http://www.webdoc.com). Thanks all for the warm welcome!

The slides are available on Slideshare : NoSQL Overview, implementation free.

It is called "implementation free" because I didn't want to focus on a particular implementation of a NoSQL datastore, but rather generalize the concepts of "what is NoSQL". And it's a tricky answer to formulate :)

So what is NoSQL? It's a family of datastores that do not have SQL as their main data access pattern.

Thursday, January 6, 2011

Eventual Consistency demystified

In my crusade into the NoSQL world, Eventual Consistency is everywhere. I want to demystify this property a little bit here.

But let's begin with an example, so that we have a common base for the discussion:

  • Let "Node1", "Node2" and "Node3" be three nodes (servers) that are part of our distributed datastore.
  • Let "User A", "User B", "User c" be three users wanting to read and write data in our fictive distributed datastore.
At time (1), "User A" write the value "A" to "Node1". "Node1" will replicate asynchronously this value to both "Node2" and "Node3" (specific to my example).
At time (2) the write call of "Node A" returns. But the replication of value "A" hasn't been completely propagate to "Node2" and "Node3".
At time (3), "User B" and "User C" will read value "A" from "Node1" and "Node2" respectively. "User B" got the latest value (because it reads the node which initiate the update), "User C" will read either the old or the new version of "A", but without any guarantee regarding what it will read.



At a later time (5), once the replication has completed, "User B" and "User C" re-read value "A" and both get the same value. At this point in time, the datastore is consistent.
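
To make the timeline more tangible, here is a toy Java simulation (purely illustrative, not a real datastore client) where the write returns before the asynchronous replication completes:
public class EventualConsistencyDemo {
    static volatile String node1 = "old", node2 = "old", node3 = "old";

    public static void main(String[] args) throws InterruptedException {
        // time (1): "User A" writes to "Node1"; replication is asynchronous
        node1 = "A";
        Thread replication = new Thread(new Runnable() {
            public void run() {
                sleep(100); // network / processing delay
                node2 = "A";
                node3 = "A";
            }
        });
        replication.start();
        // time (2): the write call has already returned to "User A"
        // time (3): "User B" reads "Node1", "User C" reads "Node2"
        System.out.println("User B reads: " + node1); // "A": the initiating node
        System.out.println("User C reads: " + node2); // most likely still "old"
        replication.join();
        // time (5): replication is done, the datastore is consistent again
        System.out.println("User C re-reads: " + node2); // "A"
    }

    static void sleep(long millis) {
        try { Thread.sleep(millis); } catch (InterruptedException e) { }
    }
}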

Immediate Consistency

With Immediate Consistency, as opposed to Eventual Consistency, the write call from "User A" has to wait until the replication is done on the other nodes before returning, and the replica nodes ("Node2" and "Node3") have to be synchronized so that they expose the new value at the same time.

Moreover, if "Node1" is unable to talk to "Node2", the write replication will probably fail then the write call from "User A" will fail.

As we can see, Immediate Consistency is hard to scale (see two-phase commit or the Paxos algorithm), because it increases the latency of writes and makes the system less resilient to failures.

Trade-off for scaling writes

Eventual Consistency is thus a trade-off for scaling writes that seems reasonable in certain use cases.

Thursday, December 23, 2010

DNSSEC NSEC3 domain hash computation algorithm

DNSSEC is a DNS extension to authenticate and ensure the integrity of DNS responses, in order to offer protection against DNS spoofing.

DNSSEC comes with two "denial of existence" mechanisms: NSEC (RFCs 4033, 4034, 4035) and NSEC3 (RFC 5155).

Now how "denial of existence" works ?

When a query is performed on a non-existing domain, a specific answer is returned to the resolution client, giving the closest domains that are alphabetically before and after the queried domain. But what is very sensitive in this way of proving the non-existence of a domain is that it makes it easy to enumerate the whole zone.
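
For example (with hypothetical names, not taken from a real zone), querying the non-existing name b.example.com could return an NSEC record such as:

a.example.com. NSEC c.example.com. A RRSIG NSEC

It proves that no name exists between a.example.com and c.example.com, but it also reveals that c.example.com exists; querying a name just "after" c.example.com then reveals the next existing name, and so on until the whole zone has been enumerated.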

That's why NSEC3 was designed: it proves the non-existence of a domain while at the same time preventing this zone walk-through.
Instead of simply returning the closest domains, it returns a hash of the domains.

How to compute the NSEC3 hash?

I will detail a little bit how this NSEC3 hash is computed:

If you have a look at a zone, you will find additional records, like NSEC3PARAM:

example.com. NSEC3PARAM 1 0 12 aabbccdd
The format of such a record is composed of:
  • an algorithm field; 1 means SHA-1
  • a flags field
  • an iterations field (12 in this example)
  • a salt, represented as a sequence of case-insensitive hexadecimal digits (aabbccdd here).
Then the hashing algorithm is given by:
IH(salt, x, 0) = H(x || salt), and
IH(salt, x, k) = H(IH(salt, x, k-1) || salt), if k > 0
With my example.com domain, the hash computation will be:

IH(fromHexStringToByte("aabbccdd"), toCanonicalWireFormat("example.com"), 12)

fromHexStringToByte is a base-16 decoder: fromHexStringToByte("aabbccdd") = [0xaa, 0xbb, 0xcc, 0xdd]. See RFC 4648.

toCanonicalWireFormat converts the domain to wire format using its canonical form: toCanonicalWireFormat("example.com") = [0x07, 0x65, 0x78, 0x61, 0x6d, 0x70, 0x6c, 0x65, 0x03, 0x63, 0x6f, 0x6d, 0x00]. See RFC 4034 (canonical form) and RFC 3845 (wire format).
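
Putting it all together, here is a minimal Java sketch of the computation (my own illustrative implementation of the helpers named above, using only the JDK):
import java.io.ByteArrayOutputStream;
import java.security.MessageDigest;

public class Nsec3 {

    // IH(salt, x, 0) = H(x || salt); IH(salt, x, k) = H(IH(salt, x, k-1) || salt)
    static byte[] nsec3Hash(byte[] wireName, byte[] salt, int iterations) throws Exception {
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1"); // algorithm 1 = SHA-1
        byte[] hash = wireName;
        for (int k = 0; k <= iterations; k++) { // iterations + 1 rounds in total
            sha1.update(hash);
            sha1.update(salt);
            hash = sha1.digest();
        }
        return hash;
    }

    static byte[] fromHexStringToByte(String hex) {
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++)
            out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        return out;
    }

    // canonical form: lower case; wire format: length-prefixed labels, ending with the root label
    static byte[] toCanonicalWireFormat(String domain) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String label : domain.toLowerCase().split("\\.")) {
            out.write(label.length());
            for (char c : label.toCharArray()) out.write(c);
        }
        out.write(0);
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        byte[] hash = nsec3Hash(toCanonicalWireFormat("example.com"),
                                fromHexStringToByte("aabbccdd"), 12);
        for (byte b : hash) System.out.printf("%02x", b); // real NSEC3 records use base32hex
        System.out.println();
    }
}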

And that's it, you are now able to compute the NSEC3 hash of your favourite domain. You just need to wait for the NSEC3PARAM record to be published in the respective zone to get all the necessary parameters :)

Friday, November 19, 2010

Using java.net.InetAddress.getByName() to validate an IP Address may be too permissive

The static method java.net.InetAddress.getByName() can be used to validate a given IP address.

But when doing it that way, you need to pay attention to some border values with surprising side effects:

String ip = "1.2.3.4"; // valid IP Address
InetAddress address = InetAddress.getByName( ip );
Assert.assertEquals( ip, address.getHostAddress() ); // Pass. OK

String ip = "1.2.3"; // invalid IP Address
InetAddress address = InetAddress.getByName( ip );
Assert.assertEquals( ip, address.getHostAddress() ); // Pass. KO !
If you look a little deeper, you will see that 1.2.3 is transformed into 1.2.0.3, which after transformation becomes a valid IP address.

Having that in mind, an IP address validation function can look like:
public static boolean validateIpAddress( String anIpAddress ) {
    try {
        // getHostAddress() returns the textual form of the parsed address;
        // it equals the input only if the input was already a well-formed address
        InetAddress address = InetAddress.getByName( anIpAddress );
        return address.getHostAddress().equals( anIpAddress );
    } catch (final UnknownHostException e) {
        return false;
    }
}
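
A quick check of the border values with this function:

validateIpAddress( "1.2.3.4" ); // true
validateIpAddress( "1.2.3" );   // false: getHostAddress() returns "1.2.0.3"
validateIpAddress( "foo" );     // false, assuming "foo" does not resolve in DNS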

Saturday, October 9, 2010

Moving from ELCA to Verisign

I had a great time at ELCA Informatique SA, working for almost 3 years on Secutix, a complete ticketing system. But now it's time for me to move forward and see something else: different projects, different colleagues, different cultures, more English spoken, closer to home, and so on...

That's why I will move to Verisign, starting on the 2nd of November (the 1st is a day off :)). Working at the heart of the Internet (aka DNS) is a kind of childhood dream that I will be able to realize! I'm really excited to see how root name servers work, what's behind registrars (aka the registry), how DNSSEC is implemented, and what 1 billion transactions a day means.

I really look forward to starting there, and at the same time I wish all the best to the Secutix team!

Monday, August 2, 2010

Dealing with old SSL certificates (algorithm check failed: MD2withRSA is disabled)

I faced an SSLHandshakeException with the message "algorithm check failed: MD2withRSA is disabled" when upgrading beyond Java 1.6.0_17.

The source of the problem is well known: MD2withRSA has been removed from the JVM because it is no longer secure (references: CVE-2009-2409 deprecate MD2 in SSL cert validation, Sun Java 1.6u17 release notes).

Lots of posts about this problem suggest updating the certificate to use another signature algorithm, for example SHA1withRSA.

But what can we do when the certificate is not in our hands?

I will try to explain the problem and how I solved it.

Analyse the certificate chain

The first step is to analyse the certificate chain to see which certificate is using the deprecated algorithm.

The following command prints all the certificates in the chain and stores them under the names level0.pem, level1.pem and so on:

openssl s_client -showcerts -connect xxxxxxxxxxxxxxxxx:443 < /dev/null | awk -v c=-1 '/-----BEGIN CERTIFICATE-----/{inc=1;c++} inc {print > ("level" c ".pem")}/---END CERTIFICATE-----/{inc=0}'; for i in level?.pem; do openssl x509 -noout -serial -subject -issuer -in "$i"; done
The output given in my case :
  • Cert1, serial=xxxxxxxxxxxxxxxxxxxxx
    subject= xxxxxxxxxxxxxxxxxxxxxxxx
    issuer= /C=ZA/O=Thawte Consulting (Pty) Ltd./CN=Thawte SGC CA
  • Cert2, serial=30000002
    subject= /C=ZA/O=Thawte Consulting (Pty) Ltd./CN=Thawte SGC CA
    issuer= /C=US/O=VeriSign, Inc./OU=Class 3 Public Primary Certification Authority
  • Cert3, serial=70BAE41D10D92934B638CA7B03CCBABF
    subject= /C=US/O=VeriSign, Inc./OU=Class 3 Public Primary Certification Authority
    issuer= /C=US/O=VeriSign, Inc./OU=Class 3 Public Primary Certification Authority
We have a chain of 3 certificates: Cert1 signed by Cert2, Cert2 signed by Cert3, and Cert3 signed by itself (self-signed).

Note: the first error here is that the webserver explicitly serves the entire chain of certificates. Thus, a change to any root certificate shipped with the JRE will not be taken into account automatically.

Digging into the details of the third certificate (openssl x509 -in level2.pem -text) shows that this certificate is signed with the old algorithm:
Certificate:
    Data:
        Version: 1 (0x0)
        Serial Number:
            70:ba:e4:1d:10:d9:29:34:b6:38:ca:7b:03:cc:ba:bf
        Signature Algorithm: md2WithRSAEncryption
        Issuer: C=US, O=VeriSign, Inc., OU=Class 3 Public Primary Certification Authority
        Validity
            Not Before: Jan 29 00:00:00 1996 GMT
            Not After : Aug  1 23:59:59 2028 GMT
So not only is the way of providing the full chain wrong from the server configuration point of view, but the last certificate was never updated and thus still uses the deprecated signature algorithm.

Correct the certificate chain

Now that we have identified the weak link of the chain, what's the next step?

The key point here is to correct the certificate chain in order to make it more compliant with the standards.

As the third certificate is a well-known root certificate (named verisignclass3ca in the JRE cacerts), a working version of it is provided in the Java cacerts (Debian-oriented command line, with "changeit" as password):
keytool -keystore /etc/java-6-sun/security/cacerts -exportcert -alias "verisignclass3ca" | openssl x509 -inform der -text
will output:
Certificate:
    Data:
        Version: 1 (0x0)
        Serial Number:
            3c:91:31:cb:1f:f6:d0:1b:0e:9a:b8:d0:44:bf:12:be
        Signature Algorithm: sha1WithRSAEncryption
        Issuer: C=US, O=VeriSign, Inc., OU=Class 3 Public Primary Certification Authority
        Validity
            Not Before: Jan 29 00:00:00 1996 GMT
            Not After : Aug  2 23:59:59 2028 GMT
which is the same as before, but with sha1WithRSAEncryption instead of md2WithRSAEncryption.

Thus we can simply remove it from the chain and validate the chain normally with only Cert1 and Cert2. The validation from Cert2 up to "verisignclass3ca" will be done automagically.

To be able to do this, we need to create a custom X509TrustManager which behaves as follows:
  1. if the certificate's serial is the one of Cert1 (serial=xxxxxxxxxxxxxxxxxxxxx), then remove the last certificate of the chain and revalidate;
  2. in all other cases, validate the certificate chain normally.
Thus the custom X509TrustManager acts as a delegate, with one little variation (note: if someone knows an easier way to get the default TrustManager from the TrustManagerFactory, please leave me a comment). Refer to my article about the blind SSL factory for commons httpClient to see how to pass a custom TrustManager to an SSL factory.

My custom X509TrustManager class finally looks as follows:
import java.security.KeyStore;
import java.security.KeyStoreException;
import java.security.NoSuchAlgorithmException;
import java.security.cert.CertificateException;
import java.security.cert.X509Certificate;

import javax.net.ssl.TrustManager;
import javax.net.ssl.TrustManagerFactory;
import javax.net.ssl.X509TrustManager;

public class CustomX509TrustManager implements X509TrustManager {

    private static final String BADCERTIFICATESIGNATURE = "xxxxxxxxxxxxxxxxxxxx";

    private X509TrustManager delegate = null;

    public CustomX509TrustManager(KeyStore keystore) throws NoSuchAlgorithmException, KeyStoreException, CertificateException {
        // Instantiate the default X509TrustManager
        TrustManagerFactory factory = TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
        factory.init(keystore);
        TrustManager[] trustManagers = factory.getTrustManagers();
        if (trustManagers != null) {
            for (TrustManager trustManager : trustManagers) {
                if (trustManager instanceof X509TrustManager) {
                    delegate = (X509TrustManager) trustManager;
                    break;
                }
            }
        }
        if (delegate == null) {
            throw new CertificateException("Cannot find any instance of X509TrustManager");
        }
    }

    public X509Certificate[] getAcceptedIssuers() {
        // the constructor guarantees that delegate is not null
        return delegate.getAcceptedIssuers();
    }

    public void checkClientTrusted(final X509Certificate[] c, final String a)
            throws CertificateException {
        delegate.checkClientTrusted(c, a);
    }

    public void checkServerTrusted(final X509Certificate[] c, final String a)
            throws CertificateException {
        X509Certificate[] chain = c;
        // hardcoded test to be sure we are shortening the right certificate chain
        if (c.length == 3 && c[0].getSerialNumber().toString(16).equals(BADCERTIFICATESIGNATURE)) {
            chain = new X509Certificate[] { c[0], c[1] };
        }
        delegate.checkServerTrusted(chain, a);
    }
}
With this implementation I'm sure that my invalid certificate will continue to work, and that it cannot be faked (as it could be with a completely blind TrustManager). And I can also expect that when the provider updates its certificate, everything will continue to work as expected, because all other certificate chains are still validated the regular way.
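
For completeness, here is a minimal sketch of the wiring (my assumption, not code from the article above); passing a null KeyStore to the constructor makes the TrustManagerFactory fall back to the JRE's default cacerts:

SSLContext ctx = SSLContext.getInstance("TLS");
ctx.init(null, new TrustManager[] { new CustomX509TrustManager(null) }, null);
SSLSocketFactory factory = ctx.getSocketFactory();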

Java and the certificates mess: blind SSL factory for commons httpClient

There are parts of Java which are not very "quick development" oriented. The one I want to illustrate here is dealing with SSL.

When the application is in an early development stage, we don't want to bother with things like valid certificates. We don't want to get stuck for days because we haven't received a valid development SSL certificate from Verisign or another "more trusted than me" monopoly. We just want to activate "--no-check-certificate" à la wget, or "CURLOPT_SSL_VERIFYPEER" à la PHP.

Adding a new certificate to the cacerts file is not that complicated, but it is not always possible on every computer (should developers have administrator rights on their machines?), or it breaks the Debian package integrity and risks being overridden at the next JRE update.

No, for the development phase, a blind SSL factory is a good thing for productivity. Simply DO NOT FORGET TO REMOVE IT in the integration or pre-production phase. Having a valid and trusted SSL certificate in production is NOT optional.

What is a blind SSL factory

A blind SSL factory is simply an SSL factory which will validate any certificate. Expired, self-signed, but also man-in-the-middled or forged certificates will be trusted.

Mike McKinney already explained how to create a blind SSL factory for LDAP connections.

I will leverage his explanations to enable a blind SSL factory for commons httpClient.

The starting point is to create a TrustManager which trusts any certificate, and then give this TrustManager to the SSL factory. This results in a blind SSL factory :)

A trivial class that implements X509TrustManager by accepting everything:

class BlindX509TrustManager implements X509TrustManager
{
    public X509Certificate[] getAcceptedIssuers()
    {
        return null;
    }

    public void checkClientTrusted(final X509Certificate[] c, final String a)
    {
        // trust everything: no check is performed
    }

    public void checkServerTrusted(final X509Certificate[] c, final String a)
    {
        // trust everything: no check is performed
    }
}
Initializing the SSLContext to retrieve a SocketFactory:
SSLContext sc = SSLContext.getInstance("SSL");
sc.init(null, new TrustManager[] { new BlindX509TrustManager() }, new java.security.SecureRandom());
SocketFactory blindFactory = sc.getSocketFactory();
And finally, everything tied up together in the BlindSSLProtocolSocketFactory, which implements the SecureProtocolSocketFactory needed by httpClient. Put into the classpath of the application and registered (see below), it will override the https protocol and thus allow your code to connect to any certificate:
import java.io.IOException;
import java.net.InetAddress;
import java.net.Socket;
import java.net.UnknownHostException;

import javax.net.SocketFactory;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLSocketFactory;
import javax.net.ssl.TrustManager;

import org.apache.commons.httpclient.ConnectTimeoutException;
import org.apache.commons.httpclient.HttpClientError;
import org.apache.commons.httpclient.params.HttpConnectionParams;
import org.apache.commons.httpclient.protocol.ControllerThreadSocketFactory;
import org.apache.commons.httpclient.protocol.SecureProtocolSocketFactory;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class BlindSSLProtocolSocketFactory implements SecureProtocolSocketFactory
{
    private static final Log LOG = LogFactory.getLog(BlindSSLProtocolSocketFactory.class);

    private SocketFactory blindFactory = null;

    public BlindSSLProtocolSocketFactory()
    {
        blindFactory = createBlindSocketFactory();
    }

    public Socket createSocket(String host, int port, InetAddress clientHost, int clientPort) throws IOException, UnknownHostException
    {
        return blindFactory.createSocket(host, port, clientHost, clientPort);
    }

    public Socket createSocket(String host, int port, InetAddress localAddress, int localPort, HttpConnectionParams params)
        throws IOException, UnknownHostException, ConnectTimeoutException
    {
        if (params == null)
        {
            throw new IllegalArgumentException("Parameters may not be null");
        }
        int timeout = params.getConnectionTimeout();
        if (timeout == 0)
        {
            return createSocket(host, port, localAddress, localPort);
        }
        else
        {
            return ControllerThreadSocketFactory.createSocket(this, host, port, localAddress, localPort, timeout);
        }
    }

    public Socket createSocket(String host, int port) throws IOException, UnknownHostException
    {
        return blindFactory.createSocket(host, port);
    }

    public Socket createSocket(Socket socket, String host, int port, boolean autoClose) throws IOException, UnknownHostException
    {
        return ((SSLSocketFactory) blindFactory).createSocket(socket, host, port, autoClose);
    }

    public boolean equals(Object obj)
    {
        return obj != null && obj.getClass().equals(getClass());
    }

    public int hashCode()
    {
        return getClass().hashCode();
    }

    private static SocketFactory createBlindSocketFactory()
    {
        try
        {
            SSLContext context = SSLContext.getInstance("SSL");
            context.init(null, new TrustManager[] { new BlindX509TrustManager() }, null);
            return context.getSocketFactory();
        }
        catch (Exception e)
        {
            LOG.error(e.getMessage(), e);
            throw new HttpClientError(e.toString());
        }
    }
}
And to register the new protocol as the default https protocol:
Protocol.registerProtocol("https", new Protocol("https", (ProtocolSocketFactory) new BlindSSLProtocolSocketFactory(), 443));

Please note that if you only want to access a self-signed certificate, you had better use the EasySSLProtocolSocketFactory from the httpClient contribs. It still validates the self-signed certificate, so expired or bad certificates will be denied.

Sunday, April 25, 2010

Implementing unique constraints with Cassandra

I was talking about my "Doodle clone with Cassandra backend" in a previous post; here I want to explain a little bit how I implemented two kinds of constraints with Cassandra.

Unique name constraint

When subscribing to a poll, a unique constraint on the name of the subscriber is applied. There is a first check in JavaScript in the GUI, but the uniqueness of the names is also checked on the server side.

Before going deeper, I just need to say a word about the structure used:

To store the subscribers, I use a SuperColumnFamily, with the ID of the poll as the key. For a poll with key "1234", it gives me something like:

"1234" => {
"subscriberKey1" => { name = "subscriberName1", options = ... }
"subscriberKey2" => { name = "subscriberName2", options = ... }
...
}
The trick with the subscriberKey* is to generate a unique ID based on a timestamp, so that the entries are sorted by Cassandra.
The key will be something like: currentTimestamp - ClusterNodeId - ThreadId. Adding the ThreadId to the key gives me the guarantee that a key based on System.currentTimeMillis() will be unique, because the processing will unfortunately take more than a millisecond :)
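
As a small sketch (my own illustrative helper, not the actual project code), such a key generator could look like:

// Build a subscriber key that sorts chronologically in Cassandra and stays
// unique across cluster nodes and threads.
public static String subscriberKey(String clusterNodeId) {
    long timestamp = System.currentTimeMillis();
    long threadId = Thread.currentThread().getId();
    // zero-pad the timestamp so that lexicographic order matches chronological order
    return String.format("%013d-%s-%d", timestamp, clusterNodeId, threadId);
}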

Using this key, I write the subscriber row with ConsistencyLevel.QUORUM, and then I read all the rows up to the one I have just inserted (with ConsistencyLevel.QUORUM again).
At this point I use Cassandra magic (rows are ordered) to be sure that I have selected all the subscribers who subscribed before the current one.
It remains for me to check the uniqueness of the name manually: if another row has the same name, I simply remove the current subscriber.

Limiting the number of subscribers per option

Another constraint I have implemented is the limitation of the number of subscribers per option.

The principle is the same as for the unique name constraint: I count the number of previously inserted subscribers.
One point to keep in mind is that we can read a row with a duplicated name, which will be deleted later, so it should not be taken into account for this constraint. This means that we need to check that we are not counting duplicated names in this constraint too...

But the good news about this constraint is that it shows a way to implement counters in Cassandra. Maybe I will write another post about this topic. The bad news is that we need to check for duplicated names in several places.

Limitations of the model

Complexity: adding more constraints grows the complexity in O(n^2), as every new constraint (e.g. duplicated names) has to be implemented in every other constraint (e.g. the limitation of subscribers per option).

Scalability: this model will scale very well for billions of polls with some tens of users each. The inverse, some tens of polls with billions of users, requires a lot of resources for constraint checking, which will make performance drop.