DoubleI receives a few comments
Yikes, I'm back on track with my reading and work and development and other stuff. That means I can get back to being a good boy and blogging some more stuff and doing my part to contribute to the greater good with GA and GP stuff for the .NET community and doing some uggily-buggily math on some JJBR stuff. I've been doing some F# stuff as well. You can find the recent F# release information in this post on Dr. Syme's blog. So, yessir, that's me by the JJBR servers doing the happy dance. Let's see if I can't climb back into the zone. My brain finally feels unrestricted.
But today, we have to address this matter of CAPTCHA garbage. DoubleI has been getting comment spam on his blog (that is protected by CAPTCHA). It sounds as though we're talking about a significant volume as it takes 10-20 minutes per day to clean up.
Wow, before I forget, DoubleI (Mr. William G. Ryan) now has an official nickname among the JJBR crowd. He's an Irishman who is not afraid to mix it up a bit on matters where he is competent. I could call him The Irish Instigator, but DoubleI (both the person and the moniker) is way too much cooler than that. Besides, I've already abused the definite article "the" in monikers for the lower cased one and The Armed Geometer. What most might now find disturbing is that "DoubleI" has been added to my spellcheckers as a valid word.
Disclaimer: No, I am not a spammer. I don't engage in activities that assist spammers. I'm genuinely concerned about the state of affairs that we face in this silly spam battle.
So DoubleI just wrote about a recent comment spam attack on his blog. According to his post, the comment API was not attacked; rather, these comments originated from the web page, i.e. the CAPTCHA used to protect this blog (and others on msmvps.com) was ineffective. Here's an excerpt (a little explicit language warning):
So looks like the lower kced one [lco] .. Yep, I just got NAILED with BLOG SPAM last night, and it wasn't from the API either. It's a great spam too.. encouraging people to spam the sh1t out of public newsgroups a minimum of 200 times each so that they can 'honestly' make money.
DoubleI is smart enough to realize that the CAPTCHA used on msmvps.com can be beaten. Let me take that a step further.
Let's start with a quick review. If you aren't familiar with CAPTCHA, it is an acronym for:
Completely Automated Public Turing Test to tell Computers and Humans Apart
CAPTCHA is a form of challenge-response test. In a security system, a challenge-response test simply presents one or more challenges and collects the responses. If the responses are adequate to satisfy the challenges, the result of the test is PASS. If not, the result of the test is FAIL. Challenge-response tests are typically two-state tests, i.e. once the challenges have been presented and the responses collected, the result is either PASS or FAIL. There are no other states, e.g. KINDAPASS, MAYBEJUSTTHISONCEILLLETYOUIN, etc.
UserID/Password entry on most computers is a simple form of challenge-response test. Provide a UserID with the corresponding Password and you pass the test and are authenticated for use of the computer.
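To make the two-state idea concrete, here is a minimal Python sketch of a challenge-response check. It is only an illustration: the names `make_challenge`, `check_response` and `Result`, the six-character length and the uppercase-plus-digits alphabet are all assumptions for the example, not what msmvps.com or any particular CAPTCHA package actually does.

```python
import hmac
import secrets
import string
from enum import Enum


class Result(Enum):
    """The only two states a challenge-response test can end in."""
    PASS = "PASS"
    FAIL = "FAIL"


def make_challenge(length: int = 6) -> str:
    """Pick the answer the presenter expects back (hypothetical helper)."""
    alphabet = string.ascii_uppercase + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(length))


def check_response(expected: str, response: str) -> Result:
    """Compare the response to the expected answer; no partial credit."""
    # Normalise case and surrounding whitespace, then compare in constant time.
    candidate = response.strip().upper().encode()
    ok = hmac.compare_digest(expected.encode(), candidate)
    return Result.PASS if ok else Result.FAIL
```

A UserID/Password check has the same shape: the stored credential plays the role of `expected` and whatever the user types plays the role of `response`.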
A Turing Test (TT) as described in the above acronym for CAPTCHA is a test proposed by Alan Turing more than 50 years ago. Dr. Turing, a pioneer in computing, mathematics and cryptanalysis, proposed the test in a 1950 publication titled "Computing Machinery and Intelligence". The test provides for an observer to interact with two parties; one a computer and the other a human. This is done in such a way as to hide the identities and other observable properties of the parties. The observer is only capable of communicating with the two parties through a natural language conversation (as originally proposed, the natural language conversation would occur on teletype machines). If the observer is unable to determine (or judge) which of the two parties is human in some reliable form, the computer is said to pass the Turing Test.
Let's look at the iterations of co-evolution of the CAPTCHA. The parties in the co-evolution are the presenter and the adversary. Presumably, the adversary has some incentive (financial or otherwise) to bypass the CAPTCHA as given by the presenter. The presenter in this case is the blog owner/host. The adversary is any party that has non-extradition protection and spam capacity.
Let's introduce the term high-certainty here. High-certainty for the presenter is the state of the system in which humans pass the test at a near 100% rate and computers pass the test at or below some low threshold, typically less than 0.1% or some other low number. For the adversary, the pass rate is a measure of how much deterrence is left. Take 10% as an example: if the success rate of bypassing the CAPTCHA is 10% and I'm spamming, 1 in 10 messages will get through. I can then measure the capacity of my server farm and determine whether this is an acceptable success/failure rate (it is likely sufficient for a significant spam run). High-certainty for the adversary is the ability to pass the test at a rate high enough that the capacity of the server farm performing the spam distribution still yields a worthwhile run, i.e. the 10% above would be considered high-certainty. If the rate is significantly higher, it reduces the distribution cost. Note that the utility function for spam, i.e. the payoff for a successful connection to sell product, commit fraud, etc., is a low-yield function that makes up for that yield through volume. The cost of improving the certainty is measured against the cost of distribution. The point where these two meet is the equilibrium in cost. Improving certainty is cheap on this curve only up to the point of equilibrium.
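The arithmetic behind that trade-off can be sketched in a few lines of Python. Everything here is a toy model: the function names and every number used with them are made up for illustration, and real costs would depend on the adversary's actual infrastructure.

```python
def delivered(attempts: int, pass_rate: float) -> float:
    """Expected number of messages that get past the test."""
    return attempts * pass_rate


def cost_per_delivered(cost_per_attempt: float, pass_rate: float) -> float:
    """Distribution cost of one delivered message; infinite if nothing passes."""
    if pass_rate <= 0:
        return float("inf")
    return cost_per_attempt / pass_rate


def worth_improving(current_rate: float, improved_rate: float,
                    cost_per_attempt: float, target_delivered: int,
                    improvement_cost: float) -> bool:
    """True while the distribution savings exceed the cost of better certainty."""
    before = target_delivered * cost_per_delivered(cost_per_attempt, current_rate)
    after = target_delivered * cost_per_delivered(cost_per_attempt, improved_rate)
    return (before - after) > improvement_cost
```

With an invented cost of 0.001 per attempt, a 10% pass rate puts each delivered message at 0.01. Moving from 10% to 50% saves far more per delivered message than moving from 50% to 90%, which is the flattening toward equilibrium described above.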
Iteration 1
CAPTCHA uses the above principles to provide a challenge-response test when used in blogs to protect the blog from comment spam. The usual test is a visual presentation of 6 characters, typically delivered as a JPG or GIF. To pass the test, the human reads the picture, identifies the 6 characters and enters them in the response text box. A computer presumably cannot pass the test because the picture is pixels and not ASCII text. Therefore, the CAPTCHA as a TT is an attempt to separate human entry from computer entry. At this simplest iteration, the CAPTCHA is a valid test that distinguishes between the computer and the human. Since the computer fails the test, i.e. cannot provide the response to the simple challenge, the comment is not accepted. On DoubleI's (wow that's a scary looking possessive form) blog, the error message is "Invalid Human Proof". Nice. So at the end of this iteration, the presenter has the appropriate protection.
Iteration 2
Because of advances in Optical Character Recognition (OCR) and other visual perception technologies, the CAPTCHA picture described above can now be handed to simple OCR software for character identification. In this iteration, the simple picture with characters has a high probability of being recognized (for most OCR, failure of one character to be recognized is typically a 5 sigma independent event). So by the end of this iteration, the adversary has a method to bypass, with high certainty, the protections of the CAPTCHA.
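Because the per-character failures are treated as independent, the chance of reading the whole picture is just the per-character success rate raised to the number of characters. Here is a small Python sketch of that calculation. Reading "5 sigma" as a two-sided normal tail probability is my assumption, and the function names are hypothetical.

```python
import math


def tail_probability(sigmas: float) -> float:
    """Two-sided normal tail: chance of landing more than `sigmas` from the mean."""
    return math.erfc(sigmas / math.sqrt(2))


def full_read_probability(per_char_failure: float, length: int = 6) -> float:
    """Chance that all `length` characters are recognized, assuming independence."""
    return (1.0 - per_char_failure) ** length


# Assumed reading of "5 sigma": roughly 5.7 failures per 10 million characters.
clean_failure = tail_probability(5)
clean_read = full_read_probability(clean_failure)
```

Under that reading, `clean_read` is indistinguishable from 100% for practical purposes. The same formula also shows how much damage the presenter has to do to the OCR: a per-character failure rate of 10% still leaves the adversary reading about 53% of six-character pictures.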
Iteration 3
Having proven ineffective, the CAPTCHA must now progress to a point where the text is still readable by humans, but the adversary's software fails on it with high certainty. So the CAPTCHA presenter provides ways to modify the picture such that interference is introduced in the form of color changes and miscellaneous background noise, e.g. dots, dashes, random lines, etc. Additionally, the letters in the presentation may be made to look "wavy" or "bent" or otherwise modified to increase the failure rate of standard OCR. At the conclusion of this iteration, the presenter reclaims the high-certainty position.
Iteration 4
The adversary cannot allow the presenter to block usage of the resource because the resource provides some incentive (again, incentive is abstract, but presumably financial). So the adversary introduces other forms of visual recognition. The methods used may come from reductionist mathematics, AI or other areas. At the end of this iteration, the adversary is placed in the high-certainty position.
Other Forms
We, you, me, anyone else trying to use/view the web, find ourselves in iteration 4. The adversary currently has the high-certainty position. There are very few current CAPTCHA programs that cannot be bypassed.
So during the above discussion, the use of characters was the test. The human presumably could detect and identify the characters while the computer should *not* have been able to detect and identify. Any further obfuscation or introduction of interference into this form of CAPTCHA is not likely to improve these results, i.e. the computers win; humans lose.
Other forms of CAPTCHA are also available, but are subject to similar problems. The most common form is the picture CAPTCHA as described here and by the lower cased one here. While computers can be trained to recognize the subject of pictures, the introduction of interference into the pictures can make for higher certainty results for the presenter. Again though, interference is just the application of a math function to a domain, i.e. unless the interference is one-way hard (and P!=NP or other complexity silliness), the discovery of algorithms to overcome the introduction of interference should emerge based on the utility function and cost to protect the content.
Other forms of picture CAPTCHA can be circumvented with simple social engineering. Take the example of a site that hosts adult material that also is in the business of spamming. Provide free downloads of adult content to users of the adult site by asking the user to identify THE SUBJECT OF A PICTURE. The picture presentation is taken from the candidate CAPTCHA library. So if the target CAPTCHA library contains 1,000,000 pictures, the adult site host/spammer needs some fraction (as measured by the equilibrium of the cost of certainty) of the 1,000,000 pictures identified to perform an attack on picture CAPTCHA in this manner. Given the traffic to adult sites, this volume seems trivial.
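A rough sense of why this volume seems trivial comes from a simple coverage calculation, sketched below in Python. The model is an assumption on my part: pictures are shown uniformly at random with replacement, every human answer is correct, and the pass rate against the library equals the fraction already identified. The function names are hypothetical.

```python
import math


def pass_rate_from_coverage(identified: int, library_size: int) -> float:
    """Pass rate if a random challenge is solvable only when already identified."""
    return identified / library_size


def views_needed(coverage: float, library_size: int) -> int:
    """Random presentations needed before the expected identified share reaches `coverage`."""
    if not 0 <= coverage < 1:
        raise ValueError("coverage must be in [0, 1)")
    if library_size < 2:
        raise ValueError("library_size must be at least 2")
    # Expected share identified after n views is 1 - (1 - 1/N)**n; solve for n.
    return math.ceil(math.log(1 - coverage) / math.log(1 - 1 / library_size))
```

Under those assumptions, `views_needed(0.10, 1_000_000)` comes out only a little above 100,000 presentations, since repeats are rare while most of the library is still unseen.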
I would normally approach Interactive Proof Systems and Zero-Knowledge Proofs at this point. These likely could provide some relief from the current state of affairs, albeit temporary relief. I'll leave it to the reader to review these works.
Let's state the real problem out loud:
Testing for the difference between a human and an automated computing system is becoming increasingly difficult. While we haven't achieved a successful TT, we have now crossed a boundary whereby the challenge-response test cannot be both simple and short and still be effective.
Essentially, I'm trying to politely state that if tests to reduce spam (in this application of CAPTCHA or other challenge-response test) are to be effective, they likely must become more difficult for the human user. Again, the utility function of the cost of use is laid on the shoulders of the legitimate user rather than the adversary.
Let me recommend a new CAPTCHA backronym:
Can't Always Protect The Content for Horrific Acquisition
So I feel bad for DoubleI, but the state of affairs finds us in a situation where the adversary has the upper hand.
-----
As usual, I can be reached for comment here. If you are a proponent of CAPTCHA, please include some semblance of a defense of the theory or a particular implementation.