Transterrestrial Musings  


Biting Commentary about Infinity, and Beyond!

Regex Bleg

Can someone give me a regular expression for my blacklist that would disallow any four-consonant (lower case) string? I've been getting a lot of spam lately like this one:

Name: Ryan
Email Address: ron@fromru.com
URL: http://cowbtclt.com/gcqj/uqml.html

Comments:

Well done!
[url=http://cowbtclt.com/gcqj/uqml.html]My homepage[/url] | [url=http://vprqmclp.com/rsyx/vvwl.html]Cool site[/url]

None of these seem to be real domains, so I don't know what the point is, but they all seem to have at least four consonants in a row. I figure that there are few real words like this, at least in English, so it would keep out the riff raff without impeding genuine commenters.

[Update on Friday night]

OK, as a commenter has pointed out, this would preclude some actual English words (like "strength"). So let's go for five consonants. My goal is to err on the side of letting good posts through.

[Update on Saturday night]

It doesn't catch them all, but I did come up with a good trap for them: q[^ua\ \.\,]

Anything with a "q" in it followed by anything other than a "u" or "a" (or a space, period or comma, so we can write "Iraq") is blocked. A lot of these things have "q"s inserted in them.

[Another update, a few minutes later, after testing]

I'm getting a lot of false positives.
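
A quick way to see where the false positives may come from is to run the q trap against a few strings. This is only a sketch in Python (the blacklist itself runs inside Movable Type, whose regex dialect may differ slightly), and the sample strings other than the spam URL are made up for illustration.

```python
import re

# The trap from the post: "q" followed by anything except u, a, space, period, or comma.
Q_TRAP = re.compile(r"q[^ua\ \.\,]")

def is_blocked(text: str) -> bool:
    """Return True if the q trap fires anywhere in the text."""
    return Q_TRAP.search(text) is not None

samples = [
    "http://cowbtclt.com/gcqj/uqml.html",  # spam URL from the post
    "quick question about Iraq.",          # ordinary prose, every q is allowed
    "Iraqi elections",                     # q followed by "i"
    "see the faq!",                        # q followed by "!"
]
for s in samples:
    print(is_blocked(s), s)
```

The last two samples are blocked even though they are ordinary text: any "q" followed by a letter other than "u" or "a", or by punctuation other than a period or comma, trips the pattern.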

Posted by Rand Simberg at March 31, 2006 12:30 PM
Comments

/[^AEIOUaeiou]{1,4}\.html

Note that this doesn't worry about symbols being part of your 'four consonant' string either. I can't think of anything I'd name that way: hh$hh.html, rc#d.html. Ick.
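
Here is a rough Python check of what that pattern does and doesn't catch. The example.com URLs are hypothetical and only there to show the behavior; the other two are the spam URLs from the post.

```python
import re

# A slash, then one to four characters that are not vowels, then ".html".
PAGE_NAME = re.compile(r"/[^AEIOUaeiou]{1,4}\.html")

urls = [
    "http://vprqmclp.com/rsyx/vvwl.html",  # spam URL from the post
    "http://cowbtclt.com/gcqj/uqml.html",  # spam URL whose page name starts with "u"
    "http://example.com/rc#d.html",        # symbols count as non-vowels
    "http://example.com/2006.html",        # so do digits
    "http://example.com/index.html",       # ordinary page name
]
results = {u: PAGE_NAME.search(u) is not None for u in urls}
for url, hit in results.items():
    print(hit, url)
```

Because the class is "anything that is not a vowel", a page named with digits (a year, say) would be caught along with the symbol cases.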

Posted by Al at March 31, 2006 01:32 PM

Try this:

.*[bcdfghjklmnpqrstvwxyz]{4}.*

This will match the entirety of any string that contains four consecutive consonants. However, that would also match strings such as "http" and "html." What you might want instead is a regexp that looks for any string that contains four consonants between two slashes:

.*/[bcdfghjklmnpqrstvwxyz]{4}/.*

This may still generate false positives (eg, "http://foo.com/html/goodpage.htm"), but fewer of them. You could also try matching any string that has a slash, four consonants, and a .html ending:

.*/[bcdfghjklmnpqrstvwxyz]{4}\.html.*

However, this won't catch the first URL in the example comment you posted, since that URL contains a "u".

(You can also take Al's suggestion and swap the "bcd..." string with "^AEIOUaeiou" to find any character that's not a vowel, as opposed to any letter that's a lowercase consonant.)
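
The three patterns above can be compared side by side. This is a minimal Python sketch, assuming Python's `re` behaves like the blacklist's regex engine for these simple patterns; the variable names are mine.

```python
import re

CONS = "[bcdfghjklmnpqrstvwxyz]"

anywhere = re.compile(".*" + CONS + "{4}.*")            # four consonants anywhere
between_slashes = re.compile(".*/" + CONS + "{4}/.*")   # four consonants between slashes
html_page = re.compile(".*/" + CONS + r"{4}\.html.*")   # slash, four consonants, ".html"

spam_1 = "http://cowbtclt.com/gcqj/uqml.html"
spam_2 = "http://vprqmclp.com/rsyx/vvwl.html"
legit = "http://foo.com/html/goodpage.htm"

def hit(pattern, text):
    """Return True if the pattern matches somewhere in the text."""
    return pattern.search(text) is not None

print(hit(anywhere, "http"), hit(anywhere, "html"))              # both trip the first pattern
print(hit(between_slashes, spam_1), hit(between_slashes, legit)) # spam caught, but so is /html/
print(hit(html_page, spam_1), hit(html_page, spam_2))            # "uqml" slips through
```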

Posted by Zach Heaton at March 31, 2006 01:43 PM

Thanks, except I don't need to exclude upper case vowels, just lower. These things pretty much invariably come in as lower case domain strings. Also, why the ".html"? I'm going after the domain, not the page. All I really need to exclude is the string of four consonants, regardless of where it appears in the comment or ping. Also, I need at least four, not between one and four (that would exclude most of the words in the English language).

Shouldn't it be /[^aeiouy]\{4,\} ?

Posted by Rand Simberg at March 31, 2006 01:49 PM

Sorry, that first reply was to Al. Again, I'm not trying to look for the whole domain. Zach's solution seems too complicated. I'm just looking for a string of at least four consonants (in which "y" counts as a vowel).
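
For comparison, here is a small Python sketch of "at least four consonants, with y as a vowel" written two ways: as a negated class (not a vowel) and as an explicit list of consonants. In Python's `re` syntax the braces are written unescaped; the names `NOT_VOWEL`, `CONSONANT`, and `runs` are just labels for this example.

```python
import re

NOT_VOWEL = re.compile(r"[^aeiouy]{4,}")               # 4+ characters that are not a, e, i, o, u, y
CONSONANT = re.compile(r"[bcdfghjklmnpqrstvwxz]{4,}")  # 4+ lowercase consonants only

def runs(pattern, text):
    """Return every run in the text that the pattern matches."""
    return pattern.findall(text)

for text in ["Well done!", "http://", "cowbtclt.com", "strength"]:
    print(repr(text), runs(NOT_VOWEL, text), runs(CONSONANT, text))
```

The negated class counts spaces and punctuation as "consonants", so it fires on "ll d" in "Well done!" and on all of "http://". The explicit list only fires on real consonant runs, though it still catches "ngth" in "strength".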

Posted by Rand Simberg at March 31, 2006 01:53 PM

Hopefully nobody has the word "strength" or "html" or "amcgltd" in their URL.

Rather than trying to block all possible offending URLs, why not add a word verification? It works for all those bl0gsp0t blogs.

Posted by Ed Minchau at March 31, 2006 02:01 PM

My regexp is rusty-to-nonexistent, but I'm beginning to think:

.*/[bcdfgjklmnpqrstvwxyz][bcdfghjklmnpqrsvwxyz][bcdfghjklnpqrstvwxyz][bcdfghjkmnpqrstvwxyz]/.*

might be useful.

Posted by Phil Fraering at March 31, 2006 03:29 PM

I had a leading '/' and a trailing '\.html' so that it would trigger on the four consonants in domain.com/bcdf.html. So it won't trigger on words that happen to have four consonants in random places, only if they're in the 'name of the web page', which they are in the example you presented. (Except for the u.)

As you noticed, [^aeiou] triggers on any consonant. If you only want to scan the piece between http:// and the next slash, part of Phil's line looks best.

.*http://[^/]*[bcdfgjklmnpqrstvwxyz]{4}\.com.*
which is the same as:
.*http://[^/]*[bcdfgjklmnpqrstvwxyz][bcdfgjklmnpqrstvwxyz][bcdfgjklmnpqrstvwxyz][bcdfgjklmnpqrstvwxyz]\.com.*

This means 'match anything up to "http://", then any number of non-slash characters, then exactly 4 letters from the list [bcdfgjklmnpqrstvwxyz], followed immediately by ".com", and then optionally more characters.' All of the criteria need to be met for a match, so four consonants that aren't in a domain name that's in a URL won't match. (And add '.ru' as a second domain name just on general principles.)
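
As a sanity check, the domain pattern can be tried in Python against the spam from the post and a couple of ordinary URLs. This is a sketch only; the example.com and msnbc.com test strings are my own choices, not anything from the spam.

```python
import re

# Four consonants from the list (which omits "h", as written above) right before ".com",
# somewhere between "http://" and the next slash.
DOMAIN_TAIL = re.compile(r".*http://[^/]*[bcdfgjklmnpqrstvwxyz]{4}\.com.*")

def flagged(text):
    """Return True if the domain pattern matches somewhere in the text."""
    return DOMAIN_TAIL.search(text) is not None

print(flagged("http://cowbtclt.com/gcqj/uqml.html"))                       # spam domain
print(flagged("[url=http://vprqmclp.com/rsyx/vvwl.html]Cool site[/url]"))  # spam inside a tag
print(flagged("http://www.example.com/html/page.html"))                    # "html" is past the domain
print(flagged("http://www.msnbc.com/"))                                    # real domain ending in "snbc"
```

The third case shows the point of anchoring to the domain: "html" in the path no longer matters. The fourth shows that a legitimate domain with four consonants before ".com" would still be caught.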

BTW: in both Zach's and mine, we _want_ the '{' treated as a special character. The {4} means 'stuff immediately to the left must happen exactly four times'; [^aeiou]{1,4} would mean 'match when non-vowels are the _only_ characters between the slash and the .html, as long as there are 1 to 4 of them'.
(So somewhere.com/c.html, somewhere.com/cc.html, somewhere.com/ccc.html, somewhere.com/cccc.html) By limiting the expression with the leading slash and trailing '.html', we're limiting 'matches' to URLs. Or partial URLs I suppose.

I use www.regular-expressionsDOTINFO/reference.html as my regexp reference. I hope this is useful. (Your spam filter wouldn't allow dot info :D)

Posted by Al at March 31, 2006 05:19 PM

Actually, most of these are pretty bad, because you'll hit the non-domain part of the url like "http" or "html" or even "p://" or "rg/d" (as in ".org/default.htm"). What you want is to first isolate the part that is the domain element of the url and then to search for consonants. For example:

/^[^\/]+\/\/[^\/]*?[bcdfgjklmnpqrstvwx z]{4}[^\/]*?\//i

Basically this looks for a string which follows the form of non-slash characters at the start (e.g. "http:"), then doubled slashes, then any number of non-slash characters bordering 4 consonants (e.g. www.flrgpb.com), then a single slash.

I'm not sure how efficient this is, but I think it will work better than the others. Also, there are probably easier-to-follow (read: more maintainable) ways to do this a little more programmatically: matching the domain segment and extracting it, then clipping off the TLD(s) and performing a match on the "core" domain name.
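
The programmatic version might look something like the following Python sketch. It is not a Movable Type plugin, just an illustration of the steps: pull URLs out of the comment, take the host name, drop a leading "www" and the last label (a naive stand-in for proper TLD handling), and test what remains for a consonant run. All function names here are hypothetical, and "y" is treated as a vowel.

```python
import re
from urllib.parse import urlsplit

URL_RE = re.compile(r"https?://[^\s\[\]<>]+")
CONSONANT_RUN = re.compile(r"[bcdfghjklmnpqrstvwxz]{4,}")

def core_labels(url):
    """Host labels with a leading 'www' and the final label (assumed to be the TLD) removed."""
    host = (urlsplit(url).hostname or "").lower()
    labels = [label for label in host.split(".") if label]
    if labels and labels[0] == "www":
        labels = labels[1:]
    return labels[:-1]  # naive: does not understand multi-part TLDs such as "co.uk"

def looks_like_spam_domain(url):
    """True if any core label of the host contains four or more consonants in a row."""
    return any(CONSONANT_RUN.search(label) for label in core_labels(url))

def comment_is_suspect(comment):
    """True if any URL found in the comment has a suspicious domain."""
    return any(looks_like_spam_domain(u) for u in URL_RE.findall(comment))

spam = ("Well done!\n"
        "[url=http://cowbtclt.com/gcqj/uqml.html]My homepage[/url] | "
        "[url=http://vprqmclp.com/rsyx/vvwl.html]Cool site[/url]")
print(comment_is_suspect(spam))
print(looks_like_spam_domain("http://foo.com/html/goodpage.htm"))
```

With this approach "http", "html", and path segments never enter the test. A real domain that itself contains four consonants in a row would still be flagged, so the threshold remains a judgment call.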

Posted by Robin Goodfellow at March 31, 2006 05:29 PM

"strengths"

stre[ngths]
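
To make the point concrete, here is a quick Python check of the five-consonant threshold (with "y" counted as a vowel) against a few words and the spam domains from the post. The word list is mine, chosen for illustration.

```python
import re

FIVE_CONSONANTS = re.compile(r"[bcdfghjklmnpqrstvwxz]{5,}")

words = ["strength", "strengths", "nightclub", "witchcraft", "html", "cowbtclt", "vprqmclp"]
hits = {w: bool(FIVE_CONSONANTS.search(w)) for w in words}
for word, hit in hits.items():
    print(hit, word)
```

"strength" and "html" get through at five, but "strengths", "nightclub", and "witchcraft" do not, so raising the threshold reduces false positives without eliminating them.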

Posted by Jim C. at April 1, 2006 12:03 AM

Why not ban "url=" or "[/url]" instead?
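
A sketch of that idea in Python, assuming legitimate commenters on the site never use the bracketed url tag (that is my assumption, not something stated in the post); the function name is hypothetical.

```python
import re

# Matches the opening "[url" or closing "[/url" of a bracketed link tag, in any case.
BBCODE_URL = re.compile(r"\[/?url\b", re.IGNORECASE)

def has_bbcode_link(comment):
    """Return True if the comment contains a [url=...] or [/url] tag."""
    return BBCODE_URL.search(comment) is not None

print(has_bbcode_link("[url=http://cowbtclt.com/gcqj/uqml.html]My homepage[/url]"))
print(has_bbcode_link("My url is http://www.transterrestrial.com/"))
```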

If you're running MT 3.2, look at this. It has a reference to my file of filters, which you might find interesting.

Posted by Annoying Old Guy at April 3, 2006 03:07 PM
