HORRORBASE AND OTHER DATA FROM THE CRYPT

by Steve Lopez

Back when I performed full-time sales and tech support for ChessBase USA, an interesting phone question I received from time to time went something like, "How can you people justify selling unannotated games when thousands of games are available free over the Internet?"

I would explain to the caller that ChessBase data is checked for accuracy, the headers are complete and standardized, and that such error checking requires time and manpower, hence the charge for the data disks.

To which some doubting Thomases replied, "Yeah, right".

Recently, ChessBase introduced a new CD-ROM database called Horrorbase, over 900,000 games for about $30. A poster on one of the Internet chess newsgroups theorized that ChessBase is offering so many games for such a low price because "the company has finally realized that the bottom has dropped out of the data market".

To which I reply, "Yeah, right".

Don't get me wrong; the Internet is a great way to add data to your collection. But keep in mind that, in general, nobody is checking this data for accuracy, plus the headers of the games are certainly not standardized! And you have absolutely no idea as to the quality of the games you're getting.

Don't believe it? Check the games on a ChessBase data disk and compare them to some of the downloadable files you'll find on the 'Net.

Let's go back and look at some of my recent acquisitions from the ever-popular University of Pittsburgh chess FTP site, and you'll soon see what I mean. All of these games were downloaded in ChessBase .cbf format.

I obtained 83 games from the 1995 Australian Superleague. The headers contain the players' last names, followed by their full first names, unseparated by a comma. Someone named "Ramakrishna" had no first name available. No ratings are given for the players.

The 1997 Gausdal Open games come pretty close to the accepted ChessBase header standard, but no ratings were given for the players.

The 1997 Ano Liosia tournament in Athens gives ratings, but these appear in the same header field as the players' names, instead of in the standard Elo fields. Also, the players' first names seem to have been abbreviated pretty much wherever the person who entered the data felt like it; some games give a one-letter first initial, others give a multi-character abbreviation. Some of the first names are given in full.

Some of the games from the 1997 New York Open are incomplete, and some have empty "source" fields.

And the Belgian Team Championship database I downloaded was a travesty. None of the games from Rounds 1 through 6 had a year given, some had incomplete source information, Round 9's source field is filled out differently from the other rounds, and the games switch back and forth from giving full first names to just initials. Try generating a crosstable from this mess!

Now, to be honest, the games are good to have. But forget about trying to standardize the data; you'll spend more time fiddling with the headers than you will playing through the games.
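For anyone who wants to try anyway, the purely mechanical part of that fiddling can be roughed out in a few lines of Python. This is only a sketch under stated assumptions: it works on name strings already exported from a database, the function name normalize_player is made up for illustration, it assumes the last name comes first when there is no comma, and it treats any four-digit number in the name field as a rating. It has nothing to do with ChessBase's own tools.

```python
import re

# Assumption: a four-digit number inside the name field is a misplaced rating.
RATING = re.compile(r"\b(\d{4})\b")

def normalize_player(raw):
    """Return (name, rating): name in 'Last,F' form, rating as int or None."""
    text = raw.strip()
    rating = None
    match = RATING.search(text)
    if match:
        rating = int(match.group(1))
        # Cut the rating out of the name and tidy leftover punctuation.
        text = (text[:match.start()] + text[match.end():]).strip(" ,()")
    if not text:
        return "", rating
    if "," in text:
        last, first = (part.strip() for part in text.split(",", 1))
    else:
        # Assumption: "Last First" order, as in the Superleague headers.
        parts = text.split()
        last, first = parts[0], " ".join(parts[1:])
    if not first:
        return last, rating  # no first name available
    return f"{last},{first[0].upper()}", rating

print(normalize_player("Smith John"))        # placeholder name, not from the files
print(normalize_player("Ramakrishna"))
print(normalize_player("Player, Abc 2310"))  # placeholder name and rating
```

A script like this cannot tell "First LAST" from "Last First", and it will happily mistake a year typed into the name field for a rating, so the output still needs a human eye.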

While we're on the subject of data quality, let's forget for a moment that many of the games on the 'Net are of questionable quality as far as the strength of the players goes. A more important question is this: how do you know you're getting what you think you're getting?

I'll give you a hypothetical example. A few years ago I played a really great game in a tournament. According to Fritz 3, nearly every move I played was one of the three best moves available in the position (I can assure you that all of this was purely accidental on my part). Now what is to stop me from changing my name in the game header to "Reti,R" and my opponent's name to that of some poor minor slob of the 1920's and slipping this game into an uploaded collection of Richard Reti games? Sure, a strong player or an adept historian would be able to tell that Reti didn't play it, but what about the thousands of unsuspecting players who dump this game indiscriminately into their databases?

Let me assure you that I would never tarnish Reti's good name in this manner. But other people may not be so stricken by conscience. If you've downloaded games from a BBS, an online service, or the Internet you may have games lurking in your database right now that were not even played by the players who are listed in the header.

So what separates the data you can purchase from ChessBase from the data you can download for free? Quality control. ChessBase data disks contain games by strong players, with standardized game headers, and the moves checked for accuracy.

In today's on-line age, anybody can put together a database of a million games. In fact, that's pretty much what ChessBase did with Horrorbase. It's over 900,000 games (I believe from the Internet and CompuServe) just dumped together and slapped on a CD. No editing, no error-checking, no nothing. Let's take a closer look at the terrors that await us within Horrorbase:

First of all, we notice holes in the headers big enough to drive a truck through. In fact, Game #1 has no player or tournament info. It's just a six-move line from the A46 Torre Attack. Not a particularly great line, either: 1.d4 Nf6 2.Nf3 e6 3.Bg5 c5 4.c3 h6 5.Bxf6 Qxf6 6.e4 d5! (with gain of tempo). [Note the commentary is from the CD, not me]. Gee, even the commentary is awful. I suspect that this came out of somebody's personal analysis database.

Three of the first twenty games from the database contain no ECO information. None have source or year info. There is no standardization for player names; some have complete names with first names first and last names in capital letters, others use the standard last name-first initial convention. Still others have last names followed by complete first names. And Games 3 and 4 are a guy named M. Abregu playing against himself! I hope it was a correspondence game; I can just hear it now, "I mail the moves to myself because I forget what I was thinking three days ago..."

Moving on, Games 550 and 551 were played by "*" vs. "*". Game 550 is listed as a Black win. The entire game score is as follows: 1.d4 d5 2.Nc3 Nf6 3.e4 e5. That's it.

Game 551 was a win for Black after just five moves: 1.d4 d5 2.Nc3 Nf6 3.e4 dxe4 4.f3 exf3 5.Nxf3 e6. Man, I wish I could get White to resign right out of the opening like that!

Games 1213 through 1227 list the White player as "Undefined", while Games 1275 through 1342 list the White player as "Variation". Both of these sets do list a Black player, however.

Games 3839 through 4021 list both the White and Black players simply as "Computer".

The first 1500 games of the database have neither year nor tournament information supplied. The first 4144 games have no year given.

I'd relate some of the other interesting header info I found in these first 4144 games, but I'm still laughing too hard to talk about it.
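The header problems listed above are the kind of thing a short script can at least flag. The following Python sketch assumes each game has been exported as a dictionary with "white", "black", "year" and "event" keys; that layout and the function name header_problems are hypothetical, and the placeholder list contains only the junk names mentioned in this article.

```python
# Junk "player names" seen in the games described above (lowercased).
PLACEHOLDERS = {"", "*", "undefined", "variation", "computer"}

def header_problems(game):
    """Return a list of human-readable complaints about one game's header."""
    problems = []
    for field in ("white", "black"):
        value = str(game.get(field) or "").strip()
        if value.lower() in PLACEHOLDERS:
            problems.append(f"{field}: placeholder name {value!r}")
    # Same name on both sides, like the player facing himself in Games 3 and 4.
    if game.get("white") and game.get("white") == game.get("black"):
        problems.append("white and black are the same player")
    for field in ("year", "event"):
        if not game.get(field):
            problems.append(f"{field}: missing")
    return problems

# A header shaped like Game 550: "*" vs. "*", no year, no tournament.
for complaint in header_problems({"white": "*", "black": "*"}):
    print(complaint)
```

Flagging is the easy half; deciding whether a flagged game is worth keeping is still up to you.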

Let's look at some examples of what you'll find when you try doing a specific header search. In this case, since I recently finished ghostwriting a book on the Worrall Attack for another publisher, I decided to see what a search of ECO code C86 turned up. I sorted the games alphabetically by White player after the search was completed.

There were 639 games in the list. Believe me when I tell you, there are a lot of duplicates in this number! The famous game between Alexander Alekhine and Paul Keres (Salzburg 1942) appears five times, with the final number of moves being variously given as either 56 or 57. Alekhine's name is also spelled 2 different ways ("Alekhine" and "Aljechin") and there are many other Alekhine duplicates as well.

There are many games which appear twice and sometimes there is absolutely no difference in the header information.

Bronstein-Ulvestad appears 3 times, with two different total move numbers.

We find a lot of computer vs. computer games here. Some are actually pretty interesting from a historical perspective, such as Duchess-Blitz 6.5 from the 1978 ACM tournament. But there are way too many games like Expert 4 mhz-Psion Atari 8 mhz; it's just someone with too much time on his hands having computers play each other and recording the results.

Here's one of my Worrall favorites: Frank Poole-Hal 9000, Discovery, 1994. For those unfamiliar with cinematic history, this game is from the film 2001: A Space Odyssey in which an astronaut named Frank Poole (on a ship called Discovery) is seen playing chess against a supercomputer called HAL 9000. The year on the game is wrong -- the movie was released in 1968 and the game allegedly takes place later than 1994. But worse yet, the game has no business being in a database under this header. The actual game (which was "borrowed" by the filmmakers for use in the movie) was played in a 1910 Hamburg tournament by two gentlemen named Roesch and Schlage, and aside from its use in the movie, is otherwise unremarkable. (And, btw, the game appears under the correct names twice more in Horrorbase. Both of these other appearances have incorrect information: one is listed as having been played in 1913 but has the correct gamescore, while the other has the right year but also has an additional unplayed move added to demonstrate the unstoppable mate.)

The game Jorgensen-From had the header typed in with high ASCII characters; the "o" in "Jorgensen" appears as a square.

A 22-move draw played by Mayer and Schwartz appears in the database. Its inclusion is questionable though. It was played in the 1990 Shenandoah Valley Open, which was not a particularly strong event. I know this for a fact; I was a participant, dragging down the average rating and overall level of play.

And we also find the obligatory game from an Internet chess server, in which "Zippy" beat "Lasse" in 42 moves. Anyone saying that "Lasse" should have played the Colle Opening will be shown the door (especially considering that "Colle" is actually pronounced as a single syllable -- "call" -- as opposed to the incorrect way most people say it -- "collie").

I previously stated that a search for C86 turned up 639 games. After running the duplicate kill function in ChessBase 6, that number was knocked down to 624. Many duplicates were missed because of variances in name spelling, source line info, and the fact that the same game appeared twice (or more!) but with extra moves added (as in Roesch-Schlage, mentioned earlier). So the process of cleaning up such a database is complicated by the very errors one is trying to correct!
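To see why exact-match duplicate killing falls short, here is a rough Python sketch of a more forgiving comparison. It is not how the ChessBase 6 function works; it is just an illustration under stated assumptions: each game is a dictionary with "white", "black" and a "moves" list (a hypothetical layout), two games count as probable duplicates when one move list is the other plus at most a couple of extra plies, and last names are compared loosely with difflib from the Python standard library. The thresholds are guesses, not tuned values.

```python
from difflib import SequenceMatcher

def last_name(player):
    """Lowercased text before the first comma ('Alekhine,A' -> 'alekhine')."""
    return player.split(",")[0].strip().lower()

def similar_names(a, b, threshold=0.7):
    """Loose comparison so spelling variants of one name can still match."""
    return SequenceMatcher(None, last_name(a), last_name(b)).ratio() >= threshold

def probable_duplicate(g1, g2, max_extra_plies=2, min_plies=20):
    """True if one game looks like the other with a few extra moves tacked on."""
    short, long_ = sorted((g1["moves"], g2["moves"]), key=len)
    # Very short scores (opening lines) would match far too easily.
    if len(short) < min_plies or len(long_) - len(short) > max_extra_plies:
        return False
    if long_[:len(short)] != short:
        return False
    return (similar_names(g1["white"], g2["white"])
            and similar_names(g1["black"], g2["black"]))

# Stand-in move lists; real game scores would go here.
moves = [f"ply{i}" for i in range(40)]
a = {"white": "Alekhine,A", "black": "Keres,P", "moves": moves}
b = {"white": "Aljechin,A", "black": "Keres,P", "moves": moves + ["extra"]}
print(probable_duplicate(a, b))
```

A loose test like this trades missed duplicates for false alarms (two different players with similar names repeating the same short draw, for example), so it is better used to build a list for review than to delete anything automatically.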

At first glance, Horrorbase contains 902,000 games. I suspect, however, that due to duplication, the real number is around 700,000.

Is Horrorbase totally worthless? Actually, no. It's a pretty good buy for thirty clams. There is enough interesting and worthwhile stuff here to justify the money spent. There are tons of historical games in the collection, including some neat games from the early days of computer chess. Players from the Washington DC area will get a kick out of Horrorbase; due to the inclusion of a large number of Steve Mayer's games from the last 15 years, I was able to locate a lot of my friends and opponents in Horrorbase (a special "Hello" this week to David Sherman, Dr. Richard Cantwell, Floyd Boudreaux, Phil Collier, Jim Addison, Macon Shibut, Sunil Weeramantry, anybody I missed, and especially Sam Conner), and I suspect other MD/DC/VA players will see familiar names too.

And, for the truly twisted, some of the headers may by themselves provide tons of typo-induced amusement.

Among the wheat lurks a lot of chaff, though. There are a lot of computer vs. computer games that didn't come from recognized events; they're just games that somebody ran for his own amusement and dumped into a database. There are plenty of analysis lines in Horrorbase: games that end in the opening with an evaluation given in the "result" field. Many have an unknown (amateur?) player listed as the "source"; I suspect that these are lines directly taken from amateurs' repertoire databases.

The whole point of Horrorbase is to demonstrate the qualitative differences between data that is available on-line and data that is available commercially from sources such as ChessBase. Quantity does not equal quality.

However, there are a lot of good, quality games available in Horrorbase and on-line, so don't be afraid to use these resources. The trick is to separate the worthwhile games from the junk. Even if you don't (or can't) do this, there are steps you can take to minimize the damage that junk data can do to your database. We'll look at them in next week's Electronic T-Notes. Until then, have fun!

I'm very interested in reading your opinions of this (or any other) issue of T-Notes, as well as my book Battle Royale. I invite you to post comments to our ChessBase Users Group or else you can e-mail me directly.