by Steve Lopez

The volume of chess data available to all levels of players is expanding at a phenomenal rate. I've sometimes heard the corpus of chess data referred to as the "data stream"; actually, it's become more of a data avalanche in the last three years.

Pardon me while I go into my "old-timer" schtick ("back when I was your age, I had to walk fifteen miles barefoot through the snow just to go to the chess club"), but I remember the day when 50,000 games was considered a pretty big database. When I first joined ChessBase USA in 1992, the largest single database we offered was 33,000 games. The first time I heard that, I went "Wow! 33,000 games! Man, I could find just about anything in there!" I have to laugh when I think about that today.

By the following year, the complete collection of data available from ChessBase was about a quarter-million games. I remember writing a 1993 article for the Virginia Chess Federation newsletter in which I mentioned doing a search on a database that size and how a few readers were pretty impressed by it.

Here we are, nearly seven years later, with million-game databases being the norm. Are we really playing that much more chess compared to days gone by?

Not really, but the Internet has changed everything. Back in the early 1990s I used to have to wait two weeks to get the games from a big tournament (when Inside Chess published them); today I can get the games from the Internet later on the same day they're played. Most of us just wait for Monday night when The Week in Chess provides us with the latest roundup and a weekly database of one thousand to two thousand games.

It wasn't that long ago that gamescores weren't collected at tournaments. The players just recorded their results and pocketed their scoresheets. It's becoming increasingly common to see tournaments in which players are required to turn in a copy of their gamescores (even here in the United States). Much of this stuff now finds its way into the public record via the Internet.

This doesn't take into account the literally millions of games played online every day, both in real time and via e-mail. There are several sites around the Internet where you can download huge databases of games played by average players via the various chess servers. If you had the time, energy, and inclination to go around harvesting all of this data, you could easily put together a database in excess of two million games.

There are a few problems caused by this information explosion, though. The major one is that computer hardware has not yet caught up with the demands of the software and data.

Let's look at an example. Back in 1993, if I wanted to find and eliminate duplicates in a 100,000 game database on my 386 3Mb RAM machine, I needed to start the process in the evening and let it run all night. Even then, there was no guarantee that it would be finished by morning (in fact, I recall killing doubles once on a quarter-million game database and having it run from Friday afternoon to Monday morning while I was out of the office for the weekend).

Now let's come back to the present. If you want to merge two different 1,000,000 game databases and then eliminate doubles on a Pentium III 256Mb RAM machine, you'd better reserve at least a weekend for the process (and probably more).

The simple fact is that chess data is proliferating faster than the hardware's ability to keep up with it.

How does one handle the problem? How do you keep a single master database clean of duplicates if a simple "kill doubles" operation requires hours or days on even the fastest hardware? The answer is simple: you don't.

Most PC users have mammoth hard drives these days. A million games (with annotations) takes up around 250 Mb of storage space. You'd have to accumulate a huge number of games to start making a difference on a ten gigabyte hard drive. Having some duplicates is not an issue as far as hard disk space is concerned.
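To put rough numbers on that, here's a back-of-the-envelope sketch in Python. It uses only the figures above (about 250 megabytes per million annotated games, a ten gigabyte drive). The function names are made up for illustration, and a gigabyte is assumed to be 1000 megabytes.

```python
# Figure from the text: roughly 250 MB per million annotated games.
MB_PER_MILLION_GAMES = 250


def millions_of_games_that_fit(drive_gb):
    """How many millions of games fit on a drive (assuming 1 GB = 1000 MB)."""
    return drive_gb * 1000 / MB_PER_MILLION_GAMES


def wasted_mb(duplicate_games):
    """Approximate disk space taken up by a given number of duplicate games."""
    return duplicate_games * MB_PER_MILLION_GAMES / 1_000_000


print(millions_of_games_that_fit(10))  # 40.0
print(wasted_mb(50_000))               # 12.5
```

In other words, even fifty thousand stray duplicates would cost only about a dozen megabytes on a drive with room for tens of millions of games.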

The area in which duplicate games become a factor is that of statistical analysis. For example, if you're studying a particular opening and merge the games into a tree to have a look at the statistical probabilities, you don't want duplicate games skewing the results. So how do you get around this?

It's pretty easy: do your search normally and then copy the games into a new (work) database. Then kill the doubles on the work database. This will only take a few minutes and should give you a relatively clean set of data with which to work. Be aware, though, that multiple spellings of the same player's name will result in the duplicate games being retained, so it's a good idea to make a quick pass through the Player Index to see if anything jumps out at you demanding correction.
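For the programmers in the audience, here's a minimal Python sketch of why name spellings matter. This is not how ChessBase detects doubles internally (I'm assuming a simple comparison of players, year, and moves purely for illustration), and the players and games are invented. The names `Game`, `kill_doubles`, and `player_index` are hypothetical.

```python
from collections import namedtuple

# A deliberately tiny stand-in for a game record (hypothetical structure).
Game = namedtuple("Game", "white black year moves")


def game_key(game):
    """Comparison key: both player names, the year, and the move list."""
    return (game.white.strip().lower(), game.black.strip().lower(),
            game.year, tuple(game.moves))


def kill_doubles(games):
    """Keep the first copy of each game, preserving the original order."""
    seen = set()
    unique = []
    for game in games:
        key = game_key(game)
        if key not in seen:
            seen.add(key)
            unique.append(game)
    return unique


def player_index(games):
    """Sorted list of distinct names, so near-identical spellings sit together."""
    return sorted({g.white for g in games} | {g.black for g in games})


work = [
    Game("Smith, J", "Jones, A", 1999, ["e4", "e5", "Nf3"]),
    Game("Smith, J", "Jones, A", 1999, ["e4", "e5", "Nf3"]),     # exact duplicate
    Game("Smith, John", "Jones, A", 1999, ["e4", "e5", "Nf3"]),  # same game, other spelling
]

cleaned = kill_doubles(work)
print(len(cleaned))        # 2
print(player_index(work))  # ['Jones, A', 'Smith, J', 'Smith, John']
```

The exact duplicate is caught, but the "Smith, John" copy survives because its key differs. Scanning the sorted name list puts the two spellings side by side, which is the same reason a quick pass through the Player Index pays off.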

I'm frequently asked by users how I have my personal databases organized. Of course, there's no right or wrong way to organize one's data; it's strictly a matter of what works best for an individual user. However, I'll provide the information here just to give you a few ideas on how you might possibly organize your own data.

I have Big Database 99 on one of my computers. I copied it from the CD to the hard drive because data access time is significantly reduced when pulling the data off of a hard drive as opposed to a CD. I don't add games to this database and there's a reason for this: when Big Database 2000 comes out in a few months, I can just replace the previous version with the new one. Everything on Big Database 99 will be included in the next version, so I don't have to worry about losing anything. If I decide to add anything to games in Big 99 (such as my own notes), I copy the game to another database and make the changes there.

I run the ChessBase Opening Encyclopedia, the Gambit Lexicon, the Correspondence database, and the Informant database straight from the CDs. This is because I have small hard drives on my computers and I don't have the room for all of this info. If I had a larger hard drive, I'd certainly copy the databases to it.

The ChessBase Magazines that were released after the introduction of Big Database 99, as well as Informants 71 onward, are kept in their own separate databases on the hard drive.

I keep the various data I've downloaded from the Internet in their own separate databases, segregated by source. For example, all of the games from The Week In Chess are in one database, all of the games from the University of Pittsburgh site are in another database, etc. I have over a dozen databases of this type.

Finally, there are individual databases on the openings I play, as well as a variety of databases of my own games (OTB games, postal games, Internet games, games against computers, all kept in separate databases).

Here's how I do a typical opening search. First I search the ChessBase Openings Encyclopedia; this is because I want the opening surveys to be the first games in my database. After the games are found, I copy them into a new database. Next I search the Gambit Lexicon (if appropriate), the Informant CD, and the Correspondence CD, copying the games into the new database as I go along.

The next step is to search all of the databases on my hard drive. It's easy to do this in a single pass: just hold down the SHIFT key and click on each database to be searched. This highlights all of them in grey, showing that they'll be searched. Performing a search pulls up all of the games from these databases that match the search criteria and puts them on the Clipboard; they can then be dragged-and-dropped into the work database.

After I have all the data stored in the work database, I kill doubles on it (which takes just a few minutes) and then run "Physical deletions" on it to actually remove the duplicate games.
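The whole routine can be sketched in a few lines of Python. This is only a model of the workflow, not ChessBase's actual code: the dictionaries, the sample games, and the function names are all invented, and I'm assuming that the first copy of a game is the one that's kept (which is why the order of copying matters).

```python
def search(databases, matches):
    """Search several databases in one pass and collect the hits (the 'Clipboard')."""
    clipboard = []
    for db in databases:
        clipboard.extend(game for game in db if matches(game))
    return clipboard


def mark_doubles(games):
    """Step one: flag later copies of a game as deleted without removing them."""
    seen = set()
    for game in games:
        key = (game["white"], game["black"], game["moves"])
        game["deleted"] = key in seen
        seen.add(key)


def physical_deletion(games):
    """Step two: actually drop every game carrying the deleted flag."""
    return [game for game in games if not game["deleted"]]


# Invented sample data: the same game appears in two sources.
encyclopedia = [
    {"white": "Smith, J", "black": "Jones, A", "eco": "C44",
     "moves": "1.e4 e5 2.Nf3 Nc6 3.d4", "source": "encyclopedia"},
]
twic = [
    {"white": "Smith, J", "black": "Jones, A", "eco": "C44",
     "moves": "1.e4 e5 2.Nf3 Nc6 3.d4", "source": "twic"},
    {"white": "Brown, K", "black": "Green, L", "eco": "B01",
     "moves": "1.e4 d5 2.exd5 Qxd5", "source": "twic"},
]

# Encyclopedia first, so its copy is the one that survives.
work = search([encyclopedia, twic], lambda game: game["eco"] == "C44")
mark_doubles(work)
cleaned = physical_deletion(work)
print(len(work), len(cleaned))  # 2 1
print(cleaned[0]["source"])     # encyclopedia
```

Note that marking and removing are separate steps: after `mark_doubles` the work database still holds both copies, and only `physical_deletion` shrinks it.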

How do I determine the source of a particular game once I've copied it into a new database? This requires a bit of preparation beforehand, at the time I download a set of games from the Internet. Let's have a look at the procedure I use.

The first thing I do after unzipping a file from the 'Net is convert it to .cbh format (if it's not already in that format -- I just drag and drop the games into a .cbh database). Next I add the source to the game headers. For example, I've just downloaded The Week in Chess issue 252 to my hard drive. I create an icon for it in ChessBase 7, click once on it to highlight it, and then click the "Annotator Index" button at the bottom of the Database window. In the window that appears, I see the entry "No annotator". I right-click on this line, select "Edit" from the menu, and type "TWIC 252" in the box. After clicking "OK", the words "TWIC 252" are added to the "Annotator" line in the game headers to allow me to easily identify the source of the gamescore.

Next I click on the "Technical menu", then on "Opening classification", and then on "Set ECO codes". This will call up the function that puts ECO codes in the game headers (allowing searches for games by opening codes). I run this function. Next I click on the "Opening Key for current database" button. I have a modified version of the old early-90's "Mainbase" key, into which I've added keys for some of the oddball openings I play. I attach this key to the database I've downloaded.

I next drag-and-drop the whole shebang into my TWIC database. Then I start doing searches for the individual openings I play, copying the games to the relevant databases as I find them. This is the primary reason why I put the source of downloaded games in the "Annotator" field -- if I come across two conflicting gamescores in an opening database, this helps me decide which one to keep.
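Here's a small Python sketch of the idea behind tagging the source. It's a model only: the field names, the sample games, and the "Pitt archive" label are assumptions made for illustration, and the functions are hypothetical rather than anything ChessBase provides.

```python
def tag_source(games, source):
    """Fill in the annotator field with a source label wherever it is empty."""
    for game in games:
        if not game.get("annotator"):
            game["annotator"] = source
    return games


def conflicting_scores(games):
    """Find games with the same players and event but different move lists.

    Returns {header: {moves: [sources that gave those moves]}}.
    """
    by_header = {}
    for game in games:
        header = (game["white"], game["black"], game["event"])
        versions = by_header.setdefault(header, {})
        versions.setdefault(game["moves"], []).append(game["annotator"])
    return {h: v for h, v in by_header.items() if len(v) > 1}


# Invented sample data: two downloads disagree about Black's second move.
twic = tag_source(
    [{"white": "Smith, J", "black": "Jones, A", "event": "Club Ch",
      "moves": "1.e4 e5 2.Nf3 Nc6", "annotator": ""}],
    "TWIC 252",
)
other = tag_source(
    [{"white": "Smith, J", "black": "Jones, A", "event": "Club Ch",
      "moves": "1.e4 e5 2.Nf3 Nf6", "annotator": ""}],
    "Pitt archive",
)

conflicts = conflicting_scores(twic + other)
for header, versions in conflicts.items():
    for moves, sources in versions.items():
        print(header, moves, sources)
```

With the source recorded in each game, a conflict shows up as two move lists with a label attached to each, so the decision about which one to keep comes down to which source you trust more.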

I can think of a few reasons why someone would be opposed to having multiple databases instead of a single master database. As I said earlier, everything is a matter of preference. I find that keeping my data segregated by source works best for me; if I want to know how often a certain opening variation appears in all 70+ volumes of the Informant, I can easily find out by doing a search of just my Informant database(s) without having to weed out information from other sources. However, I can see where having a single large database could be very convenient.

As always, one size doesn't fit all. But I've talked to a lot of users who are frustrated by the amount of data swamping their hardware's ability to deal with it; hopefully, I've eased the frustration somewhat by providing some workaround ideas.

Until next week, have fun!

My gambit homepages are a year old! Thanks to everyone who's taken the time to stop by and have a look around!

You can e-mail me with your comments, suggestions, and analysis for Electronic T-Notes. If you love gambits, stop by my Chess Kamikaze Home Page and the Yahoo Chess Kamikazes Club.