by Steve Lopez

I think I finally have the kinks worked out of Electronic T-Notes and Battle Royale and have the pages formatted the way I want. I'd like to thank you for your patience while I muddled through the process.

I'd also like to thank everyone who has written to me with questions and suggestions for ETN. I plan to cover all the topics that have been suggested, but it will literally take months to get to everything. If you don't see your suggestion or question covered in the next few weeks, don't lose heart -- I will get to it!




In the last Electronic T-Notes, we discovered that chess games are available from a wide variety of sources, but not all of the data is as accurate as we would like it to be. This week we'll look at ways to prevent inaccurate data from polluting our data libraries.

The first step in the process is to ask yourself a tough question:

What data sources do I trust?

Every action you perform in managing your data revolves around the answer to that question, so consider it carefully. Data is available from so many sources; which ones do you trust? Who do you think supplies the most accurate data? Whose header information requires the least work to standardize? When it comes time to separate the wheat from the chaff, whose data stays and whose goes? These questions are the guideposts you need to follow as you consider your answer.

First comes the bad news. The only way possible to know with 100% certainty that a gamescore from any source is correct is if you actually saw the particular game played with your own eyes. Period. Otherwise you have to trust some intermediary party along the way to accurately write down and transcribe the moves.

Accidents can happen. Errors are made. Have you ever written down a wrong move or accidentally omitted a move from your copy of the gamescore while playing in a tournament? I've collected gamescores from a couple of events that I've run and I've had games that were impossible to follow because the players' scoresheets didn't match each other, even to the point that I wondered if they were actually playing the same game!

Even published sources can contain errors. Here's a famous example. Have you ever played through the game from Alekhine's My Best Games of Chess in which the two players had five Queens on the board between them? It's a classic game and absolutely fascinating. There's only one problem with the gamescore as given in the book: it never happened.

I hate to burst anyone's bubble with that bit of information, but it's true. That game is mostly fabricated. The header information is correct; Alekhine did face that particular player in that event. But the famous position with five Queens on the board came straight out of Alekhine's imagination. He later admitted in a letter that as he analyzed the somewhat more pedestrian game he came across the fascinating possibility of a variation in which five Queens would be on the board simultaneously. He was so taken with the notion that he presented the variation as the actual gamescore.

Unfortunately, the true moves were never presented in later editions of Alekhine's My Best Games of Chess. The game as published has since been accepted as gospel, printed and reprinted scores of times, and now exists in millions of databases with no hope of future correction.

So if even the greats of the past sometimes mislead us, how do we know our data is accurate today? As I stated, the only way to positively know is to take the John Locke approach: if you didn't see it played yourself, don't trust it. I'll warn you, however, that this will lead to ownership of very small databases. At some point, you have to trust somebody.

Like you, I tend to gather data from a variety of sources. After three years of access to online services, I've developed a personal hierarchy of sources in order of accuracy, trustworthiness, and ease of integration into existing data, a hierarchy which I'll share with you. I'll list them in order of the ones I trust most to the ones I trust least, with appropriate comments added:

1) ChessBase data. Yes, I'm partisan as all get out. But the fact is that I know that ChessBase data contains standardized game headers and is checked for accuracy. The data isn't perfect and errors do creep in from time to time. However, the data available from ChessBase is 99.x% accurate, labelled as being from a ChessBase data disk (the disk name is given as part of the "Annotator" field), and ready to be merged into existing databases with minimal hassle.

2) Data from other commercial sources with which I am familiar. If I know the guys who are selling the data, if I know they're checking the data for mistakes, standardizing the headers, and generally trying to do a good job of presenting the data, then I'll tend to trust them.

3) The Week in Chess. Over one hundred issues and still going strong, Mark Crowther's weekly chess magazine is the on-line chess publication. The games provided in TWIC are pretty accurate, as the contributors are generally people who are actually on the scene as the tournaments are played. If errors are later spotted, they are corrected in future TWIC issues (which earns big points on my "trustworthiness" scale).

4) CompuServe Chess Forum. We're starting to enter the "twilight zone" now. I'm not saying that everything in the CompuServe Chess Forum's library is 100% accurate, but I know that the contributors are frequently people "on the scene" (like Vadim Kaminsky) and that errors are corrected if they are later discovered. There are still risks, however. I once downloaded a database purporting to be over 1700 Smith-Morra Gambit games, only to discover that almost half of them were of other Sicilian variations. Caveat emptor. I will say, though, that CServe's Chess Forum was the best chess club I ever belonged to, and that the level of discussion there is several orders of magnitude higher than that exhibited in the chess newsgroups on Usenet. I miss it terribly.

5) Data from other commercial sources with which I am not familiar. If I don't know the guys gathering or editing the data, or if the data looks unedited (randomized header standards, etc.), then I have to lump it here. I'll try to use it (since I paid for it), but I'll always have suspicions.

6) The University of Pittsburgh ftp site and other Internet chess libraries. The Pitt site is a great source of games, but (as we saw in the last Electronic T-Notes) the standards are spotty, to say the least. While errors are sometimes (read "rarely") corrected, it's still not a terribly trustworthy site. Games are found to be wrong, headers are found to be wrong, and since just about anything uploaded tends to get accepted, you may wind up with a lot of amateurs' games in your database. Or, as my friend Randy Foreman said to me last week (in one of his intensely humorous e-mails): "At Pitt, I found [the] Asian Girls Under-12 championship. I really needed that in my collection".

7) Everybody else. There are tons of web pages that include links to databases, mostly in PGN format. Who really knows what you're getting here? Is it trustworthy? Is it accurate? Is it even played by the people whose names are in the headers? Heck, do the links even work? You pays your money, you takes your chances...

As I stated previously, I use data from multiple sources. But I tend to trust the first three on this list more than I trust any of the others.

Once you've determined who you trust, the next step is to determine how you're going to use that data. How do you utilize it while managing to keep it from corrupting your overall data pool?

The keys here are twofold: segregation and careful labelling.

The first step is segregation (please, no references to the Supreme Court here). I tend to separate my data according to source. As for ChessBase data, ChessBase Magazines and Expresses go into one database, Informants into another, Correspondence Chess Yearbooks into another, and so on. Games from The Week In Chess go into another database. Games from the Pitt site go into another database. I'd continue, but you get the idea. Keeping them separated by source makes it easier to find and surgically remove problem data at a later time.

I first suggested the approach of multiple databases back in 1993 and had a few people kick because ChessBase 4 allowed users to search just one database at a time. But the Windows versions of ChessBase allow searches across multiple databases in a single pass, so there's no reason not to adopt this approach. The only argument I can think of against this approach is the one of duplicate games taking up extra disk space. However, if hard drive or archival space is that limited, I question the concept of the user trying to establish a huge database in the first place.

The second key I mentioned is careful labelling. Whenever I get data from an on-line source, I use DBEdit to add a reference to the source to the header information. I just open the database, hit [SHIFT-F4], type in the additional header info, hit [ENTER], and let it rock. For example, every game I get from The Week In Chess has "[TWIC]" added to the "Source" field.
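For readers who keep games outside ChessBase, the same labelling idea can be sketched in a few lines of Python. This is only an illustration, not how DBEdit works internally: it assumes each game's header is held as a plain dictionary, and the function name label_source and the sample games are hypothetical.

```python
def label_source(games, tag):
    """Return copies of the game headers with `tag` added to the Source field."""
    labelled = []
    for headers in games:
        updated = dict(headers)  # copy, so the downloaded data stays untouched
        source = updated.get("Source", "").strip()
        if tag not in source:  # don't add the same label twice
            source = f"{source} {tag}".strip()
        updated["Source"] = source
        labelled.append(updated)
    return labelled


# Hypothetical headers standing in for a freshly downloaded batch of games.
twic_games = [
    {"White": "Player A", "Black": "Player B", "ECO": "A52"},
    {"White": "Player C", "Black": "Player D", "ECO": "B33", "Source": "round 3"},
]
labelled = label_source(twic_games, "[TWIC]")
```

Whatever was already in the "Source" field is kept and the label is appended to it, so running the labelling step twice on the same games does no harm.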

Why is this important? Because you're not going to be using your data in a lab-sterile environment. If you're going to study the Budapest Defense in-depth, you're going to search all of your databases for ECO codes A51 and A52 in a single pass. You're not going to search for and play through all of your ChessBase data, then all of your Week In Chess data, then all of your Pitt data, etc. in three or more separate passes.
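To make the "single pass" idea concrete, here is a rough Python sketch of searching several source-separated collections at once. It assumes each database is just a list of header dictionaries with an "ECO" field; the function search_by_eco, the database names, and the sample games are all made up for illustration and have nothing to do with ChessBase's own search.

```python
def search_by_eco(databases, eco_codes):
    """Collect matching games from every database, remembering where each came from."""
    hits = []
    for db_name, games in databases.items():
        for game in games:
            if game.get("ECO") in eco_codes:
                hits.append((db_name, game))
    return hits


# Hypothetical collections, one per source.
databases = {
    "chessbase": [{"White": "Player A", "Black": "Player B", "ECO": "A52"}],
    "twic": [
        {"White": "Player C", "Black": "Player D", "ECO": "B33"},
        {"White": "Player A", "Black": "Player B", "ECO": "A52"},
    ],
    "pitt": [{"White": "Player E", "Black": "Player F", "ECO": "A51"}],
}
budapest = search_by_eco(databases, {"A51", "A52"})
```

Each hit carries the name of the database it came from, which is what lets you tell the copies of the same game apart once the results are mixed together.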

Once your search is done, you'll notice duplicates in the search results. You might find a recent Ivanchuk-Kramnik game three times in the list. Labelling the data lets you know which games to eliminate and which one to keep. For example, if you see three unannotated versions of the game, one labelled "CBM", one labelled "TWIC", and one labelled "pitt", you'll know to eliminate the latter two.

The hierarchy of data I described earlier comes into play here, as well. If I have three versions of a game and all disagree in some way (for example, two have the same number of moves but a difference in transpositional move orders, while a third agrees with one of the first two in move order but has an extra move at the end of the game), I'll have an idea which one to keep and which two to delete. In my own databases, ChessBase games always get the preference for inclusion after a search, while games from the Pitt site are likely candidates for deletion.
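The keep-one, delete-the-rest decision can be sketched in Python as well. This is a simplified illustration under several assumptions: games are header dictionaries, duplicates are spotted by identical player, event, and year headers (which only works once headers are standardized), and the labels "CBM", "TWIC", and "pitt" are ranked by their positions in the hierarchy above. The names TRUST_RANK, game_key, and keep_most_trusted, and the sample games, are hypothetical.

```python
# Lower number = more trusted; the numbers follow the hierarchy listed earlier.
TRUST_RANK = {"CBM": 1, "TWIC": 3, "pitt": 6}
UNKNOWN_RANK = 7  # anything unlabelled falls in with "everybody else"


def game_key(game):
    """Headers used to decide that two records are the same game."""
    return (game["White"], game["Black"], game["Event"], game["Year"])


def keep_most_trusted(games):
    """Keep one record per game: the one from the most trusted source."""
    best = {}
    for game in games:
        key = game_key(game)
        rank = TRUST_RANK.get(game.get("Source"), UNKNOWN_RANK)
        if key not in best or rank < best[key][0]:
            best[key] = (rank, game)
    return [game for _, game in best.values()]


# Hypothetical search results: one game found three times, plus one loner.
hits = [
    {"White": "Ivanchuk", "Black": "Kramnik", "Event": "Example Event", "Year": 1996, "Source": "pitt"},
    {"White": "Ivanchuk", "Black": "Kramnik", "Event": "Example Event", "Year": 1996, "Source": "CBM"},
    {"White": "Ivanchuk", "Black": "Kramnik", "Event": "Example Event", "Year": 1996, "Source": "TWIC"},
    {"White": "Player E", "Black": "Player F", "Event": "Example Event", "Year": 1996},
]
kept = keep_most_trusted(hits)
```

The sketch looks only at the source label. It does not compare the moves themselves or check for annotations, so an annotated copy from a less trusted source would still need a look by hand before it is thrown out.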

And the hierarchy is a valuable tool for viewing games as well. If you're playing through all of the games of a certain Budapest line and suddenly see what looks to be a novelty, stop and check the game header. If the player is one you've heard of, with a high rating, and you see an abbreviation for Informant in the game header, chances are that it's really an important theoretical novelty. But if the player is somebody you haven't heard of, with no rating given, and "pitt" in the header, the "novelty" may be suspect. It might actually be a strong move, stumbled on by accident by a nine-year-old Malaysian girl in that database Randy downloaded from Pitt. But I'm more apt to spend time analyzing the position and questioning the move in that case than I will if the player's name is "Polgar" and I got the game from an Informant volume on disk.

So establishing a hierarchy of preference, data segregation, and data labelling are the keys to database management when dealing with games from multiple sources.

Of course, one could challenge the entire "huge database" argument in the first place...

My pal Jon Edwards, the current U.S. correspondence chess champion, and I had this discussion a few years ago. Jon is of the opinion that any game played by any player, regardless of rating, is a valuable addition to a database. If nothing else, a game played by two unrated sixth-graders can tell you how not to play a particular opening or position, in Jon's opinion.

I countered with the idea that Jon is a strong enough player to recognize these mistakes (especially the subtle ones) while the average rank and file player won't know a "good" game from a "bad" game without Elo ratings being presented in the headers. Inclusion of "bad" games by weak players in a database might actually be harmful to the average player, by being misleading and possibly reinforcing misconceptions and bad habits.

It's four years later, and Jon and I still differ on the subject. It's not a question of right and wrong, it's a question of relative strength and approach. To Jon (a very strong player), every chess game has merit, even if it's to show how not to play chess. To me (a.k.a. Joe Fish), a game annotated by a grandmaster is worth 5 unannotated GM games and at least 50 unannotated games by class-level players. I can learn from all of these games, but the learning curve will be much less steep with a talented player guiding me through it.

I presently have database access to over two million chess games in my personal collection. I will never play through the vast majority of them. As a writer, I find these games to be a valuable reference library. As a player, I find most of these games to just be a burden. I don't even view all of the games in the openings I actually play, much less the thousands of other games in the collection. I just play through the annotated stuff as I get it and skim through the rest.

But you aren't me and nobody expects you to adopt my approach. I'm just giving you some food for thought. Is a huge database really valuable to the average player? Are smaller databases with lots of annotated material more valuable? Is it worth the time and trouble for the average player to perform the database management that huge data pools require?

If your answer to that last question was "yes", I hope that I've provided you with some helpful tips on keeping bad data from corrupting your pool and ruining your game.

In any event, I hope I've given you some things to think about. Easy access to tons of chess games on-line is a mixed blessing. It's like the old comic book superhero cliche: "with great power comes great responsibility". Having two million chess games on various CDs can be a good thing or a bad thing; it all depends on how you use the data. At the very least, this electronic easy access forces you to question data reliability and decide what you're going to do about it.

As for me, I'll decide what to do with Horrorbase as soon as I stop laughing at the obvious errors in the game headers...

I'm very interested in reading your opinions of this (or any other) issue of Electronic T-Notes, as well as my book Battle Royale. I invite you to post comments to our ChessBase Users Group or else you can e-mail me directly.