On Relations and Relationships

Author: David Kawliche

Introduction

E.F. ("Ted") Codd conceived of his relational model for databases while working at IBM in 1969. Codd's approach took a cue from first-order predicate logic, the basis of a large number of other mathematical systems, and was presented in terms of set theory, leaving physical representation and access implementer-defined. In June 1970, Codd laid down much of the groundwork for the model in his article "A Relational Model of Data for Large Shared Data Banks," published in the Communications of the ACM, a highly regarded professional journal of the Association for Computing Machinery. Over the next few years Codd and his relational ideas blazed across the academic computing landscape.

The Rise of SQL and Relational Technology

To support his relational theory, Codd developed a language called ALPHA that he often used to communicate ideas in an academic context. However, in 1974 and 1975 Raymond Boyce and Don Chamberlin of IBM designed a new fourth-generation language, known as Structured English Query Language, or SEQUEL, to extract information from systems based on Codd's relational model. The name would later be shortened to SQL, but it is still most correctly pronounced "sequel." For better or worse, SQL has become popularly known as the relational language, much to the chagrin of such luminaries of the database world as Fabian Pascal, C.J. Date and Codd himself. SQL certainly is an improvement over many earlier quasi-languages used to perform database queries, such as CODASYL, which usually required incredibly complicated code to answer even the simplest of questions from a database. Even so, it has never been a fully relational or fully declarative language. Although SQL more or less allows users to specify the results they want rather than the procedure for obtaining them, there are still significant procedural elements in the language. Ideally, a purely declarative relational language would entirely absolve the user from having to figure out the best way to execute the program. As it is, today's SQL databases often show wildly divergent execution times for different expressions of the same logic. Nonetheless, despite its shortcomings, SQL's relative scope and elegance soon drew many converts, and SQL is still (incorrectly) considered by many to be basically synonymous with the relational model.

Despite all the excitement, it was not until 1979 that the first commercial database product to use SQL was released by Oracle, only two years after its founding. This offering was quickly followed by IBM's SQL/DS product, the forerunner to DB2. By the mid-1980s the relational bandwagon was definitely getting crowded with new companies hawking all sorts of "relational" wares. Not only were DB2 and Oracle significant players in the market, but there were also Digital Equipment Corporation's RDB and Relational Technology's Ingres, along with a host of lesser-known products. Codd had extended his model further in his aptly titled paper "Extending the Database Relational Model to Capture More Meaning," published in 1979 in the December issue of the ACM Transactions on Database Systems. However, as the marketing departments of commercial database companies began beating ever more loudly on the relational drum, Codd became increasingly distressed over what he saw as the unfulfilled promise of relational technology. In 1985, Codd, now president of the Relational Institute and with his own consultancy, put forth 12 basic rules plus nine structural, 18 manipulative and three integrity rules, all of which had to be satisfied for a database to be considered fully relational. More rules would be forthcoming, but Codd assured readers that the current rules would be more than adequate to ensure that a database was "mid-80s" fully relational. In presenting these rules, Codd also clearly demonstrated that no vendor could honestly profess to have a fully relational system. He took the entire industry to task for overstating its conformance to the relational model, and he offered a few scathing criticisms of the then-current draft of the first ANSI SQL standard as well. In 1990 Codd published his promised revision of the relational model in the book "The Relational Model for Database Management: Version 2." Needless to say, most relational database vendors fared even worse in Codd's Version 2 relational fidelity tests than they did in his mid-80s tests.

Codd had simplicity as a major objective of his model. Unfortunately, given the depth and complexity of Codd's thought, not to mention the arcane mathematical terms in which he often expressed himself, many of his key points have been widely misunderstood by the model's practitioners. The author of the article you are now reading readily admits to possessing only a basic understanding of Codd's model even after having spent years as a "relational" database professional. In fact, the technical term "relational" is very often misconstrued by programmers, who are often surprised to find out that the "relational" in relational theory refers to relations and not relationships.

Relational Basics

As noted by Fabian Pascal:

But note carefully that:

As a simple example, if you had a company table and an employee table, and each company row could have many employee rows associated with it, you would (assuming the tables were correctly designed) have two relational tables and one relationship.
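The company and employee example can be sketched with Python's standard sqlite3 module. This is only an illustration under stated assumptions: the table names, column names and sample rows are hypothetical, and an SQL table is at best an approximation of a relation in Codd's sense. The two tables stand in for the two relations; the single relationship is the reference from employee.company_id to company.company_id.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked

conn.executescript("""
    CREATE TABLE company (
        company_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL UNIQUE
    );
    CREATE TABLE employee (
        employee_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        company_id  INTEGER NOT NULL REFERENCES company (company_id)
    );
""")

conn.execute("INSERT INTO company VALUES (1, 'Acme')")
conn.executemany(
    "INSERT INTO employee (name, company_id) VALUES (?, ?)",
    [("Ann", 1), ("Bob", 1)],
)

# The relationship: one company row is associated with many employee rows.
employee_names = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM employee WHERE company_id = ? ORDER BY name", (1,)
    )
]

# An employee row that points at a nonexistent company is rejected.
try:
    conn.execute("INSERT INTO employee (name, company_id) VALUES ('Cy', 99)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
```

Counting the pieces makes the terminology concrete: there are two CREATE TABLE statements (the relations) and exactly one REFERENCES clause (the relationship).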

Normalization is the process by which a series of rules known as the normal forms are applied in sequence to a tabular data set that has not been correctly designed in accordance with relational principles. As each rule is applied, the data set achieves a higher degree of normalization. Generally, these rules dictate that redundant information should be moved into new relations (tables). No information is lost in this process; however, the number of tables generally increases as the rules are applied. Thus, to reconstruct information that once was in one table, a relational query must pull data from more than one table in a join operation.
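A minimal sketch of that process, again using Python's sqlite3 module, might look like the following. The flat data set, the table names and the choice of natural keys are all assumptions made for illustration, and the sketch performs a single decomposition step rather than walking through each normal form. It moves the repeated company facts into their own table and then checks that a join reconstructs the original rows.

```python
import sqlite3

# Hypothetical flat data set: the company's city is repeated on every employee row.
flat_rows = [
    ("Ann", "Acme", "Boston"),
    ("Bob", "Acme", "Boston"),
    ("Cy", "Globex", "Denver"),
]

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE flat (employee TEXT, company TEXT, city TEXT);
    CREATE TABLE company (name TEXT PRIMARY KEY, city TEXT NOT NULL);
    CREATE TABLE employee (
        name    TEXT PRIMARY KEY,
        company TEXT NOT NULL REFERENCES company (name)
    );
""")
conn.executemany("INSERT INTO flat VALUES (?, ?, ?)", flat_rows)

# Move the redundant company facts into their own relation, stored once each.
conn.execute("INSERT INTO company SELECT DISTINCT company, city FROM flat")
conn.execute("INSERT INTO employee SELECT employee, company FROM flat")

company_count = conn.execute("SELECT COUNT(*) FROM company").fetchone()[0]

# A join reconstructs the original rows, so no information was lost.
rebuilt = conn.execute("""
    SELECT e.name, c.name, c.city
    FROM employee AS e
    JOIN company AS c ON c.name = e.company
    ORDER BY e.name
""").fetchall()
```

One table became two, the city of each company is now recorded exactly once, and the price is the join needed to see the original picture again.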

Relational Reality

Some inexperienced practitioners of relational theory believe that fully normalizing your data will make all data operations more efficient in your average "relational" database. Although it sounds good on paper, this view doesn't take into account the fact that the hype surrounding the so-called relational technology in commercial database products has always been more marketing than product. Even the books published by Oracle Press clearly state for the unwary that no major application can be programmed in Third Normal Form and that many may face serious performance difficulties if they are normalized at all. Overzealous application of the normal forms is the bane of databases in many enterprises. This does not reflect any particular problem with the relational model per se. Highly relational designs certainly reduce the amount of redundancy in a system and make management of multiple representations of the same information quite straightforward. However, they do this at a cost. Despite the many innovations and efficiencies of modern database systems, it still takes longer to bring together information from diverse sources (i.e., related tables) than it does to get all the information you need from a single source (i.e., one big table). Reckless disregard for this reality has been a major contributing factor in a number of spectacular failures in the history of Information Technology. Today's databases cannot serve all the complex requirements for information in a diverse enterprise without implementing multiple table structures. In general, the most practical structure for reporting on information from your database is a denormalized one, while the most practical structure for ensuring data integrity in a transactional system is very highly normalized. Note that this is no shortcoming of the relational model itself; rather, it is a scathing critique of how poorly the model has been implemented in popular commercial databases.
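The trade-off between the two structures can be sketched as follows. This is a toy illustration, not a benchmark or a recommended design: the customer, sale and sale_report tables and the rebuild_report and totals helpers are hypothetical names invented for the example. The normalized tables hold each fact once, while the reporting table is a deliberately redundant copy that answers reports without a join but goes stale until it is rebuilt.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized, transactional side: each fact is stored once.
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, region TEXT NOT NULL);
    CREATE TABLE sale (
        sale_id     INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer (customer_id),
        amount      INTEGER NOT NULL
    );
    INSERT INTO customer VALUES (1, 'East'), (2, 'West');
    INSERT INTO sale VALUES (10, 1, 100), (11, 1, 50), (12, 2, 70);
""")


def rebuild_report(db):
    """Materialize the join into one wide table so reports need no join."""
    db.executescript("""
        DROP TABLE IF EXISTS sale_report;
        CREATE TABLE sale_report AS
            SELECT s.sale_id, c.region, s.amount
            FROM sale AS s
            JOIN customer AS c ON c.customer_id = s.customer_id;
    """)


def totals(db):
    """Total sales per region, read from the denormalized table only."""
    return dict(db.execute(
        "SELECT region, SUM(amount) FROM sale_report GROUP BY region"
    ))


rebuild_report(conn)
before = totals(conn)

# Customer 1 moves regions: a single row changes on the normalized side,
conn.execute("UPDATE customer SET region = 'West' WHERE customer_id = 1")
conn.commit()
stale = totals(conn)  # but the redundant copy is unchanged until it is rebuilt.

rebuild_report(conn)
after = totals(conn)
```

The update touches one row in the normalized design, yet the report keeps showing the old regional totals until rebuild_report runs again. Deciding when and how to refresh such copies is the kind of redundancy management a real system has to plan for.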

Although it often isn't pretty, effective management of redundancy, and not its complete abolition, is still the key to engineering effective real-world systems using popular commercial technology. Relational theory and the concept of normalization remain the scientific foundation of database management.

This article has been substantially revised based on an earlier critique by Fabian Pascal. You can learn more about relational theory by reading books by Fabian Pascal and C.J. Date.