A table's primary key is the field or combination of fields that yield a distinct value for every row in the table. It is hard to overstate the importance of choosing good primary keys when you design a relational database. Notwithstanding the fact that they are required in highly relational systems, primary keys play very important engineering and conceptual roles in any database. A primary key ensures that all the records in your database have a distinct identity. This identity can be used to unambiguously distinguish every piece of information in a database from every other piece of information. Without a primary key, data can become "lost" and lookups may return inconsistent results due to random ordering of data. Integrity rules will easily be violated and information might start leading a double or even triple life in your system. The noble primary key is a vigilant guardian against the spectre of structural chaos.
Many popular database tools make it very easy to create some sort of arbitrary numeric sequence for your primary keys. Be carefull not to use these to excess. If you have are designing a music catalog system, it is definitely not useful to create a lot of arbitrary codes for music classification. There is unlikely to be much benefit in designating "1" to stand in for "Jazz", 2 for "Rock", 3 for "Classical" and 4 for "Latin Jazz" in an application. Why not just use the word "Jazz" for "Jazz" and use "Rock" to stand for "Rock"? It is certainly easier to remember. It also means that you won't have to lookup the value of 1 to see that it stands for "Jazz". Searching for the text string "Jazz" will be only incrementally slower than the search for the number one, but compared to always have to lookup 1 to see that it stands for "Rock" - I mean "Jazz"- is well worth the compromise. Every lookup is expensive and codes add complexity. They should be treated with care. It can only be a symptom of an all too common techno-fetish to assumes that "1" is a better notation for "Jazz" than the word "Jazz". The most common exception to this rule is for commonly accepted codes that are already in place for the specific purpose of the business at hand. International or multi-lingual applications are also cases where numeric abstraction would be appropriate. It is sometimes possible to use existing codes that were not specifically designed for your purposes. These can help provide valuable insight into the problems at hand, but special care should be taken to make sure that the resulting repurposed codes are appropriate for the system. The differences between the original and the repurposed codes should also be clearly articulated.
However, since numeric keys are the most efficient keys to perform computations on, some designers consider key allocation a nuisance and always assign arbitrary numeric keys to satisfy this system requirement. A large number of CASE tools have been known to do this automatically. This simplistic solution seemingly removes the burden of key discovery from the project. Nothing could be further from the truth. Even if you are working strictly within the confines of a CASE tool, it makes sense to find all the candidate primary keys in each of your database tables. The identifiers that can already be found in the data are called natural keys, and are important structural elements of the design. Relying on arbitrary numeric identifiers for all your keys generally increases system complexity. Always using surrogate keys renders opaque important aspects of your data structure. Perhaps when the cost of storage was a significant factor in systems design this practice could be justified. Of course, saving a few bytes was the same justification used to create the Year 2000 problem all those years ago. With the year two thousand just behind us, any argument that numeric keys should be used to save a few bytes of storage or a few processing cycles has been thoroughly discredited.
Practically speaking, surrogate keys are usually appropriate when a table's identity is the sum of more than two or three of its fields. Although most RDBMSs support multiple column indexes, keys and joins, it will be very inefficient to have a primary key that is a composite of many fields. Referring to this key will be cumbersome and not very fast. Also, sometimes there is no combination of "natural" fields to fully distinguish records from one another. Conceivably two different invoices could exist with completely identical information but be two separate invoices. One person could place two separate orders for identical goods at the same time. Surrogate keys are also a good choice when it comes to identifying people. In fact, given the number of different individuals with the same name, the unreliability, confidentiality and coverage considerations with social security numbers, as well as the fact that people often move or change phone numbers often make assigning numeric identifiers to people a necessity.
Indexes can dramatically speed up database query operations. Unfortunately they also slow down all inserts and some updates and consume disk space. If your database table is used mostly for reading and not writing this isn't much of a concern, but if you have a large volume of transactions on a table, overall system performance can often be improved by not building indexes to optimize every possible query. In most databases, indexes can only be built on fields from a single table tables and not on queries or using fields from multiple tables. Recent versions of Oracle and Microsoft SQL Server do actually allow indexes on query views, but only for the more advanced (read expensive) versions of their databases. In general, you can not tell a database explicitly to use an index in the query. The database engine engine optimizer is responsible for figuring out how best to execute your SQL queries. However, some databases like Oracle and MS SQL Server do allow you to give the optimizer some "hint" as to how to perform a query, but this is rarely recommended as anything but a last resort. Effective database admistrators must anticipate which indexes the optimizer is likely to use for common queries. For SQL Server you can also take a peak at what the optimizer is doing with your SQL by viewing SHOW PLAN information using a tool like SQL Query Analyzer. For maximum effectiveness, indexes must also have their statistics updated after large amounts of data modifications. This can be done in Access by repairing and compacting the database. SQL Server has the UPDATE STATISTICS command which can be used on individual objects in the database. Here are a few rough and ready guidelines to help decide which fields to index:
1. Primary keys should always be indexed
2. Foreign keys should always be indexed.
Note that Microsoft Access already will automatically build primary key indexes for you and create foreign key indexes when you define relationships, so make sure you check for duplicate indexes after adjusting your data model. There is also an option in Access97+ to automatically create indexes for fields with names ending in "ID" or "Num".
3. Yes/no fields or fields with no more then 3 or 4 possible values should not be indexed. These types of fields are said to have low selectivity and the database optimizer will most likely ignore them completely and do a tablescan instead. Other fields should only be indexed if they are used in common query expressions. Note that queries using the like '%whatever%' syntax will never use an index and will always force a tablescan.
4. Long text (memo) and Binary (OLE/image) fields can not be indexed so don't knock yourself out trying.
Most SQL databases can not combine indexes for a query. So if you wanted to optimize performance queries on Last Names, First Names and First and Last Names you would have to build three indexes, one First Name index, one Last Name index and one index for criteria using a combination of the First and Last Name fields. One of Access' great innovations gets around this by using Rushmore technology. Rushmore allows Access to optimize multi-field queries even if you haven't created a multi-field index. The increase is dramatic, usually at least 200-300% over a non-Rushmore optimized query. Multi-field indexes are still faster, but given the amount of overhead they introduce, Rushmore allows you to have your cake and eat it to, since it doesn't require any disk space. Rushmore optimization works with Microsoft Access tables, as well as with Microsoft FoxPro and dBASE tables. Rushmore can not optimize Queries to external ODBC data sources, since Microsoft Access' Jet Engine is not doing the main processing.