Have Code, Will Travel

December 10, 2009

Database Standards and Conventions

Filed under: SQL — Phil Sandler @ 4:38 pm

I was asked recently to write up a basic set of database standards and methodologies we would be using for an upcoming project.  I tend to use the same standards on every project that I manage, and I am usually the final authority on determining our approach, so there usually isn’t a lot of discussion around this topic.  However, in this case the client wanted some insight into our best practices, so we needed to hash it out and provide it in writing.

As a general matter, I feel strongly that having a set of standards, any standards, is preferable to everyone just doing things in their own way.  I would prefer to have a terrible standard that everyone follows than to not have a standard at all–this goes for all aspects of the code base: the database, application code, javascript, CSS, etc.  Obviously, having a solid standard that is rooted in pragmatism is preferable to one that is thrown together with little or no thought, or is based on an outdated principle (e.g. scarcity of storage space or Hungarian notation). I believe that conforming to a solid standard is one of the keys to developing maintainable software.

Anyway, here are my thoughts on data modeling standards and conventions.  It’s basically just a brain dump at this point, and I am posting it (almost) verbatim from the initial, informal draft, with minor commentary in italics.

Modeling and Design

The logical and physical database models will be created up front, and will be built to reflect our best understanding of the data storage/retrieval requirements of the system.

Although the deployed database may be generated using [our ORM’s] schema export tool, the database will be designed and modeled up front, and independently of the application domain model. In other words, we will not design the database to conform with the domain structure, nor design the domain structure to conform with the database—we will develop each independently and use [our ORM tool] to bridge the gap.

This is something I firmly believe: your data model is at least as important as your domain model, and should reflect the absolute best practices for the storing and querying of the system’s data.  The data model is not an afterthought or an implementation detail—it is one of the core components of your system.

I may do a full post on this specific point, as I have pretty strong opinions on it.  🙂

Naming Conventions

  • Every effort will be made to make each identifier in the database as descriptive and expressive as possible.
  • Unless noted elsewhere, identifiers will not use a type-specific prefix (e.g. tables will not begin with “tbl”, views will not begin with “vw”).
  • All table, column, and view names will be expressed in lower case letters.
  • All table, column, and view names will be singular (“customer” instead of “customers”).
  • Words that make up table, column, and view names will be separated by an underscore (e.g. “customer_status_id”).
  • Words that make up table, column, and view names will not be abbreviated, concatenated, or acronym-based unless necessary.
  • Many-to-many table names will include both related table names, when possible.
  • Where appropriate, column names will be prefixed with their table name (“customer_id” instead of “id”).

I have to admit I don’t have an air-tight reason for this last one.  I like PK/FK names to be the same on both sides, but that isn’t always possible anyway, e.g. with a self-referencing table or multiple FKs to the same table.

  • All identifiers will be limited to 30 characters. This rule supersedes the rules regarding expressiveness and abbreviations. [I only include this requirement if there is any chance the system will need to be database platform agnostic.]
  • Indexes will be prefixed with “IX_”.
  • Primary Keys will be prefixed with “PK_”.
  • Foreign Keys will be prefixed with “FK_”.
  • Unique Constraints will be prefixed with “UC_”.
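
To make these concrete, here is a minimal sketch of a few tables defined under these conventions.  The table and column names are made up for illustration, and the syntax assumes SQL Server; adjust types and identity syntax for other platforms.

    -- Hypothetical tables illustrating the naming conventions above.
    -- Names are singular, lower case, underscore-separated, and unabbreviated;
    -- constraints and indexes carry the PK_/FK_/UC_/IX_ prefixes.

    create table customer_status (
        customer_status_id   int identity(1, 1) not null,
        customer_status_name varchar(50)        not null,
        constraint PK_customer_status primary key (customer_status_id)
    );

    create table product (
        product_id   int identity(1, 1) not null,
        product_name varchar(100)       not null,
        constraint PK_product primary key (product_id)
    );

    create table customer (
        customer_id        int identity(1, 1) not null,  -- column name prefixed with its table name
        customer_status_id int                not null,
        first_name         varchar(100)       not null,
        last_name          varchar(100)       not null,
        constraint PK_customer primary key (customer_id),
        constraint FK_customer_customer_status
            foreign key (customer_status_id) references customer_status (customer_status_id)
    );

    -- Many-to-many table named after both related tables.
    create table customer_product (
        customer_product_id int identity(1, 1) not null,
        customer_id         int                not null,
        product_id          int                not null,
        constraint PK_customer_product primary key (customer_product_id),
        constraint UC_customer_product unique (customer_id, product_id),
        constraint FK_customer_product_customer
            foreign key (customer_id) references customer (customer_id),
        constraint FK_customer_product_product
            foreign key (product_id) references product (product_id)
    );

    -- Basic index on a foreign key column, created up front (see the principles below).
    create index IX_customer_customer_status_id on customer (customer_status_id);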

General Design Principles

  • The physical representation of the data model will use physical PK/FK relationships, not logical/metadata relationships.

Seem like a ridiculous thing to have to say?  As a consultant, you will definitely see legacy systems with no foreign key constraints—there are still developers out there who believe that formally enforcing referential integrity is overkill.
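And retrofitting the constraint is cheap.  Assuming any orphaned rows have already been cleaned up, enforcing an existing "logical" relationship physically is a single statement; the customer_order table below is hypothetical, and the syntax is SQL Server.

    -- Hypothetical legacy case: customer_order rows reference customer only by
    -- convention. Once orphaned rows are fixed, make the relationship physical.
    alter table customer_order
        add constraint FK_customer_order_customer
        foreign key (customer_id) references customer (customer_id);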

  • Each logical entity in the database will be represented by a separate table. There will be no concept of a “general entity” table.

I am not a fan of systems that track relationships between logical entities as metadata.  To me this is a design smell that typifies developing a domain model (based on OOAD/OOP principles) and trying to retrofit it into a data model.  A data model and a class hierarchy are fundamentally different concepts, and the concerns that each addresses should be considered independently of the other.  There are ways of doing inheritance in a data model; they certainly can work, and there are circumstances where they may be necessary.  But a “separate table per logical entity” is always my opening position when starting a new project.
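
Since this comes up so often, here is a hypothetical sketch of the contrast: the “general entity” approach pushes meaning into metadata, while the table-per-entity approach states it in the schema.

    -- The "general entity" smell: one table for every kind of thing, with the
    -- meaning of rows and relationships living in metadata (and often unenforced).
    create table general_entity (
        general_entity_id int identity(1, 1) not null,
        entity_type       varchar(50)        not null,  -- 'customer', 'invoice', 'product', ...
        parent_entity_id  int                null,      -- which relationship? the schema can't say
        constraint PK_general_entity primary key (general_entity_id)
    );

    -- The opening position: one table per logical entity, with the relationship
    -- explicit and enforced (customer is the hypothetical table sketched earlier).
    create table invoice (
        invoice_id  int identity(1, 1) not null,
        customer_id int                not null,
        constraint PK_invoice primary key (invoice_id),
        constraint FK_invoice_customer
            foreign key (customer_id) references customer (customer_id)
    );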

  • Common sense will dictate the up-front creation of basic indexes (e.g. indexes on foreign keys). Performance testing and tuning will determine where other indexes need to be created.
  • Index creation and tuning may be implementation specific; each client implementation may need a different level of tuning based on their data.
  • Database normalization (to 3NF) will be adhered to in all aspects of the transactional data model. Normalization principles will only be violated in situations where a performance problem can’t be solved by any other means.

(EDIT: I can’t believe I did this, as it’s such a key topic for me, but I substituted “4NF” both in this post and in the comments when I really meant “3NF”.  I fixed it above.)

I originally wrote about three paragraphs about normalization, as I have very strong opinions on this topic.  It was mostly ranting about misconceptions about normalization, what it does and doesn’t mean, and (again) how OOAD/OOP principles tend to bleed into database conversations (just because you have a separate address class doesn’t necessarily mean you need a separate address table!).

Of course, ranting belongs here, not in a client deliverable, so I pared it down to the text shown above.
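
To give one hedged example of the address point: in a hypothetical model where a customer has exactly one billing address, 3NF is perfectly satisfied by address columns on the customer table; a separate table is only warranted when the data itself demands it, e.g. when a customer can have many addresses.  (Table and column names below are hypothetical, SQL Server syntax assumed.)

    -- Scenario A: one billing address per customer. No separate table is required
    -- by 3NF, regardless of how the classes are factored in the domain model.
    alter table customer add
        billing_street_address varchar(200) null,
        billing_city           varchar(100) null,
        billing_postal_code    varchar(20)  null;

    -- Scenario B: many addresses per customer. Now the data, not the class
    -- diagram, justifies a separate table.
    create table customer_address (
        customer_address_id int identity(1, 1) not null,
        customer_id         int                not null,
        street_address      varchar(200)       not null,
        city                varchar(100)       not null,
        postal_code         varchar(20)        not null,
        constraint PK_customer_address primary key (customer_address_id),
        constraint FK_customer_address_customer
            foreign key (customer_id) references customer (customer_id)
    );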

  • Where possible, a single-column, database-generated surrogate key will be created for each table in lieu of using a natural or composite key.
  • Fine-grained security will be handled in the application layer, and where possible a single database login will be used for application access to the database.
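
A quick sketch of the surrogate-key bullet, using a hypothetical order line table: the natural composite key is not thrown away, it is just demoted from primary key to unique constraint.

    -- With a composite natural key, every child table and every join must carry
    -- both columns.
    create table order_line_natural (
        order_number int not null,
        line_number  int not null,
        quantity     int not null,
        constraint PK_order_line_natural primary key (order_number, line_number)
    );

    -- Preferred shape: a single-column, database-generated surrogate key, with the
    -- natural key preserved as a unique constraint.
    create table order_line (
        order_line_id int identity(1, 1) not null,
        order_number  int                not null,
        line_number   int                not null,
        quantity      int                not null,
        constraint PK_order_line primary key (order_line_id),
        constraint UC_order_line unique (order_number, line_number)
    );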

So that’s my list/rant of standards, or at least what I came up with off the top of my head.  If you vehemently disagree with any of the above, feel free to say so in the comments.


December 7, 2009

Writing Unit Tests for Data Access Code

Filed under: SQL — Phil Sandler @ 3:18 pm

Had a question today from a developer from another software project within my company:

How do I begin and rollback database transactions from a unit test?

My short answer . . .

You don’t.

. . . led to a pointed follow-up question . . .

So how do I write unit tests for my data access code?

. . . which led to a somewhat lengthy discussion.  It’s an interesting topic, and one that I have pretty strong opinions on, so I thought I would write a bit about it here.

Before I start ranting, I want to be clear that what follows is my opinion, and is intended as a discussion of general principles.  Every problem is different, and providing the right solution absolutely requires understanding the full context of that problem.  I don’t, and never would support the idea that there is only one correct approach.  Although I do strongly advocate the use of certain methodologies, technologies and frameworks, I’m all about choosing the right tool for the job.  For example, even though I vastly prefer using an ORM tool over inline SQL or stored procedures, there are times when the latter is actually a better choice.

Where Should Database Transaction Code Live?

So the question behind the original question goes something like:

Should I expose the control of database transactions (start/rollback/commit) to logical layers outside of the Data Access Layer (DAL)?

In a word, no.  I don’t say this because it’s a difficult thing to do (it’s not), or because it will result in a big change to the Data Access API (it won’t).  I say it because exposing fine-grained database transaction control outside the DAL dramatically changes the conceptual meaning of the DAL as a logical layer.  The DAL is a service/set of services whose sole purpose is to encapsulate all the functionality required to save, delete, and query data to and from the database.  Exposing transaction control outside of this layer violates the encapsulation of your DAL in the worst possible way.

Why?  Because database transactions are one of the foundations on which any database application is built.  Transactions are fundamental in ensuring the integrity and consistency of your data.  I don’t think I’m going out on a limb by saying that the data in your application is by far the most important element of the system; it’s likely the whole reason your application was created in the first place.

So although implementing database start/commit/rollback is generally not complex in terms of lines of code, I would highly recommend as a best practice:

Encapsulate database transaction logic into a well-understood class or set of classes that is accessible only from within the Data Access Layer, and not left open to the wide world and any team member (current or future) who decides to tinker with it.
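
To be clear about how little code is actually being protected here: stripped down to the database’s own terms, the begin/commit/rollback cycle the DAL encapsulates amounts to something like the following (a T-SQL sketch with hypothetical tables, assuming SQL Server 2005+ TRY/CATCH; in the application it would be the equivalent calls on your data access/ORM API).

    -- The raw mechanics the DAL wraps: start a transaction, do the work as a unit,
    -- commit, and roll everything back if any statement fails.
    begin try
        begin transaction;

        update customer
            set customer_status_id = 2
            where customer_id = 42;

        insert into customer_status_history (customer_id, customer_status_id)
            values (42, 2);

        commit transaction;
    end try
    begin catch
        if @@trancount > 0
            rollback transaction;

        -- Surface the failure to the caller rather than swallowing it.
        declare @error_message nvarchar(2048);
        set @error_message = error_message();
        raiserror(@error_message, 16, 1);
    end catch;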

Then How Do I Create Automated Tests for my Data Access Code?

The question itself contains something of a false premise from my perspective.  Before even attempting to answer, I would want to be sure that the following questions have been answered to my satisfaction:

Do you really need to test your DAL discretely/directly?  If so, why?

It sounds like a basic question, but you might be surprised by the puzzled looks you get when you ask it.  There is a reason why you want to test your DAL, so what is it?  What, exactly, are you trying to protect against?

I can think of a few answers that would satisfy me that some form of automated, discrete Data Access testing would need to be done:

  • The DAL is going to be refactored and needs some basic tests to ensure it was done correctly, and a full regression test on the system is not possible for whatever reason.
  • The system has some incredibly complex edge cases for data that need to be tested, and would be extremely difficult to set up for manual/integration/QA testing.  The testing would be limited to these cases.
  • The software is very data-centric, so our discrete DAL testing essentially is our integration/system testing.
  • Uncertainty about the parity between the ORM representation of your database and your actual database schema (whether it be LLBL-generated code or NHibernate configs).

So assuming it turns out that, yes, you need to create discrete tests against your DAL, the question becomes:

How do I create automated tests for my Data Access code without exposing database transaction control outside of the DAL?

If you’re going to be testing the DAL discretely, it’s a given that you’ll be doing it against known data.  If setting up predictable, known data in advance (e.g. via a restore of a reference database) is not an option, that means you will be creating the data from within your testing code, either as part of each discrete test or in a setup method that gets hit at the start of each test run.  This data can clutter your dev/test database over time, and become huge and unwieldy.  Dealing with this data really becomes the core of the question.  I think the options are as follows, in order of preference:

Option 1: let the data build up, and don’t worry about it.  It’s a test database, so filling it up with meaningless data should not have any real impact.  If the data created by these tests really is problematic (e.g. getting in the way of manual unit testing, demoing, or QA testing), create a separate database specifically for your DAL tests.  In the unlikely event that even this separate database gets out of hand over time, you can simply re-create it from scratch periodically.
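
(For what it’s worth, re-creating a throwaway DAL-test database is itself a trivial script; the database name below is hypothetical, the syntax is SQL Server, and your normal schema script or [ORM] schema export would run immediately afterwards.)

    -- Drop and re-create the dedicated DAL-test database from scratch.
    if db_id('dal_test_db') is not null
        drop database dal_test_db;

    create database dal_test_db;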

Option 2: track the data you create and delete it in your teardown method.  If you can create the data in your setup method, what’s stopping you from deleting it in your teardown method?  This will only work if your Data Access API supports deleting the type of data you are working with; I would not recommend adding delete functionality to your API for the sole purpose of testing.

Option 3: mark the created data in such a way that it can be easily deleted.  I have taken this approach in the past—basically when I created data that was to be used in discrete automated tests, I would prefix and/or suffix key areas with a known string of characters, like “***TESTDATA***“.  This made it easy to find and purge the data later by deleting it directly via a specially-created stored proc or ad-hoc query.  This cleanup code could be run manually when needed, via a scheduled process, or as part of the build.
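
The purge itself can be an ad-hoc script (or stored proc) keyed on the marker, extended to whichever tables and columns the tests actually touch; the table and column names below are hypothetical, and children are deleted before parents so foreign keys stay satisfied.

    -- Purge rows created by automated DAL tests, identified by the marker string.
    delete from customer_address
        where street_address like '%***TESTDATA***%';

    delete from customer
        where last_name like '%***TESTDATA***%';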

Option 4: use your unit testing framework’s Rollback support to wrap your test in a COM+ transaction.  This can be tricky to set up, depending on the database you are using and the security environment of your servers.  But if you can get past the initial setup and get it working both locally and on your build server, this may actually be the best and easiest option.  MbUnit supports this functionality out of the box (MbUnit RollbackAttribute), and I believe NUnit supports it via an add-on (Google "NUnit Rollback").  I’m sure other frameworks support it in one form or another as well.

Conclusion

Use of automated testing should be part of an overall testing/QA strategy.  It is a great tool for your tool belt, and one that I recommend and use extensively on my own projects.  But it is a means to an end, not an end in itself.  You can have 100%  coverage of your application code and still have software that doesn’t work, and you can have 0% coverage and have software that is completely bug-free.

Also note that when I say “coverage” here, I mean automated coverage (as measured by a tool such as NCover).  Overall test coverage really includes a number of factors, including:

  • Automated unit testing
  • Manual unit testing
  • Integration testing (automated or manual)
  • QA/System/Regression testing (automated or manual)

Consider the possibility that your DAL is being tested implicitly by some other aspect of your overall testing strategy, and that discrete, automated testing is not necessary.

