I am currently working on the import mechanism for Mercury, which sucks up data exported from other blogs in JSON format. (A WordPress XML-format importer is scheduled for later ... it was easier to roll my own JSON export format than figure out that horror.) In doing so, I reaffirmed something I've believed for a long time now: test against the real world, because the data you get from it will beat anything you can come up with.
When blog posts are imported, each one is tagged with a key/value object named legacy_id that holds the post's ID number in whatever system it was exported from. This way, if you reimport the same data, the system will know better than to blithely import a second copy of it.
What I neglected to do was make sure that the legacy_id check was confined to the scope of the current blog. As a result, some imports were failing because their legacy_id data collided with legacy_id data from another blog's import.
In the end, this was good news. It allowed me to put in checks to prevent that sort of thing from happening.
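For illustration, here is a minimal Python sketch of what a blog-scoped duplicate check might look like. Nothing in it is taken from Mercury's actual code: ImportTracker, import_posts, blog_id, and the "id" field on each post are hypothetical names, and the in-memory set stands in for whatever storage the real system uses. The point is only that the lookup key is the pair (blog_id, legacy_id) rather than legacy_id alone.

```python
# Hypothetical sketch: these names are illustrative, not Mercury's real API.

class ImportTracker:
    """Remembers which legacy IDs have already been imported, per blog."""

    def __init__(self):
        # Keyed by (blog_id, legacy_id), not legacy_id alone, so the same
        # legacy ID arriving from two different blogs does not collide.
        self._seen = set()

    def should_import(self, blog_id, legacy_id):
        return (blog_id, legacy_id) not in self._seen

    def record(self, blog_id, legacy_id):
        self._seen.add((blog_id, legacy_id))


def import_posts(tracker, blog_id, posts):
    """Import posts into one blog, skipping any already imported there."""
    imported = []
    for post in posts:
        legacy_id = post["id"]  # assumed field name in the export data
        if not tracker.should_import(blog_id, legacy_id):
            continue  # a reimport of the same post into the same blog
        tracker.record(blog_id, legacy_id)
        imported.append(post)
    return imported


tracker = ImportTracker()
first = import_posts(tracker, 1, [{"id": 10}, {"id": 11}])
other_blog = import_posts(tracker, 2, [{"id": 10}])  # same legacy ID, new blog
reimport = import_posts(tracker, 1, [{"id": 10}])    # duplicate in blog 1
```

With the check scoped this way, the post with legacy ID 10 is accepted once for each blog, while the second attempt to bring it into blog 1 is skipped. In a database-backed system, the same idea would more likely be expressed as a uniqueness rule over the pair of columns.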
I think I might well have realized this on my own at some point, but importing real-world data from multiple real-world blogs as a testing procedure helped shake things out all the more quickly.
This is, in fact, a big part of why I wanted to make Mercury my platform for deploying the various blogs I have. The more I had to throw at it, the better -- especially if that "more" consisted of exactly the kind of messy, inconsistent, problematic data that other people are likely to throw at it themselves!