Probably the most important step in getting your site found in a search engine is the one in which the search engine crawls it. There are things you can do, and things you can avoid, to make this process as painless as possible for the search engine, which will in turn make it as painless as possible for the webmaster.
Since Google dominates the search market by such a large margin, it is always a good idea to listen to what they have to say about such matters. So when they post a presentation with tips on optimizing crawling and indexing, you'll probably want to pay attention.
Google has done just that, highlighting things to stay away from and things you can do to enhance your site's crawlability. Here is that presentation with specific examples of URLs.
"The Internet is a big place; new content is being created all the time," says Google Webmaster Trends Analyst Susan Moskwa. "Google has a finite number of resources, so when faced with the nearly-infinite quantity of content that's available online, Googlebot is only able to find and crawl a percentage of that content. Then, of the content we've crawled, we're only able to index a portion."
"URLs are like the bridges between your website and a search engine's crawler: crawlers need to be able to find and cross those bridges (i.e., find and crawl your URLs) in order to get to your site's content," continues Moskwa. "If your URLs are complicated or redundant, crawlers are going to spend time tracing and retracing their steps; if your URLs are organized and lead directly to distinct content, crawlers can spend their time accessing your content rather than crawling through empty pages, or crawling the same content over and over via different URLs."
If you want to get crawled faster by Google, you should remove user-specific details from URLs. Specifics of this can be viewed in the slideshow. Basically, URL parameters that don't change the content of the page (session IDs, for example) should be removed from the URL and stored in a cookie instead. This will reduce the number of URLs that point to the same content, and speed up crawling.
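To make the idea concrete, here is a minimal Python sketch of stripping user-specific parameters from a URL before it is linked to. The parameter names in USER_SPECIFIC_PARAMS and the example.com URLs are assumptions for illustration only; which parameters are safe to drop depends entirely on your own site.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical parameter names; adjust to whatever your site uses.
USER_SPECIFIC_PARAMS = {"sessionid", "sid", "affiliate_id"}


def strip_user_params(url):
    """Return the URL without parameters that do not change page content."""
    parts = urlsplit(url)
    # Keep only the query parameters that actually select content.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in USER_SPECIFIC_PARAMS
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


if __name__ == "__main__":
    print(strip_user_params("http://www.example.com/skates?sessionid=123&color=red"))
```

The value that was stripped (the session ID, in this example) would then be stored in a cookie rather than carried in the link, so every visitor and every crawler sees the same URL for the same page.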
Google says infinite spaces, meaning sets of links that can generate an effectively endless supply of URLs, are a waste of time and bandwidth for all. That is why you should consider taking action when you have calendars that link to infinite numbers of past/future dates with unique URLs, or other paginated data.
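The article does not spell out which action to take, so the following Python sketch shows just one possible approach, offered as an assumption rather than Google's recommendation: a calendar page only emits "next" and "previous" links for months within a fixed window around today. The 12-month limits and the function names are hypothetical.

```python
from datetime import date

# Assumed policy: link only to months within this window around today.
MAX_MONTHS_AHEAD = 12
MAX_MONTHS_BACK = 12


def month_index(d):
    """Map a date to a running month count so months can be subtracted."""
    return d.year * 12 + (d.month - 1)


def should_link(target, today):
    """True if the target month is close enough to today to deserve a link."""
    offset = month_index(target) - month_index(today)
    return -MAX_MONTHS_BACK <= offset <= MAX_MONTHS_AHEAD


def next_month(d):
    """First day of the month after d."""
    idx = month_index(d) + 1
    return date(idx // 12, idx % 12 + 1, 1)


if __name__ == "__main__":
    today = date(2010, 1, 15)
    candidate = next_month(date(2010, 12, 1))
    print(candidate, should_link(candidate, today))
```

With a bound like this, a crawler following "next month" links reaches the edge of the window and stops, instead of walking forward through empty calendar pages indefinitely.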
Tell Google to ignore pages it can't crawl. This includes things like log-in pages, contact forms, shopping carts, and other pages that require users to perform actions that crawlers can't perform themselves. You can do this with the robots.txt file.
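As a rough illustration, the Python sketch below embeds a small robots.txt and checks it with the standard library's urllib.robotparser. The paths (/login, /cart, /contact) and the example.com URLs are assumptions made up for this example; your own site's paths will differ.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt blocking pages that require user actions.
ROBOTS_TXT = """\
User-agent: *
Disallow: /login
Disallow: /cart
Disallow: /contact
"""

parser = RobotFileParser()
# parse() takes the file's lines; no network request is made here.
parser.parse(ROBOTS_TXT.splitlines())


def is_crawlable(url, agent="Googlebot"):
    """True if the rules above allow the given user agent to fetch the URL."""
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for url in (
        "http://www.example.com/login",
        "http://www.example.com/products/skates",
    ):
        print(url, is_crawlable(url))
```

Running a check like this against your real robots.txt before publishing it is a cheap way to confirm you are blocking the pages you intended to block, and nothing more.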
Finally, avoid duplicate content when possible. Google likes to have one URL for each piece of content. They do recognize that this is not always possible (content management systems, for instance, often expose the same page under several URLs), which is why the canonical link element exists: it lets you specify the preferred URL for a particular piece of content.
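For a sense of what that looks like in practice, here is a small Python sketch that builds a canonical link element for a page's head section and reads one back out of a page. The helper names and the example.com URL are hypothetical; only the rel="canonical" element itself comes from the article.

```python
from html import escape
from html.parser import HTMLParser


def canonical_link_tag(preferred_url):
    """Build the element that goes in the <head> of every duplicate page."""
    return '<link rel="canonical" href="{}">'.format(escape(preferred_url, quote=True))


class CanonicalFinder(HTMLParser):
    """Records the href of the first rel="canonical" link in a document."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        rel = (attributes.get("rel") or "").lower()
        if tag == "link" and rel == "canonical" and self.canonical is None:
            self.canonical = attributes.get("href")


def find_canonical(html_text):
    """Return the canonical URL declared in html_text, or None."""
    finder = CanonicalFinder()
    finder.feed(html_text)
    return finder.canonical


if __name__ == "__main__":
    tag = canonical_link_tag("http://www.example.com/skates?color=red&size=9")
    print(tag)
    print(find_canonical("<html><head>" + tag + "</head><body></body></html>"))
```

Every variant of a page (sorted differently, tagged with tracking parameters, and so on) would carry the same tag pointing at the one URL you want indexed.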