Improving Search Results Relevance with Google Site Search

One of the challenges we have at Marshall University with the site search results provided by our Google Custom Search engine is trying to figure out ways to bubble the right content to the top of the results.

Normally, maintaining relevant search results isn’t that difficult, because as you change or delete pages on your site, the various crawlers update results for pages that have changed and simply expire and remove pages that are no longer present.  Usually this happens pretty quickly, and there’s no need for any sort of manual intervention.

At Marshall we have a couple of semi-unique problems that present some real challenges to search result relevancy.  Let’s talk briefly about each of these problems, and what steps can be taken to mitigate the effect each has on your site search results.

Our First Problem – Content that Shouldn’t Still Exist

Dealing with this problem can be challenging if you have a lot of different site owners or content creators.  At last count, we have about 375 different groups or departments that have independently maintained sites.

Ideally, a site administrator will keep their site up-to-date, and remove irrelevant or outdated information as necessary.  When you’re working with a larger group of editors, though, sometimes things fall through the cracks or this simply doesn’t happen.

Because we weren’t able to reliably depend on each site to expire its own content when it fell out of date, we had to take some steps to make sure that pages that looked like they might be abandoned could be removed from the search results without removing the pages themselves (in case the lack of updates over time was simply an oversight).

To make this feasible on a site the size of Marshall’s, some automation has to come into play. In this area, I was fortunate to be able to enlist the assistance of Chris McComas, one of our Systems Integration Specialists. Chris created a small Python script that runs on a regular schedule. That script examines the last-modified header of each page on a site.  If a page’s last-modified date is older than whatever threshold we’re using at the time (right now it’s looking for pages older than one year), we record the URL of the page and add it to the exclusion list in our robots.txt file.  This allows the page and its content to remain untouched, but as crawlers update their indices of our site content, the page will be removed from our site search results.
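To make the approach concrete, here’s a minimal sketch of that kind of check. This is not Chris’s actual script; the page list, the one-year threshold, and the robots.txt path are all placeholder assumptions.

```python
# A minimal sketch of the stale-page check described above -- not the actual
# production script. URLs, the age threshold, and file paths are placeholders.
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime
from urllib.parse import urlparse

import requests  # third-party: pip install requests

PAGES_TO_CHECK = [
    "http://www.marshall.edu/example-dept/old-page.html",      # hypothetical
    "http://www.marshall.edu/example-dept/current-page.html",  # hypothetical
]
MAX_AGE = timedelta(days=365)   # "older than one year"
ROBOTS_TXT = "robots.txt"       # assumes an existing "User-agent: *" block

stale_paths = []
now = datetime.now(timezone.utc)

for url in PAGES_TO_CHECK:
    resp = requests.head(url, timeout=10)
    last_modified = resp.headers.get("Last-Modified")
    if not last_modified:
        # No header means the script can't judge the page's age, so it leaves
        # the page alone (the first "con" discussed below).
        continue
    if now - parsedate_to_datetime(last_modified) > MAX_AGE:
        stale_paths.append(urlparse(url).path)

# Append a Disallow rule for each stale page so crawlers drop it from their
# indices while the page itself stays online and untouched.
with open(ROBOTS_TXT, "a") as robots:
    for path in stale_paths:
        robots.write(f"Disallow: {path}\n")
```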

Pros of this approach

The biggest plus to having an automated process that can examine and expire content is the amount of time it saves the Enterprise Applications team.   This is all work that, absent some sort of automation, someone would have to perform manually.  Automating it in this way saves a ton of time, and the return on investment is pretty good.

Cons of this approach

The two biggest problems with this approach also relate to the fact that it’s an automated process.  The first problem is that most, but not all, pages on our site have accurate last-modified headers.  If a page doesn’t, this script is going to ignore it, and even if it’s something that shouldn’t be hanging around any longer, the script won’t take care of it.  The second problem is that there will inevitably be some content that hasn’t been updated in the last year but which still shouldn’t be expired.  Think about a site that provides a history of the university, or similar semi-static content.  Those types of sites are unlikely to be regularly updated once they’re built – but they should remain accessible in the search index as long as they’re current and active.  For these situations, we’re currently tweaking the scripting process to provide manual exclusions for sites that require them.  This is obviously a bit of trial and error, because we’re unlikely to know a site needs an exclusion until someone reports that it’s missing from search and shouldn’t be.

It’s also worth pointing out that there are some problems that will be difficult to automate your way around.  At some point, site owners and content editors must take responsibility for a basic understanding of site indexing and for managing what they keep online, because otherwise it’s nearly impossible to correct these problems.

A good example of this would be a site like our Human Resources site, which is heavily weighted toward downloadable forms in PDF or Office document formats.

Suppose that the HR site editor puts a link on their Forms page to a form authorizing payroll deductions.  If they subsequently replace that form with an updated copy, the inclination of an inexperienced editor might be to remove the link to the old form and replace it with a link to the new copy.

Doing things this way without also removing the original copy from the site leaves lots of potential for problems.  Suppose you have a PDF with the title “PEIA Benefits Explanation” in its document title or content.  If Human Resources were to add such a PDF to their site each year, and rather than removing the older content simply linked to the newer PDF, within a few years you’d have several PDFs returned for the search “PEIA Benefits Explanation”, and the end user would have a really tough time figuring out which one they should be looking at.

It’s important to remind your site editors or content creators that unlinking a document or web page does not equal removing it from search results.  Make sure that there is an understanding that if they put something on the web and don’t explicitly remove it, they should assume search engines can still see it.

Our Second Problem – Content that Is Confusing or Duplicated

Before diving into this problem, I should note that I don’t have any easy solutions for this one. It simply requires that you methodically work through the content on your site, excluding certain content, subtly renaming other content, and perhaps even removing and relinking content in other places.

Duplicate content is a big problem on the Marshall University site, and one that we’re working hard to solve, but it’s also one that, by its nature, is pretty difficult to chase down.  Let me give you a couple of specific examples.

The screenshot you see below is a sampling of the results you would see if you performed a search for “nomination form” on our site.

 

[Screenshot: Google search results for “nomination form” on site:marshall.edu]

 

The big problem here is that while all of these might be “nomination forms” of one kind or another, if someone comes to the site searching for that term, it’s going to be difficult for them to sort through multiple results like this to find what it is they’re actually looking for.

This is again a place where you’ll need the help of your content editors to adequately solve the problem.  In general, a site editor should try to name their subpages, forms, and documents in a way that makes it clear to a visitor to the Marshall site who that content is for.

It’s just *slightly* more work to identify your form as the “Center for Teaching and Learning Nomination Form” than it is to just call it “Nomination Form”, but it’s going to help produce an infinitely better experience for your end users.

Another Example: Reproducing Instead of Linking

One way that unnecessary duplicate content ends up in your search index in the first place is through content editors who reproduce existing content rather than linking to it at its authoritative source.

Have a look at this example:

[Screenshot: Google search results for “leave request form” on site:marshall.edu]

This is a screenshot of the search results you would see if you were to perform a search on the Marshall site for the term “leave request form”.  Here you see five different search results, on five different sites/URLs, yet we’re only actually seeing three different forms.

Additionally, the “Leave Request Form” you see at the top of the results is a form provided on the Research Corporation site for their employees, and it’s different from the one that Marshall employees would use.  One thing we could do to improve this would be to rename the MURC form to “MURC Leave Request Form”.  In that way, it would be clear to the user what, exactly, they were clicking on when they downloaded that PDF.   It’s worse when you consider that the MURC form and the main University form are not only similarly named, but they look similar as well – so it is easy to understand why a visitor might download, fill out, and turn in the incorrect version of the form.

The second problem that we see here is that our second search result, “Staff Leave Request Form”, is the correct form for most employees, but it’s hosted on the School of Pharmacy website.  Why?  Rather than linking to the leave request form on the Human Resources site (the department that should be the authoritative source for such forms), the site editor has instead downloaded a copy of the form, uploaded it to their own site, and linked to that copy for their users.  Now we have two different versions of the same PDF, in two different locations across two websites, occupying search result space.

Site editors should never download content from another site, re-upload it to their own site, and then link to that local copy.  The right answer is to always link to the original source document on the site that is authoritative for that resource.

Reasons Why Not Listening to This Advice is Bad

It doesn’t take much imagination to come up with a scenario where the official University leave request form is changed in the Human Resources office (where, logically, that type of change should occur), yet the editor of the Pharmacy site doesn’t hear about the change. Now, instead of the small problem of the same leave request form duplicated across two sites, we’d have the bigger problem of an outdated and now inaccurate version of the Leave Request Form showing up as the top result of our searches.

Ok, you get it, but what can you do to fix some of this without the site editors’ help?

One thing that you can do to improve the accuracy of your search results if you’re seeing this type of problem is to take advantage of Google Custom Search Engine promotions to create a curated set of highlighted results when users are having trouble getting to the right version of a certain type of content.

Promotions can also help if there are terms on your site that create confusion for users.  A great example of this on the Marshall site is a search for the term “financial aid”.  Left completely to the mercy of the indexing crawlers, our site would return very little if a visitor searched for the term “financial aid”.  Why?  Because we do not have a Financial Aid office; we have an office of “Student Financial Assistance”.  This type of unique department naming can be cute and can help differentiate certain departments or programs, but as a site administrator you have to consider the way that users are likely to search for information.

Most users (myself included) are going to type “financial aid” in a search box when we need information about financial aid, regardless of what an office has decided to actually name themselves.

Synonyms in the GCSE interface allow you to set up pointers that can, for example, associate the term “financial aid” with your Office of Student Financial Assistance.

Using synonyms and promotions in GCSE, we can now create a curated result for the search “Financial Aid” on Marshall’s site.

[Screenshot: curated “financial aid” search result on the Marshall site]

 

The top result that you see is our curated, highlighted result for this search term.  By combining the ability to create these curated results with regular reviews of the top search terms on your site, you can begin to build a specific list of curated results for some of the top searches your users are performing.  You can see this on Marshall’s site by searching for terms like “transcript”, “exam schedule”, “directions”, and many others.  Each of these terms was at one time or another a top search that was having difficulty surfacing the correct content.  Through these curated results, you can help your users get to where they actually want to go even while you work with site editors to clean up their content.
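If you want to spot-check a promotion outside the browser, the Custom Search JSON API is one way to do it. The sketch below is only an illustration: the API key and engine ID are placeholders, and whether promotions appear in the response depends on how your engine is configured.

```python
# A rough check that a curated promotion fires for a given query, using the
# Custom Search JSON API. The key and engine ID below are placeholders.
import requests

API_KEY = "YOUR_API_KEY"      # placeholder
ENGINE_ID = "YOUR_ENGINE_ID"  # placeholder ("cx" value for your engine)

resp = requests.get(
    "https://www.googleapis.com/customsearch/v1",
    params={"key": API_KEY, "cx": ENGINE_ID, "q": "financial aid"},
    timeout=10,
)
data = resp.json()

# When a promotion is configured for this query, the response should include
# a "promotions" list alongside the ordinary organic "items".
for promo in data.get("promotions", []):
    print("Promotion:", promo.get("title"), promo.get("link"))
for item in data.get("items", [])[:5]:
    print("Result:   ", item.get("title"), item.get("link"))
```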

To learn more about how to use the GCSE promotion and synonym interface, follow this link to Google.

Suggesting Search Terms

Another thing that you can do to help your users find the right content is to take advantage of Google Custom Search Engine’s ability to parse your most popular queries over time, along with any promotions or refinements you’ve created, to build an autocomplete function for your site search box.

[Screenshot: autocomplete suggestions in the Marshall site search box after typing “tr”]

Here you see the results of our site autocomplete when I begin typing the letters ‘tr’ into the search box. These autocompletion suggestions come primarily from three areas: the popular terms recent visitors have searched for using that combination of letters, the promotions and refinements that have been set up in GCSE, and the latest index of our site content.

It’s a small feature, and many users won’t notice it, but when it surfaces exactly what is being searched for before the user has to hit return, it’s a very nice experience.  It’s also a very low-weight way to add some simple, user-enhancing functionality to your site.

Enhancing Your Site 404 Error Pages

It’s useful to remember that relevant search results and 404 errors do go hand in hand.  The more aggressively you expire site content, the more likely it is that you’ll end up generating 404 errors for at least some users before that expired content gets completely removed from site indexes.

If users have to encounter 404 errors on your site, it’s best to make those pages as useful and informative as they can be.  Some may never read the text of the page, but at least you’ll have the satisfaction of knowing that you gave the end user as many options as possible to find what they are looking for.

Let’s take a look at the 404 page we’re using now at Marshall, and some of the ways it tries to guide a user to the content they need when they encounter this page.

Try visiting the URL http://www.marshall.edu/johnc to get a complete overview of what our 404 page looks like.

Toward the bottom of the page, you’ll see that the user is provided another search box, as well as a link to each page of the site index (by letter).  That’s nice, but it’s not really all that useful.  There is already a search box at the top of the page, and the site index by letter is a curated list of sites, so what they want may or may not be listed there.

[Screenshot: the standard 404 page shown for marshall.edu/johnc]

Rather than the URL marshall.edu/johnc, try the URL http://www.marshall.edu/registrarr, which is an intentional misspelling of an existing site to demonstrate this functionality.  Now, things get a little more helpful.

[Screenshot: the 404 page for marshall.edu/registrarr suggesting the Registrar site]

Now, rather than simply giving the visitor an index list of sites by letter, we notice that there is something on our server that’s pretty close to what they typed, decide that it’s close enough, and offer the actual URL of the Registrar site as a suggestion.  That’s a little more helpful.
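I won’t reproduce our 404 handler here, but a minimal sketch of this kind of “did you mean” matching might look like the following; the list of site slugs and the similarity cutoff are assumptions for illustration.

```python
# A minimal sketch of "did you mean" matching for a 404 handler. The list of
# site slugs and the cutoff value are illustrative, not our production code.
import difflib

KNOWN_SITES = ["registrar", "career-services", "hr", "pharmacy", "library"]

def suggest_site(requested_path, cutoff=0.8):
    """Return the closest known site slug for a bad path, or None."""
    slug = requested_path.strip("/").split("/")[0].lower()
    matches = difflib.get_close_matches(slug, KNOWN_SITES, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(suggest_site("/registrarr"))  # -> "registrar" (close enough to suggest)
print(suggest_site("/johnc"))       # -> None (nothing similar, show the index)
```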

What if it was a site that used to be on the server, but ended up being removed on purpose for one reason or another?

Using a nice JSON web service provided by The Internet Archive, we can check their archives to see if the URL the user is requesting was ever snapshotted and placed in the archive index.  If so, we can retrieve the link for that snapshot and let the user know that while the site isn’t here anymore, its removal wasn’t a mistake, and that an archived copy is available if they still need to see the content.  You can see this by going to http://www.marshall.edu/itvs, a site that used to be online and has subsequently been removed.
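The Internet Archive’s Wayback Machine exposes an “availability” endpoint that fits this use case; here is a small sketch of that lookup (assuming that endpoint is the service in question):

```python
# A sketch of asking the Wayback Machine whether an archived snapshot exists
# for a removed page, so the 404 page can offer the archive link.
import requests

def wayback_snapshot(url):
    """Return the closest archived snapshot URL for `url`, or None."""
    resp = requests.get(
        "https://archive.org/wayback/available",
        params={"url": url},
        timeout=10,
    )
    closest = resp.json().get("archived_snapshots", {}).get("closest", {})
    return closest.get("url") if closest.get("available") else None

print(wayback_snapshot("http://www.marshall.edu/itvs"))
```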

[Screenshot: the 404 page for marshall.edu/itvs offering a link to an Internet Archive snapshot]

Examining Your Logs and Analytical Data Regularly

The last and potentially most helpful thing that you can do to help clean up search relevancy issues on large content or multi-editor sites is to pay close attention to your logs and other data sources. It’s important to make sure that you’re using more than one source of  analytics data.   This helps you validate one against the other, but it can also help you see patterns in one that may not be readily apparent in another.

A great example of this on the Marshall.edu domain is the Career Services site.  While Google Analytics won’t give you specific request URI data for 404 pages, Webmaster Tools combined with IIS logs gives you a much cleaner picture.

If I had been relying solely on Google Analytics, I would never have noticed that a large percentage of users attempting to reach our Career Services site at http://www.marshall.edu/career-services/ were seeing 404 errors because they were typing either marshall.edu/careerservices or marshall.edu/carreer-services (note the additional r).
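Pulling that kind of pattern out of IIS logs is straightforward to script. The sketch below tallies 404s by requested path from W3C-format log files; the file name is a placeholder, and the field layout depends on your logging configuration, so the #Fields header is parsed rather than hard-coded.

```python
# A rough sketch of counting 404s by requested path in an IIS W3C-format log.
# The log file name is a placeholder; field positions come from the #Fields
# directive rather than being assumed.
from collections import Counter

LOG_FILE = "u_ex230101.log"  # placeholder log file name

counts = Counter()
fields = []
with open(LOG_FILE) as log:
    for line in log:
        if line.startswith("#Fields:"):
            fields = line.split()[1:]  # e.g. date, time, cs-uri-stem, sc-status, ...
            continue
        if line.startswith("#") or not line.strip():
            continue
        row = dict(zip(fields, line.split()))
        if row.get("sc-status") == "404":
            counts[row.get("cs-uri-stem", "?")] += 1

# The most frequently requested missing paths are good candidates for redirects.
for path, hits in counts.most_common(10):
    print(f"{hits:6d}  {path}")
```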

As I showed you above, the 404 page would have noticed that there was a site similar to both of these at the /career-services URL and suggested it, but it would be a better user experience if that could all happen in the background.

With the data that these disparate sources of analytics provided me, I was able to find out about the Career Services 404 issue and put rules in place that automatically transfer the user to their destination if they make either of those common mistakes, rather than asking them to click through to another link.

It’s Going to Be a Marathon

If you’ve gotten to this point in this post, it means you’re probably facing some of the same challenges that we’ve had to deal with.  The final piece of advice I can offer is simply this: be patient.  Improving search relevancy and expiring outdated content aren’t sexy projects that are going to be bullet points on a year-end highlight slide – but they really do have to be worked on if they’re ever going to improve.

You’re also probably going to have to do a lot of user education.  The average user isn’t going to care that there are 40 documents on 38 sites called “Rate Schedule”.  They’re only going to care that they can’t find the rate schedule.  Listen to the frustration behind the complaint rather than just the complaint itself, and sometimes you’ll be able to come up with a creative stop-gap solution that can solve the user’s problem in another way.

It also helps to remind site editors that, as users of the site themselves, it’s to their benefit to properly link to, name, and refine content.  It’s a team-created problem that will only ever be truly solved by a team-created solution.

 

2 thoughts on “Improving Search Results Relevance with Google Site Search”

  1. Does the fact that the result set changes at random bother/frustrate you or your users? For example if you go to marshall.edu and type “past, present and future of Marshall University” (without the quotes) into the search box, the result set bounces from 4780 to 11800 seemingly at random every time you hit refresh. If you quote the same string you get exactly one result, an HTML page, even though the phrase occurs in a PDF of the same content and that PDF will show up in the unquoted search, since the PDF is indexed? Our content editors are constantly frustrated by those kinds of actions with Google Custom Search on our site, just wondered if you all have had comments like that.

    1. @Dan – the free CSE is definitely not bulletproof. I know the result numbering issue that you’re speaking about – I’ve encountered it myself before. There’s a discussion thread where others are having the same/similar issues, and the general consensus is that it’s a bug in the paging system that GCSE uses when rendering the result sets (https://productforums.google.com/forum/#!topic/customsearch/TZxzt7J2XYU)

      There are a few other issues like that with the custom search engine that can be maddening. We considered for a time investing in a search appliance to drive site search results. I also experimented with using our SharePoint farm to handle the indexing and serving of results. Neither of those proposed solutions approached the ease of use, configuration, or comprehensiveness that the CSE offers – so I (and, for the most part, the users) have learned to live with the limitations.

      I’d love to say that I think Google will fix these sorts of buggy issues/limitations – but as it’s not a profit center for them, and in the case of the embedded education versions they also aren’t serving ads – I doubt it’ll get much attention.
