Metadata Madness

Over the last year my library has been working on implementing a discovery product for our community hospitals’ website, and it has been quite an adventure.

We wanted to create a better website to help unlock the siloed information that our library subscribes to. Library users have no clue that Hurst’s the Heart is only available electronically via McGraw Hill. They could check the catalog, but they don’t. They go on to the library website and type the title in the search box. Now, as librarians, we know that unless you have a discovery system for that search box, the results come from the content on the library website, not from within the resources listed on the website. Hurst’s the Heart is not on the library’s website; it is within the McGraw Hill website. I used this ebook as an example, but the same thing is true of other ebooks from other vendors, ejournals, PubMed articles, etc. People try to search the library’s website like it is Google and expect to get results from PubMed or elsewhere.

Librarians have curated and organized their little hearts out trying to make things easy to find and navigate on the library website. But library websites seem to be a mystery to users. *Confession* I am a librarian and I am sometimes confused trying to find information on my public library’s site. Like it or not, the users (even savvy ones) expect a Google-like experience.

In order to provide this type of search and retrieval ability on the library website, we decided to implement a discovery system. In theory, a user would type in the words heart attack and all of the library resources on heart attacks would be displayed. You would see the ebooks that have heart attack as a topic or chapter, PubMed and CINAHL articles on heart attack, UpToDate or DynaMed results on heart attack, etc. Now, heart attack is a simple search that would yield a lot of stuff, but you get the point. The discovery system would crawl through the library resources and find the items relevant to the search, thus unlocking the resources within the silos so they can be seen on one site, the library site.
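To make the idea concrete, here is a toy Python sketch of one search box sitting over several silos. It is only an illustration: the records, the field names, and the `build_index` and `search` functions are all made up, and a real discovery product is far more sophisticated than a word lookup.

```python
from collections import defaultdict

# Hypothetical records from three separate "silos"; every value is invented.
RECORDS = [
    {"source": "ebook platform", "title": "Hurst's the Heart",
     "subjects": ["myocardial infarction", "heart attack"]},
    {"source": "article database",
     "title": "Outcomes after heart attack in older adults",
     "subjects": ["aged"]},
    {"source": "library website", "title": "Library hours and locations",
     "subjects": []},
]

def build_index(records):
    """Map each lowercase word in a title or subject to the records containing it."""
    index = defaultdict(set)
    for pos, rec in enumerate(records):
        text = " ".join([rec["title"], *rec["subjects"]]).lower()
        for word in text.split():
            index[word].add(pos)
    return index

def search(index, records, query):
    """Return the records whose metadata contains every word of the query."""
    words = query.lower().split()
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return [records[pos] for pos in sorted(hits)]

INDEX = build_index(RECORDS)
results = search(INDEX, RECORDS, "heart attack")
for rec in results:
    print(rec["source"], "|", rec["title"])
```

The point of the sketch is that the search only ever sees the words in the metadata. The ebook turns up for heart attack because its (invented) subject list contains those words, not because anything looked inside the book.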

However, in order to do this, the library resources MUST HAVE METADATA!!! I know that is a wild concept….the ebooks, articles, documents, videos, etc. all need metadata. The sad, strange truth is that library resources and library vendors have strayed away from good metadata.

Here are some examples:

– OCLC catalog records need EXTENSIVE cleaning and improving before they can go into a catalog. I sit on a large statewide consortium and the people who deal with loading the OCLC records continually lament that OCLC cataloging has really slipped.

– We were informed that ProQuest Safari Textbooks MARC records will no longer have subject headings and will lack important information for retrieval within the catalog. Incidentally, I was told that OCLC told ProQuest their new subject-less records were fine.

– ProQuest isn’t the only problem ebook vendor. McGraw Hill, ClinicalKey, and other publishers have crappy records as well. They are missing subjects (I am not even talking about MeSH…that is totally not there), authors, editors, chapter titles, etc.
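A quick way to see how bad a batch of records is would be to count what is missing before loading it. Below is a minimal Python sketch, assuming the records have already been converted to plain dictionaries. The field names, the sample batch, and the `missing_fields` and `audit` helpers are hypothetical and do not reflect any vendor's actual record format.

```python
# Fields a record should carry to be findable; this list is an assumption.
REQUIRED_FIELDS = ("title", "authors", "subjects", "chapter_titles")

def missing_fields(record):
    """List the required fields that are absent or empty in one record."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]

def audit(records):
    """Map each incomplete record's title to the fields it is missing."""
    report = {}
    for rec in records:
        gaps = missing_fields(rec)
        if gaps:
            report[rec.get("title") or "(untitled)"] = gaps
    return report

# Invented sample batch: one complete record and one sparse record.
BATCH = [
    {"title": "Well Cataloged Example Book", "authors": ["A. Author"],
     "subjects": ["Cardiology"], "chapter_titles": ["Chapter One"]},
    {"title": "Sparse Example Book", "authors": [], "subjects": [],
     "chapter_titles": []},
]

report = audit(BATCH)
for title, gaps in report.items():
    print(title, "is missing:", ", ".join(gaps))
```

A report like this will not fix anything, but it does turn "the records are crappy" into a list you can send back to the vendor.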

Things get even worse when we start looking at videos and images. Good quality images or videos can be difficult to find, and think of all of the images and videos that are in our multi-resource platforms.

Looking to the future I get even more concerned. Have you done an “up to the minute” COVID-19 literature search recently? If you have, then you will know much of the research out there exists in pre-print. Pre-print is the wild west of metadata. To be clear, I am not expecting any sort of indexing like you would find on a MEDLINE record. But depending on the item and the database, titles can be incorrect, authors can be missing, and the data from the abstract or full text can be missing. You may not notice this at first glance, because the title, authors, and abstract/full text are on the screen and can be seen by your eyes, BUT try to load that sucker into EndNote and you will see a blank record and “missing data.” This happens a lot with medRxiv records. Another example: you discover a good keyword in the abstract that you hadn’t thought of, and you search that word in the database but don’t retrieve the same article you discovered it from. That is because the metadata for that article is missing.

The National Library of Medicine wants to make data sets retrievable and there are a lot of other places that you can search for data sets. IMHO searching for data sets is only slightly worse than searching the pre-print literature. The metadata indicating what data is searchable is all over the place and there are few standards on what metadata is required for databases or search engines to search.

In addition to missing metadata, we have the problem of metadata hoarding. Depending on the company you purchase your discovery product from, the product may not do a great job of searching (if it searches at all) other library companies’ products. For example, EBSCO’s Discovery does a great job discovering EBSCO products like CINAHL and DynaMed, but it has problems discovering ebooks in ClinicalKey and ProQuest things. Same with ProQuest’s Summon: it does awesome with ProQuest things but falls short with EBSCO and ClinicalKey things. Library vendors are not making their metadata equally available to discovery product vendors. I can sort of understand the reasoning if you have companies with competing discovery products. I may not like it, but I get it. But I don’t understand why ebook and other companies with no discovery product aren’t tripping over themselves to make their metadata available to discovery vendors.

We noticed this metadata hoarding problem when trying to implement our discovery product. What was our duct tape solution? We loaded all of the ebook records for those non-participatory vendors into our catalog and made our catalog discoverable. Now why did I say that was a duct tape solution? Because the batch catalog records from these companies are often bad: they are missing key things like subjects, they lack chapter information, and they sure as heck don’t have the full text. The discovery system can only “discover” the ebook from the information within the bad catalog record. So if there is a book with the most perfect chapter on your topic but a poor record, the discovery system will never show the book. *Note* Not all catalog systems will work with a discovery system. The duct tape method I described won’t work if your catalog isn’t compatible. Also, we clean up the poor batch records, so eventually the records are improved, but it isn’t like having good metadata to the full text.
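Here is a small Python sketch of why the poor record hides the book. It assumes a deliberately simple matcher that only looks at title, subjects, and chapter titles. The book, its chapters, and the `matches` function are invented for illustration and are not how any particular discovery product works.

```python
def matches(record, query):
    """True if every word of the query appears in the record's metadata."""
    searchable = " ".join([
        record.get("title", ""),
        *record.get("subjects", []),
        *record.get("chapter_titles", []),
    ]).lower().split()
    return all(word in searchable for word in query.lower().split())

# The same made-up book, described by two different catalog records.
poor_record = {"title": "Example Cardiology Textbook"}
rich_record = {
    "title": "Example Cardiology Textbook",
    "subjects": ["Cardiology", "Heart Diseases"],
    "chapter_titles": ["Atrial Fibrillation", "Heart Failure"],
}

print(matches(poor_record, "atrial fibrillation"))  # False
print(matches(rich_record, "atrial fibrillation"))  # True
```

Both records describe the same book with the same perfect chapter. Only the record that mentions the chapter gets found.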

Why am I going on a missing metadata rant?

As more and more things are online and library collections are not browsable on websites, the metadata of these resources becomes essential to their discoverability and usability. If someone does a search and an item can’t be found, it won’t get used. If it doesn’t get used, it gets dropped. Usage stats are paramount to getting resources renewed. Sucky usage stats, and we drop a product like a hot potato. I don’t have the budget to babysit a potentially good tool that isn’t getting used. I have a long wish list of items to take its place in the budget.

We also need the metadata to improve because not everyone uses a library. A lot of researchers use PubMed, Google Scholar, medRxiv, data repositories, and video libraries for their scholarly purposes. NLM is great for published articles, but has fallen behind in the non-print arena. Publishers and other companies are creating their own metadata “standards” which are far from standard and agreed upon. Due to poor metadata and lack of leadership in metadata standards, we risk having a generation of information lost in a quagmire of half-done metadata, never to be retrieved.
