Chris Hostetter is Senior Staff Engineer at Lucid Imagination, a member of the Apache Software Foundation, and serves as a committer on the Apache Lucene/Solr Projects. Prior to joining Lucid Imagination in 2010 to work full time on Solr development, he spent 11 years as a Principal Software Engineer for CNET Networks thinking about searching “structured data” that was never as structured as it should have been.

Solr Powered ISFDB – Part #6: Pseudonyms

11.23.2011
This is Part 6 in a series of 11 (so far) articles by Chris Hostetter in 2011 on Indexing and Searching the ISFDB.org data using Solr.

When we left last time, I had some decent modeling of Titles and Authors in distinct documents, but Pseudonyms were being treated as distinct Authors. Today I set out to deal with that.

(If you are interested in following along at home, you can check out the code from GitHub. I’m starting at the blog_5 tag, and as the article progresses I’ll link to specific commits where I changed things, leading up to the blog_6 tag containing the end result of this article.)

What’s a Pseudonym?

In the real world, a pseudonym is just an alternate name someone uses. In the ISFDB, pseudonyms are actually modeled as real Author Objects, with metadata indicating that they have a pseudonym relationship with some other Author Object.

This affects our existing Solr Document model in a couple of annoying ways:

  • In an Author centric document, there is no indication if that Author has any pseudonyms; nor any way to search for that Author by a pseudonym.
  • In an Author centric document, there is no indication if that Author is a pseudonym for another author; nor any way to search for that pseudonym by the Author’s real name.
  • In a Title centric document, there is no indication if any of the listed authors are pseudonyms for other (real) Authors; nor any way to search for that Title by an Author’s real name.

As mentioned back in Blog #3, document modeling is all about thinking flat, and denormalizing the data, and that’s how we’re going to try and deal with pseudonyms.

Indexing Pseudonyms For Real Authors

Tackling the first issue was relatively straightforward. It was basically no different than when we added a list of email addresses for each author using a nested entity (and since, once again, the list of pseudonyms is relatively small, we can cache the entire thing in memory using the CachedSqlEntityProcessor).

So now when we do a search for Authors named Asimov, we not only see “Isaac Asimov” in the list, we also see that there are 6 pseudonyms for him, and we have the ID for each if we want to look up one of those records or see which Titles that pseudonym wrote.
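For readers who think better in code than in DIH configuration, here is a minimal Python sketch of the idea behind a cached nested entity: read the small pseudonym mapping into memory once, then consult it for every author while building documents. This is not DIH code and not the configuration from the repository. The function name, the field names pseudonym_ids and pseudonym_names, and the numeric ids are all invented for illustration.

```python
from collections import defaultdict

# Hypothetical rows; the ids are invented for this sketch.
AUTHORS = [
    {"id": 1, "name": "Isaac Asimov"},
    {"id": 2, "name": "Paul French"},
    {"id": 3, "name": "Mary Blayney"},
]
# (real_author_id, pseudonym_author_id) pairs standing in for the pseudonym table.
PSEUDONYMS = [(1, 2)]


def build_author_docs(authors, pseudonyms):
    """Attach pseudonym ids and names to each real author's document."""
    names_by_id = {a["id"]: a["name"] for a in authors}

    # Read the small pseudonym table once and keep it in memory,
    # instead of issuing one lookup per author.
    cache = defaultdict(list)
    for real_id, pseudo_id in pseudonyms:
        cache[real_id].append(pseudo_id)

    docs = []
    for author in authors:
        pseudo_ids = cache.get(author["id"], [])
        docs.append({
            "id": author["id"],
            "name": author["name"],
            "pseudonym_ids": pseudo_ids,
            "pseudonym_names": [names_by_id[p] for p in pseudo_ids],
        })
    return docs


docs = build_author_docs(AUTHORS, PSEUDONYMS)
```

Authors without pseudonyms simply get empty lists, which matches the "thinking flat" approach: every Author document has the same shape, and multi-valued fields carry the relationship.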

Of course to really make this useful, we also want a simple “names” field for Author docs, that lists all of the different names that (real) Author is known by. <copyField /> makes this trivial, so now a search for authors named “French” will not only return the alias “Paul French” but also the real author “Isaac Asimov”.
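To make the effect of the catch-all “names” field concrete, here is a small Python sketch that mimics what a copy into one multi-valued field buys you. It is only an analogy: the helper functions are hypothetical, the pseudonym_names field name is an assumption, and the word matching is far cruder than what Solr's analyzers do.

```python
def add_names_field(doc):
    """Copy the primary name and every pseudonym name into one "names" list."""
    doc = dict(doc)  # leave the caller's dict untouched
    doc["names"] = [doc["name"]] + list(doc.get("pseudonym_names", []))
    return doc


def search_names(docs, word):
    """Naive case-insensitive word match against the "names" field."""
    word = word.lower()
    return [d["name"] for d in docs
            if any(word in n.lower().split() for n in d["names"])]


docs = [
    add_names_field({"name": "Isaac Asimov", "pseudonym_names": ["Paul French"]}),
    add_names_field({"name": "Paul French"}),
    add_names_field({"name": "Mary Blayney"}),
]
matches = search_names(docs, "French")
```

Because the real author's document also carries the alias in its “names” list, a single-field search finds both the alias record and the real author.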

Indexing Real Names For Pseudonym Authors

My approach for adding the “real” names (and ids) to pseudonym Author documents was basically the same as adding pseudonym names/ids to “real” author documents. I would like to say there was an easy way to tweak and reuse the previously cached nested entity in DIH to do a reverse lookup, but I certainly couldn’t find one. Now when searching for authors named “Isaac Asimov” we not only get the “real” Isaac, but also his various pseudonyms. If we want to exclude pseudonyms from an author search, we can add a simple filter query: fq=-real_author_id:[* TO *].
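As a rough sketch of how a client might apply that filter, the Python below builds a select URL with and without the pseudonym exclusion. The endpoint URL and the helper name are hypothetical, and querying the “names” field directly is an assumption about how you would phrase the search; only the fq value and the doc_type and real_author_id field names come from this series.

```python
from urllib.parse import urlencode

# Hypothetical Solr endpoint; adjust host, port, and core for your setup.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def author_search_url(name, exclude_pseudonyms=False):
    """Build a select URL for an author name search."""
    params = [("q", 'names:"%s"' % name), ("fq", "doc_type:AUTHOR")]
    if exclude_pseudonyms:
        # A purely negative range query: drop any doc that has a real_author_id.
        params.append(("fq", "-real_author_id:[* TO *]"))
    return SOLR_SELECT_URL + "?" + urlencode(params)


url = author_search_url("Isaac Asimov", exclude_pseudonyms=True)
```

Keeping the exclusion in its own fq parameter, rather than folding it into q, means it does not affect scoring and can be cached by Solr independently of the main query.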

I wanted to make that pseudonym filtering easier to do (and easier to facet on), by adding an “is_pseudonym” field. One way to do this might have been to use the TemplateTransformer on my nested “pseudonym_real_author” entity, but I was pretty sure that would have only set it for Authors that had a mapping to that nested entity; I want the boolean to be set for all Author docs. So instead I used my first ScriptTransformer to set the value of the field.

My first attempt didn’t work the way I expected at all. For every author, the row never contained a value for “real_author_id” when my script was executed, so is_pseudonym was always false. As near as I can tell, what seems to be happening is that since the Transformer was specified on the top level “author” entity, the script was being executed as soon as the “row” got populated with data from the top level “query”, before the nested entity had added its fields. (Disclaimer: I didn’t dig into the code to verify this, but reading the wiki again, it makes sense.) I couldn’t figure out an easy way around this, so for now I’ve ripped it out and will look into it more later.
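The suspected ordering problem can be modeled in a few lines of Python. This is a sketch of the hypothesis only, not of verified DIH internals: both functions are hypothetical stand-ins, one for the script transform on the top level entity and one for the nested entity that contributes real_author_id. The ids are the ones that appear for “J. D. Robb” and “Nora Roberts” later in this article.

```python
def set_is_pseudonym(row):
    """Row-level transform: flag the row if it carries a real_author_id."""
    row["is_pseudonym"] = "real_author_id" in row
    return row


def add_real_author(row, real_author_by_id):
    """Stand-in for the nested entity that adds real_author_id when one exists."""
    real_id = real_author_by_id.get(row["id"])
    if real_id is not None:
        row["real_author_id"] = real_id
    return row


REAL_AUTHOR_BY_ID = {2857: 4853}  # J. D. Robb -> Nora Roberts

# Suspected actual order: transform runs first, nested entity second.
early = add_real_author(set_is_pseudonym({"id": 2857}), REAL_AUTHOR_BY_ID)

# Intended order: nested entity first, transform second.
late = set_is_pseudonym(add_real_author({"id": 2857}, REAL_AUTHOR_BY_ID))
```

In the first ordering the flag is computed before real_author_id exists, so it stays false even though the finished row ends up with a real_author_id; in the second ordering the flag comes out true.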

Digression: Document Modeling Choices

A quick digression before we move on to adding pseudonym info to Title documents: I want to point out that I made a conscious Document Modeling Decision in the previous section, where I decided that “Pseudonym Authors” would still be indexed just like any other author; they would just have a few extra fields. In particular, they still have a “doc_type” of “AUTHOR”. Another possible choice I could have made would be to have introduced a new “PSEUDO_AUTHOR” doc_type. I can’t really explain why I made the choice I did, it just felt more right given the vague notions I have about how I want to use this index. I want to be able to easily tell when an Author is really a pseudonym for another Author, but I don’t really need/want to treat those pseudonym documents as second class citizens. Maybe down the road I’ll run into a particular use case that will change my mind, but for now it made sense to continue to treat them the same as regular authors.

Indexing Real Names For Titles

My first attempt at indexing the real name/id for each author of a Title was basically the same as it was for the Author documents, just adding a nested entity using the pseudonyms table. The problem with this approach was that since it only added fields for authors that were pseudonyms, the list of “real_author_names” and “real_author_ids” in each title would be shorter than the list of credited authors when some were “real” and some were pseudonyms. For example, “The Lost” has four credited Authors, but one of them (“J. D. Robb”) is a pseudonym (for “Nora Roberts”). This is how those fields looked in the results…

  <arr name="author_ids">
    <str>2857</str>
    <str>36103</str>
    <str>136275</str>
    <str>35293</str>
  </arr>
  <arr name="author_names">
    <str>J. D. Robb</str>
    <str>Mary Blayney</str>
    <str>Patricia Gaffney</str>
    <str>Ruth Ryan Langan</str>
  </arr>
  <arr name="real_author_ids">
    <str>4853</str>
  </arr>
  <arr name="real_author_names">
    <str>Nora Roberts</str>
  </arr>

Can’t really tell who’s who there, can we?

So to improve on this, I removed the special sub-entity for pseudonym relationships, and instead I modified the existing “author” sub-entity to do a LEFT JOIN on the pseudonym table to populate the same fields. (Since LEFT JOIN is really a DB concept, and not anything special to Solr or DIH, I’m not going to bother explaining it here, but you can read about it online). So now those same fields look like…

  <arr name="author_ids">
    <str>2857</str>
    <str>36103</str>
    <str>136275</str>
    <str>35293</str>
  </arr>
  <arr name="author_names">
    <str>J. D. Robb</str>
    <str>Mary Blayney</str>
    <str>Patricia Gaffney</str>
    <str>Ruth Ryan Langan</str>
  </arr>
  <arr name="real_author_ids">
    <str>4853</str>
    <str>36103</str>
    <str>136275</str>
    <str>35293</str>
  </arr>
  <arr name="real_author_names">
    <str>Nora Roberts</str>
    <str>Mary Blayney</str>
    <str>Patricia Gaffney</str>
    <str>Ruth Ryan Langan</str>
  </arr>

Which makes them much more useful.
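If you want to see the difference between the two attempts without setting up the ISFDB database, here is a self-contained sketch using Python's built-in sqlite3 module. The table and column names are invented for illustration and are not the real ISFDB schema, and using COALESCE to fall back to the credited author is one plausible way to fill the gaps, not necessarily the query in the repository. The author ids and names are the ones shown in the results above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified schema; not the real ISFDB tables.
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE title_authors (title TEXT, author_id INTEGER);
    CREATE TABLE pseudonyms (pseudonym_id INTEGER, real_author_id INTEGER);

    INSERT INTO authors VALUES
        (2857, 'J. D. Robb'), (4853, 'Nora Roberts'), (36103, 'Mary Blayney'),
        (136275, 'Patricia Gaffney'), (35293, 'Ruth Ryan Langan');
    INSERT INTO title_authors VALUES
        ('The Lost', 2857), ('The Lost', 36103),
        ('The Lost', 136275), ('The Lost', 35293);
    INSERT INTO pseudonyms VALUES (2857, 4853);
""")

# First attempt: only credited authors that are pseudonyms produce a row.
inner_rows = conn.execute("""
    SELECT a.name, r.name
    FROM title_authors ta
    JOIN authors a ON a.id = ta.author_id
    JOIN pseudonyms p ON p.pseudonym_id = a.id
    JOIN authors r ON r.id = p.real_author_id
    WHERE ta.title = ?
    ORDER BY a.name
""", ("The Lost",)).fetchall()

# Second attempt: LEFT JOIN keeps every credited author, and COALESCE
# falls back to the credited author when there is no pseudonym row.
left_rows = conn.execute("""
    SELECT a.name, COALESCE(r.id, a.id), COALESCE(r.name, a.name)
    FROM title_authors ta
    JOIN authors a ON a.id = ta.author_id
    LEFT JOIN pseudonyms p ON p.pseudonym_id = a.id
    LEFT JOIN authors r ON r.id = p.real_author_id
    WHERE ta.title = ?
    ORDER BY a.name
""", ("The Lost",)).fetchall()
```

The inner join returns a single row for “J. D. Robb”, which is exactly the lopsided result from the first attempt. The LEFT JOIN version returns four rows, one per credited author, so the real-author lists line up position by position with the credited-author lists.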

Conclusion (For Now)

OK, that’s going to wrap up this latest installment with the blog_6 tag. The index is in pretty good shape: we can now do some pretty interesting queries on Titles and Authors using either the real names of authors or the pseudonyms they use. I think next week I may really get my hands dirty and do some UI work so I can show off some screenshots.


Source:  http://www.lucidimagination.com/blog/2011/02/27/solr-powered-isfdb-part-6/

Published at DZone with permission of its author, Chris Hostetter.

