Indexing rich documents with Rails, Sunspot, Solr, Sunspot Cell and Carrierwave (cookbook-style)

Solr / Sunspot installation and configuration is easy when you just need to index and search your model data. I won’t go into details about configuring basic Sunspot / Solr for Rails here, but get that basic setup working first, because the steps below edit the solr/conf/schema.xml file that it creates. For a great primer on Sunspot and basic installation instructions, I recommend Ryan Bates’ Railscast.

But configuring Solr to index rich documents (e.g. PDFs, Word documents) via Sunspot Cell is poorly documented, and it doesn’t need to be. Once you configure things correctly, it does work. Learn from my pain…

Challenge 1: having the right gems in the right order:

Rumor has it that the order of the gems is important. The following gems work for me:

gem 'sunspot', :git => "git://github.com/sunspot/sunspot.git"
gem 'sunspot_solr', :git => "git://github.com/sunspot/sunspot.git"
gem 'sunspot_rails', :git => "git://github.com/sunspot/sunspot.git", :require => "sunspot_rails"
gem 'sunspot_cell', :git => 'git://github.com/zheileman/sunspot_cell.git'
gem 'sunspot_cell_jars'
gem 'progress_bar'

Note that there are several ‘competing’ forks of Sunspot Cell on Github. ZHeileman’s appears (at the time of writing) to be the most current one.

After you run ‘bundle’, don’t forget to run the following command to install the Solr jarfiles in your tree.

rails g sunspot_cell_jars:install

Challenge 2: reconfiguring the Solr schema.xml file:

Edit solr/conf/schema.xml as follows.

Add the following two lines just before the </fields> terminator. For me that’s around line 250.

<dynamicField name="*_attachment" stored="true" type="text" multiValued="true" indexed="true"/>
<dynamicField name="ignored_*" type="ignored"/>

Add the following line just before the </types> terminator. For me that’s around line 97.

<fieldType name="ignored" stored="false" indexed="false" multiValued="true" class="solr.StrField" />
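If you would rather script these schema edits than make them by hand, something like the following sketch could work. It is plain Ruby and not part of Sunspot or Sunspot Cell; `patch_schema` is a hypothetical helper name, and it assumes the schema contains literal `</fields>` and `</types>` closing tags as described above.

```ruby
# Hypothetical helper (not part of Sunspot or Sunspot Cell): adds the
# schema.xml lines from this post if they are not already present.

FIELD_LINES = <<~XML
  <dynamicField name="*_attachment" stored="true" type="text" multiValued="true" indexed="true"/>
  <dynamicField name="ignored_*" type="ignored"/>
XML

TYPE_LINE = <<~XML
  <fieldType name="ignored" stored="false" indexed="false" multiValued="true" class="solr.StrField" />
XML

def patch_schema(xml)
  patched = xml.dup
  # Block form of sub, so the inserted text is taken literally.
  # The presence checks are deliberately simple string matches.
  unless patched.include?('<dynamicField name="*_attachment"')
    patched = patched.sub("</fields>") { "#{FIELD_LINES}</fields>" }
  end
  unless patched.include?('<fieldType name="ignored"')
    patched = patched.sub("</types>") { "#{TYPE_LINE}</types>" }
  end
  patched
end

# Try it on a tiny stand-in schema rather than a real file.
sample = "<schema>\n<types>\n</types>\n<fields>\n</fields>\n</schema>\n"
puts patch_schema(sample)
```

To apply it to the real file you could read solr/conf/schema.xml with `File.read`, pass the contents through `patch_schema`, and write the result back with `File.write`. Keep a backup copy of the original first.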

Now restart Solr so that it picks up the schema changes: “rake sunspot:solr:stop && rake sunspot:solr:start”

Challenge 3: configuring your attachment model:

My Carrierwave file-upload model is Attachment (app/models/attachment.rb).

class Attachment < ActiveRecord::Base
  # The user who uploaded the file owns it
  belongs_to :user
  # Attachment is polymorphic because a variety of things (project, team)
  # are "attachable", i.e. accept file uploads
  belongs_to :attachable, :polymorphic => true

  # If you recall attachment_fu, this is kind of like "has_attachment":
  mount_uploader :upload, UploadUploader
  before_save :update_upload_attributes

  # Sunspot indexing configuration
  searchable do
    # Regular fields that I want indexed
    integer :id
    time :created_at
    time :updated_at
    text :file_name

    # For Sunspot Cell. The 'attachment' directive instructs
    # Cell how to get the binary data. My understanding is that
    # this *must* end in _attachment
    attachment :document_attachment
  end

  # Goes hand-in-hand with the item above. Now, this is important:
  # the return value from this method is NOT the binary data itself,
  # but rather the full path to the file on disk. Cell will use this
  # to locate the file and index it.
  def document_attachment
    "#{Rails.root}/public/#{upload.url}"
  end

  private

  # https://github.com/jnicklas/carrierwave/wiki/How-to%3A-Store-the-uploaded-file-size-and-content-type
  def update_upload_attributes
    if upload.present? && upload_changed?
      self.content_type = upload.file.content_type
      self.file_size = upload.file.size
      self.file_name = File.basename(upload.url)
    end
  end
end
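The path returned by `document_attachment` is built by string interpolation. If the Carrierwave URL starts with a slash (an assumption here, based on the usual behaviour of local file storage), the result contains a doubled slash. That is harmless on most systems but easy to avoid. A minimal sketch in plain Ruby, where `attachment_path` is a hypothetical helper name and the paths are invented for illustration:

```ruby
# Hypothetical helper: builds the on-disk path for a file served from public/.
# root stands in for Rails.root, url for upload.url.
def attachment_path(root, url)
  File.join(root.to_s, "public", url)
end

root = "/srv/myapp"                       # invented example value
url  = "/uploads/test_word_document.doc"  # invented example value

puts "#{root}/public/#{url}"     # interpolation, as in the model above
puts attachment_path(root, url)  # File.join collapses the doubled slash
```

Carrierwave uploaders also expose a `path` method for locally stored files, which may be a simpler choice. That variant has not been checked against Sunspot Cell here.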

Challenge 4: reindexing your content and testing the search

With all that configuration complete, now you should be able to reindex your existing content (which includes your binary files) using the following command:

% rake sunspot:reindex
[############################################] [570/570] [100.00%] [00:02] [00:00] [232.66/s]

% rails console
> res=Attachment.solr_search do |s|
>   s.keywords 'timed'
> end
SOLR Request (5.4ms) [ path=#<RSolr::Client:0x007f8c970ce880> parameters={data: 
fq=type%3AAttachment&q=timed&fl=%2A+score&qf=file_name_text+document_attachment_attachment
&defType=dismax&start=0&rows=30, method: post, params: {:wt=>:ruby}, query: 
wt=ruby, headers: {"Content-Type"=>"application/x-www-form-urlencoded; 
charset=UTF-8"}, path: select, uri: http://localhost:8982/solr/select?wt=ruby, 
open_timeout: , read_timeout: } ]

=> <Sunspot::Search:{:fq=>["type:Attachment"], :q=>"timed", :fl=>"* score", 
:qf=>"file_name_text document_attachment_attachment", :defType=>"dismax", 
:start=>0, :rows=>30}>

> res.results
Attachment Load (0.5ms) SELECT "attachments".* FROM "attachments" WHERE 
"attachments"."id" IN (19)

=> [#<Attachment id: 19, attachable_id: nil, attachable_type: nil, upload: 
"test_word_document.doc", description: nil, file_name: "test_word_document.doc", 
content_type: "application/msword", file_size: 23040, created_at: "2012-09-15 
00:25:41", updated_at: "2012-09-15 00:25:41", user_id: 101>]
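A quick way to confirm that the attachment contents are part of the search is to decode the request body from the log above and look at the `qf` parameter, which lists the fields the dismax query searches. The sketch below uses only Ruby's standard `uri` library; the `data` string is copied from the log line, and the variable names are just for illustration.

```ruby
require "uri"

# Form-encoded request body copied from the SOLR Request log line above.
data = "fq=type%3AAttachment&q=timed&fl=%2A+score" \
       "&qf=file_name_text+document_attachment_attachment" \
       "&defType=dismax&start=0&rows=30"

# Decode into a Hash of parameter name => value.
params = URI.decode_www_form(data).to_h

# qf is a space-separated list of the fields that the query searches.
query_fields = params["qf"].split

puts query_fields
if query_fields.any? { |f| f.end_with?("_attachment") }
  puts "attachment text is searched"
end
```

If the `_attachment` field is missing from `qf`, the `attachment` line in the `searchable` block is the first thing to check.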


8 thoughts on “Indexing rich documents with Rails, Sunspot, Solr, Sunspot Cell and Carrierwave (cookbook-style)”

  1. Pingback: Indexing rich documents with Rails, Sunspot, Solr, Sunspot Cell and … | Solar Flare 2012

  2. Were you able to get this to work in your production environment? I’ve been able to do this in development without any problems, but I keep getting a `org.apache.solr.common.SolrException: lazy loading error` when I try to save a new record with an attachment in production. I have a feeling it’s a result of my production solr setup, but I’m at a loss for what to do to fix it…

    • I do have it working in production without the error you describe.

      The one problem I’ve found in prod is that Attachments aren’t indexed automatically at creation time, like other content is. I have to cron a periodic “reindex” job to get Attachments indexed; I haven’t the faintest idea why.

  3. For me, this didn’t work after starting from a “blank slate.” For example, the instructions in this post don’t have you run the sunspot generator. I spent some time trying to figure out why I didn’t have a solr/conf/schema.xml.

    I recommend installing plain vanilla Sunspot following the Railscast episode first. Then layer on the suggestions here. About to give that part a shot now after getting basic Sunspot up and running.

    I _really_ wish the Sunspot guys would fix this mess!

      • No need to apologize! I’m not sure where I’d be without this post, so thanks very much for it.

        Trying to get something into production, all of those git references in the Gemfile make me a little nervous. It’d be my luck that something would change in master on one of my deploys.

        I’m also a little leery about needing to schedule a task to periodically reindex everything.

        I’m trying to figure out which combination of gems/versions I can use to *not* require the cron scheduling for grabbing new creates, and just having it work in general. This is very frustrating!

    • I also did this the same way initially. I viewed the Railscast and put all of the required gems in my Gemfile, then came here and continued to pile them on before installing any. I installed everything at once, and doing it that way led me to many errors. I also wasn’t seeing a “conf” subdirectory in my app’s “solr” directory.

      I then realized that you really have to follow the Railscast – installing ONLY the gems it mentioned (I found that I needed sunspot_solr AND sunspot_rails), starting the Solr instance, and trying out the reindex command – before you install any of these gems in this article. Everything worked as expected, and I finally saw a “conf” directory show up in my project’s “solr” directory.

      After that was all working, I then stopped my Solr instance and my app server instance and followed the instructions here. I added the additional gems from this article into my Gemfile, did a “bundle install”, and continued to follow this article step-by-step.

      So, you must first follow the linked Railscast at the beginning of this article before trying anything here. You can’t just combine the two guides and do it all at once.
