#98 new
a.chaplin (at nittygritty)

Solr Cell attachment indexing patch

Reported by a.chaplin (at nittygritty) | March 18th, 2010 @ 08:20 AM | in Wishlist

Comments and changes to this ticket

  • a.chaplin (at nittygritty)

    a.chaplin (at nittygritty) March 18th, 2010 @ 08:41 AM

    • Assigned user set to “mat”
    • Title changed from “Solr Cell ” to “Solr Cell attachment indexing patch”

    Hi,

    Attached is a patch that implements SolrCell rich text indexing in sunspot.

    It adds an "attachment" field to the dsl.

    e.g.

    Sunspot.setup(Post) do
      attachment :rich_document
    end

    The filename in rich_document is then sent to Solr, which reads the file and indexes it.

    On searching, the attachment fields are added as full text fields before the search query is sent to Solr.

    There are some outstanding issues:

    At the moment the filename is passed to Solr, so the Solr server needs to be able to see the same paths as the client.

    Only one search term hit per document is returned from Solr, and I'm not sure why that is. There are a couple of integration tests that fail because of this.

    Solr will only index PDFs in its current configuration. Adding more parsers will add more document types.

    The schema.xml we're using comes from a previous version of sunspot and is in a different format from the new one. We need some advice on bringing our schema.xml (attached) up to date.

    I've submitted this patch as a starting point for this implementation and will submit a proper clean patch later. I just need some help in getting there.

    Regards
    Allan

  • mat

    mat March 19th, 2010 @ 02:58 PM

    • Tag cleared.
    • Milestone set to Feature Requests
  • mat

    mat March 22nd, 2010 @ 06:33 PM

    Pulled down your patch, and it looks like the problem is that you're missing the Solr Cell jar and its dependencies. To get them, unpack the standard Solr 1.4 distribution, and copy dist/apache-solr-cell-1.4.0.jar and contrib/extraction/*.jar into the solr/solr/lib directory in Sunspot.

    But here's the problem -- the Solr Cell dependencies are 31MB! That's a lot, and it's really more than I'd like to add into something that's bundled with the gem. I might need to give some thought to a better way to distribute a prepackaged Solr instance (something to the effect of downloading and installing the various JARs only when sunspot-solr is first run; that way at least in production, you don't have a bunch of useless bloat).

    Anyway, that's for me to worry about -- for now, by including those dependencies, you should be able to get your specs passing and keep moving on the patch. Keep me posted on how things go!
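A minimal sketch of the copy step described above, in Ruby. The jar locations are taken verbatim from the comment; the method name and both directory arguments are hypothetical and depend on where the Solr 1.4 distribution was unpacked and where Sunspot is checked out.

```ruby
require 'fileutils'

# Copy the Solr Cell jar and its extraction dependencies from an unpacked
# Solr 1.4 distribution into Sunspot's bundled Solr lib directory.
# Both arguments are assumptions about the local layout.
def install_solr_cell_jars(solr_dist_dir, sunspot_dir)
  lib_dir = File.join(sunspot_dir, 'solr', 'solr', 'lib')
  FileUtils.mkdir_p(lib_dir)

  jars = [File.join(solr_dist_dir, 'dist', 'apache-solr-cell-1.4.0.jar')] +
         Dir.glob(File.join(solr_dist_dir, 'contrib', 'extraction', '*.jar')).sort

  # Skip anything that is missing and return the paths of the copies.
  jars.select { |jar| File.file?(jar) }.map do |jar|
    FileUtils.cp(jar, lib_dir)
    File.join(lib_dir, File.basename(jar))
  end
end
```

Calling it with the unpacked distribution directory and the Sunspot checkout returns the list of jars now in solr/solr/lib, which makes it easy to check that nothing was missed.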

  • a.chaplin (at nittygritty)

    a.chaplin (at nittygritty) March 24th, 2010 @ 05:15 AM

    Thanks for your help and support Mat.

    I've dropped the jars in and that sorts out the indexing of documents other than PDF.

    I'm only getting one search result for each document (even if the search term appears more than once in the document). Any idea off the top of your head why this might be?

    I'm not going to have a chance to look at this until Friday, when I'll look at streaming the document contents to Solr and then take a fresh look at the hit results for documents containing multiple instances of the search term.

  • mat

    mat March 24th, 2010 @ 08:45 AM

    You're talking about highlighting, right? By default, Solr only returns one highlighted snippet per document. But that can easily be changed -- just use the :max_snippets option to the highlight method (I believe 0 means unlimited): http://outoftime.github.com/sunspot/docs/classes/Sunspot/DSL/Fullte...
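For reference, the one-snippet default comes from Solr's own highlighting parameters, where hl.snippets defaults to 1. The sketch below only builds a raw query string to show those parameters; the helper name and the field name in the usage note are hypothetical, and the exact parameters Sunspot generates for :max_snippets may differ (it may, for example, scope them per field).

```ruby
require 'uri'

# Build the query string for a highlighted Solr select request.
# `field` is a hypothetical Solr field name; Sunspot derives the real
# indexed field names itself.
def highlight_query(keywords, field, snippets)
  URI.encode_www_form(
    'q' => keywords,
    'hl' => 'true',           # turn highlighting on
    'hl.fl' => field,         # field(s) to highlight
    'hl.snippets' => snippets # maximum snippets per field (Solr default: 1)
  )
end
```

For example, highlight_query('solr cell', 'attachment_text', 3) asks for up to three snippets from a field called attachment_text.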

  • a.chaplin (at nittygritty)

    a.chaplin (at nittygritty) March 29th, 2010 @ 05:27 AM

    Here is the second cut of the patch for the rich document extraction. I think it's complete now.

    It now streams the document content as the message body.
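As a rough illustration of what streaming the content as the message body can look like, here is a sketch that builds a POST to Solr Cell's extracting handler (/update/extract) with the raw file bytes as the body. This is not the code from the patch: the helper name, the /solr path prefix, and the host and port in the usage comment are assumptions.

```ruby
require 'net/http'
require 'uri'

# Build (but do not send) a POST to Solr Cell's extracting handler with the
# raw document bytes as the request body. The "/solr" prefix is an assumption
# about how the Solr instance is mounted.
def build_extract_request(document_bytes, document_id, content_type = 'application/octet-stream')
  query = URI.encode_www_form('literal.id' => document_id, 'commit' => 'true')
  request = Net::HTTP::Post.new("/solr/update/extract?#{query}")
  request['Content-Type'] = content_type
  request.body = document_bytes
  request
end

# Usage, assuming a Solr instance on localhost:8983 (hypothetical):
#   request = build_extract_request(File.binread('report.pdf'), 'Document 1', 'application/pdf')
#   Net::HTTP.start('localhost', 8983) { |http| http.request(request) }
```

Sending the bytes this way removes the need for the Solr server to see the same file paths as the client, which was one of the outstanding issues with the first cut.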

    I've removed our debugging code.

    Aside from two tests that were already failing before my patch was applied, all tests pass.

    I've added integration tests for attachment content extraction and attachment highlighting.

    Let me know what you think.

    Regards

    Allan

  • mat

    mat March 30th, 2010 @ 11:56 AM

    • Tag set to v1.2
  • Isaac Kearse

    Isaac Kearse May 24th, 2010 @ 08:22 PM

    Hey Guys,

    FYI I merged this patch into my fork on the "cell" branch here:
    http://github.com/isaac/sunspot/commit/7b127a9536c0182572a7de37a0d4...

    Planning to test it out thoroughly over the next days, will let you know how it goes.

    Let me know if there are any changes that need to be made before this gets merged into the master repo.

    Cheers,
    Isaac

  • Nick Zadrozny

    Nick Zadrozny September 19th, 2010 @ 10:50 PM

    • Milestone order changed from “0” to “0”

    @Isaac: How has that branch been working out for you? Would love to see your branch rebased against the latest Sunspot master.

  • Isaac Kearse

    Isaac Kearse September 20th, 2010 @ 09:37 PM

    Done. Just tried to post details to the google group but I think my post is in moderation here: http://groups.google.com/group/ruby-sunspot/browse_thread/thread/f3...

    I'll post it here as well just in case it didn't go through:

    Hey Guys,

    My cell branch was a bit of a mess so I have cherry-picked the necessary commits and reapplied them on top of outoftime/sunspot master. You can find all the changes on my master branch here: http://github.com/isaac/sunspot

    All the attachment specs now pass for me. Note that you need to follow Mat's instructions about the Solr Cell .jar and dependencies here: http://outoftime.lighthouseapp.com/projects/20339/tickets/98-solr-cell

    Cheers,
    Isaac

  • Daniel Neighman

    Daniel Neighman December 23rd, 2010 @ 08:52 PM

    Hey guys,

    Just wondering where this is at... Is this a usable patch at the moment or should I wait till it gets brought into core?

  • mat

    mat December 27th, 2010 @ 02:40 PM

    • Milestone changed from Feature Requests to Wishlist
    • Tag changed from v1.2 to feature
    • Milestone order changed from “75” to “0”
  • Luis Pollo

    Luis Pollo August 24th, 2012 @ 12:28 PM

    Hi Mat,

    About your reply in http://outoftime.lighthouseapp.com/projects/20339/tickets/98-solr-c..., I am currently trying to show highlights for an attachment field, but Hit.highlight(:field_name) returns nil.

    My indexing setup looks like this:

    class Document < ActiveRecord::Base
      searchable do
        text :path
        text :extension
        # attached file contents for full-text search within documents
        attachment :attached_file
      end
      # ...
    end

    And my search like this:

    @search = Document.search do
      fulltext params[:q] do
        highlight :attached_file, :max_snippets => 0
      end
    end

    Do you have any idea why that happens?

    Thanks!
