Comments and changes to this ticket
- Assigned user set to mat
- Title changed from Solr Cell to Solr Cell attachment indexing patch
Attached is a patch that implements Solr Cell rich text indexing in Sunspot.
It adds an "attachment" field to the DSL.
The filename of the rich document is then sent to Solr, which reads the file and indexes it.
On searching, the attachment fields are added as full-text fields before the search query is sent to Solr.
There are some outstanding issues:
- At the moment the filename is passed to Solr, so the Solr server needs to be able to see the same paths as the client.
- Only one search term hit per document is returned from Solr; I'm not sure why that is. There are a couple of integration tests that fail because of this.
- Solr will only index PDFs in its current configuration. Adding more parsers will add more document types.
- The schema.xml we're using comes from a previous version of Sunspot and differs in format from the new one. We need some advice on bringing our schema.xml (attached) up to date.
I've submitted this patch as a starting point for this implementation and will submit a proper clean patch later; I just need some help in getting there.
Pulled down your patch, and it looks like the problem is that you're missing the Solr Cell jar and its dependencies. To get them, unpack the standard Solr 1.4 distribution, and copy them into the solr/solr/lib directory in Sunspot.
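As a rough sketch of that copy step, the Ruby below gathers JAR files from an unpacked Solr distribution and copies them into a target lib directory. The helper name copy_solr_cell_jars and the glob patterns (contrib/extraction/lib/*.jar and dist/*solr-cell*.jar) are assumptions for illustration, not paths confirmed in this ticket; adjust them to match the layout of your Solr 1.4 download.

```ruby
require 'fileutils'

# Assumed locations of the Solr Cell JAR and its dependencies inside an
# unpacked Solr distribution. These patterns are a guess, not taken from
# the ticket; change them if your distribution is laid out differently.
JAR_GLOBS = ['contrib/extraction/lib/*.jar', 'dist/*solr-cell*.jar'].freeze

# Hypothetical helper: copies every JAR matched by the globs from
# solr_dist_dir into lib_dir and returns the sorted file names copied.
def copy_solr_cell_jars(solr_dist_dir, lib_dir, globs = JAR_GLOBS)
  FileUtils.mkdir_p(lib_dir)
  jars = globs.flat_map { |glob| Dir.glob(File.join(solr_dist_dir, glob)) }
  jars.each { |jar| FileUtils.cp(jar, lib_dir) }
  jars.map { |jar| File.basename(jar) }.sort
end
```

It could be called as copy_solr_cell_jars('/path/to/unpacked/solr', 'solr/solr/lib'), where the first argument is wherever the distribution was unpacked.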
But here's the problem -- the Solr Cell dependencies are 31MB! That's a lot, and it's really more than I'd like to add into something that's bundled with the gem. I might need to give some thought to a better way to distribute a prepackaged Solr instance (something to the effect of downloading and installing the various JARs only when sunspot-solr is first run; that way at least in production, you don't have a bunch of useless bloat).
Anyway, that's for me to worry about -- for now, by including those dependencies, you should be able to get your specs passing and keep moving on the patch. Keep me posted on how things go!
Thanks for your help and support, Mat.
I've dropped the jars in and that sorts out the indexing of documents other than PDF.
I'm only getting one search result for each document (even if the search term appears more than once in the document). Any idea off the top of your head why this might be?
I'm not going to have a chance to look at this until Friday, when I'll look at streaming the document contents to Solr and then take a fresh look at the hit results for documents containing multiple instances of the search term.
You're talking about highlighting, right? By default, Solr only returns one highlighted snippet per document. But that can easily be changed -- just use the :max_snippets option to the highlight method (I believe 0 means unlimited): http://outoftime.github.com/sunspot/docs/classes/Sunspot/DSL/Fullte...
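To make the snippet limit concrete, here is a minimal sketch of the request parameters involved, assuming :max_snippets is passed through to Solr's hl.snippets parameter. The helper highlight_params is hypothetical and is not Sunspot's implementation; hl, hl.fl and hl.snippets are standard Solr highlighting parameters.

```ruby
# Hypothetical helper, not part of Sunspot: builds Solr highlighting
# parameters for a list of fields. Assumes :max_snippets corresponds to
# Solr's hl.snippets, which Solr treats as 1 when it is omitted.
def highlight_params(fields, options = {})
  params = { 'hl' => 'on', 'hl.fl' => fields.map(&:to_s).join(',') }
  if options.key?(:max_snippets)
    params['hl.snippets'] = options[:max_snippets].to_s
  end
  params
end

# Ask for up to three snippets per document on one field.
highlight_params([:body], :max_snippets => 3)
```

Without a :max_snippets option the sketch sends no hl.snippets value at all, which is the case where Solr falls back to a single snippet per document.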
Here is the second cut of the patch for rich document extraction; I think it's complete now.
It now streams the document content as the message body.
I've removed our debugging code.
Aside from two tests that failed before my patch was applied, all tests pass.
I've added integration tests for attachment content extraction and attachment highlighting.
Let me know what you think.
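For readers following along, streaming the document as the message body might look roughly like the sketch below. It only builds the HTTP request and does not send it. The helper name build_extract_request, the /solr/update/extract path, and the use of literal.id to carry the document id are assumptions based on Solr Cell's usual conventions, not a description of what the patch does.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: builds (but does not send) a POST whose body is the
# raw file content, addressed to Solr's extracting request handler.
# The handler path and parameter names are assumptions for this sketch.
def build_extract_request(doc_id, content, content_type = 'application/octet-stream')
  query = URI.encode_www_form('literal.id' => doc_id, 'commit' => 'true')
  request = Net::HTTP::Post.new("/solr/update/extract?#{query}")
  request['Content-Type'] = content_type
  request.body = content
  request
end
```

Because the file bytes travel in the request itself, the Solr server no longer needs to see the same filesystem paths as the client. Sending the request would be a matter of passing it to Net::HTTP.start with whatever host and port your Solr instance listens on.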
FYI I merged this patch into my fork on the "cell" branch here:
Planning to test it out thoroughly over the next few days; will let you know how it goes.
Let me know if there are any changes that need to be made before this gets merged into the master repo.
Done. Just tried to post details to the Google group, but I think my post is in moderation here: http://groups.google.com/group/ruby-sunspot/browse_thread/thread/f3...
I'll post it here as well just in case it didn't go through:
My cell branch was a bit of a mess so I have cherry-picked the necessary commits and reapplied them on top of outoftime/sunspot master. You can find all the changes on my master branch here: http://github.com/isaac/sunspot
All the attachment specs now pass for me. Note that you need to follow the instructions from Mat about the Solr Cell .jar and dependencies here: http://outoftime.lighthouseapp.com/projects/20339/tickets/98-solr-cell
About your reply in http://outoftime.lighthouseapp.com/projects/20339/tickets/98-solr-c..., I am currently trying to show highlights for an attachment field, but Hit.highlight(:field_name) returns nil.
My indexing setup looks like this:
class Document < ActiveRecord::Base
  searchable do
    text :path
    text :extension
    # attached file contents for full-text search within documents
    attachment :attached_file
  end
  # ...
end
And my search like this:
@search = Document.search do
  fulltext params[:q] do
    highlight :attached_file, :max_snippets => 0
  end
end
Do you have any idea why that happens?