The Document sObject has an IsBodySearchable flag that controls whether its body can be full-text searched via SOSL.
ContentVersion has no such flag, but I know that it is searchable in certain cases. However, documentation on this is frustratingly hard to find.
Does anyone know what the limitations are and when ContentVersion records are or are not searchable?
Bonus question: assuming we have several hundred thousand documents in SF (approximately 2 TB), can global search and/or SOSL still work without timing out?
The only global limit I see anywhere is the following:
Searching document content supports multiple file types and has file size limits. The contents of documents that exceed the maximum sizes are not searched; however, the document fields are still searched. Only the first 1,000,000 characters of text are searched. Text beyond this limit is not included in the search.
I am unclear on whether this is a per-doc or global limit and what it means when text is not included in the search.
Does that mean that text is not part of the index? Or is it just that any given search will go through up to a million characters of text to try and find a match and time out if no match is found?
Attribution to: Greg Grinberg
Possible Suggestion/Solution #1
You can search the ContentVersion object like any other object. However, SOSL only searches the latest version of each document, as the documentation states: https://developer.salesforce.com/docs/atlas.en-us.api.meta/api/sforce_api_objects_contentversion.htm
SOQL queries on the ContentVersion object return all versions of the document. SOSL searches on the ContentVersion object return only the most recent version of the document.
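As a rough illustration of searching ContentVersion with SOSL from outside Salesforce, here is a minimal Python sketch that builds the SOSL statement and the REST search URL. The helper names, the chosen return fields, the example instance URL and the API version are assumptions made for this example, not something taken from the thread; authentication and the actual HTTP call are left out.

```python
from urllib.parse import quote

# Characters that SOSL treats as reserved inside a FIND clause.
SOSL_RESERVED = set('?&|!{}[]()^~*:\\"\'+-')


def escape_sosl(term: str) -> str:
    """Backslash-escape SOSL reserved characters in a user-supplied term."""
    return "".join("\\" + ch if ch in SOSL_RESERVED else ch for ch in term)


def build_content_search(term: str, limit: int = 50) -> str:
    """Build a SOSL statement that searches ContentVersion records.

    The returned fields are an assumption for this sketch; pick the ones you need.
    """
    return (
        f"FIND {{{escape_sosl(term)}}} IN ALL FIELDS "
        f"RETURNING ContentVersion(Id, Title, ContentDocumentId) LIMIT {limit}"
    )


def build_search_url(instance_url: str, api_version: str, sosl: str) -> str:
    """Build the REST search URL; instance_url and api_version are placeholders."""
    return f"{instance_url}/services/data/v{api_version}/search/?q={quote(sosl)}"


if __name__ == "__main__":
    sosl = build_content_search("Q3 report (final)", limit=10)
    print(sosl)
    print(build_search_url("https://example.my.salesforce.com", "58.0", sosl))
```

Because SOSL only returns the most recent version, a query like this should yield at most one ContentVersion row per document.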
Hundreds of thousands of records (2 TB) is a small dataset, so you should not need to worry about timeouts.
The limit you mention is per document, and it applies to the raw text (i.e. after the original document has been converted to plain text, without any formatting), so it is fairly high.
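To make the per-document reading concrete, here is a small Python sketch of how such a cutoff would behave. It assumes the limit works as a simple truncation of each document's extracted text before matching; that is an assumption for illustration, not a description of how Salesforce's indexer is actually implemented, and the function names are made up.

```python
MAX_SEARCHABLE_CHARS = 1_000_000  # limit quoted in the question, applied per document


def searchable_text(raw_text: str, limit: int = MAX_SEARCHABLE_CHARS) -> str:
    """Return the part of ONE document's extracted plain text that is searched.

    Assumption: anything past `limit` characters is simply dropped.
    """
    return raw_text[:limit]


def would_match(raw_text: str, term: str, limit: int = MAX_SEARCHABLE_CHARS) -> bool:
    """Check whether `term` occurs inside the searchable part of the text."""
    return term.lower() in searchable_text(raw_text, limit).lower()


if __name__ == "__main__":
    doc = "a" * 10 + "needle"
    # With a tiny limit of 10, the word sits entirely past the cutoff.
    print(would_match(doc, "needle", limit=10))
    # With a limit that covers the whole text, the word is found.
    print(would_match(doc, "needle", limit=16))
```

Under this reading, a term that appears only past the cutoff is never matched, while the same document can still be found through its fields or through earlier text.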
Another comment mentioned searching feeds. When you search feeds, only the feed text is searched (not the attachments). You can search feeds using SOSL, but it is not very convenient because of the output format. It is better to use the Chatter API: https://developer.salesforce.com/page/Chatter_API
Attribution to: Marc Brette
Possible Suggestion/Solution #2
You are hitting one of the main limitations of Salesforce: its search capabilities. I don't think there is an easy way to search versions of content. Also, search on a large dataset is likely to time out or to cut off the results at 2,000 records, if I remember correctly. There are other major limitations on what you can achieve with Apex and SOSL (e.g., you cannot access results beyond the first page).
The best way to get around this issue is to externalise the search.
I believe there are some apps on the AppExchange that do that, like KonaSearch for example.
If your needs are more specific, you could technically implement an external app on Heroku, fetch data via the APIs and store it in a search index like WebSolr or ElasticSearch. I've done it using WebSolr in the past and it worked very well, but I am afraid it's not a simple thing.
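To give a feel for the external-index idea, here is a toy in-memory inverted index in Python. It only stands in for a real engine such as WebSolr or ElasticSearch and is not how either works internally; the class name, the document ids and the sample text are hypothetical, and fetching the files from Salesforce is left out.

```python
import re
from collections import defaultdict


class TinyIndex:
    """Toy inverted index: maps each token to the ids of documents containing it."""

    def __init__(self):
        self._postings = defaultdict(set)  # token -> set of document ids

    @staticmethod
    def _tokens(text):
        # Lowercase and split on anything that is not a letter or digit.
        return re.findall(r"[a-z0-9]+", text.lower())

    def add(self, doc_id, title, body):
        """Index the title and extracted body text of one document."""
        for token in self._tokens(title + " " + body):
            self._postings[token].add(doc_id)

    def search(self, query):
        """Return sorted ids of documents that contain every query token."""
        tokens = self._tokens(query)
        if not tokens:
            return []
        hits = set.intersection(*(self._postings.get(t, set()) for t in tokens))
        return sorted(hits)


if __name__ == "__main__":
    index = TinyIndex()
    index.add("doc-1", "Q3 report", "revenue grew in the third quarter")
    index.add("doc-2", "Onboarding", "welcome to the third floor")
    print(index.search("third quarter"))
```

A real setup would also need to keep the index in sync as files are added, replaced or deleted, which is a large part of why this is not a simple thing.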
I hope it helps.
Attribution to: Lorenzo Frattini
This content is remixed from stackoverflow or stackexchange. Please visit https://salesforce.stackexchange.com/questions/32543