Set up a Solr schema.xml for AEM

Now that we have convinced AEM to use Solr as its indexer, the next step is to create a schema, which Solr uses for index and query processing.

Why do we need a schema?

Solr does not know anything about your data structure, but you want it to perform complex operations like fulltext searches, faceting, etc. To allow Solr to build a fast index, you need to define which fields to index and which operations should be performed on them at index or query time.

There is an excellent book by Trey Grainger and Timothy Potter which gives a good overview of the capabilities of Solr. Although it is written for Solr 5, most of the concepts are the same for Solr 6 and need only minimal adjustments.

By default, Solr 6 uses a managed schema (a file named managed-schema), which allows you to modify the schema through the Schema API. You can change this behavior per core in solrconfig.xml and enable the classic schema.xml, which we’ll use in this example.
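
As a minimal sketch, switching a core to the classic schema usually comes down to one element in that core's solrconfig.xml. The surrounding `<config>` element is omitted here, and I am assuming no other schemaFactory is already declared in the file:

```xml
<!-- solrconfig.xml (per core): read the schema from a hand-edited schema.xml -->
<!-- instead of the API-managed managed-schema file -->
<schemaFactory class="ClassicIndexSchemaFactory"/>
```

With this in place, Solr expects a schema.xml in the core's conf directory, and changes through the Schema API are no longer possible for that core.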

The Jackrabbit project provides a basic core configuration that you can use with Solr 4.x and as the base for a custom configuration. I recommend that you have a look at its schema.xml, which is the base for the following definitions.

Schema.xml for AEM

You can find an example of a basic schema.xml in the aem-solr GitHub repository, which I’ll explain here.

Unique Key

The uniqueKey field is the identity of an indexed document. If a new document with an already existing uniqueKey is indexed, it replaces the existing entry. For structured content like JCR content, the path is a great identifier and is therefore used.
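
A rough sketch of what this looks like inside the `<schema>` element. I am assuming here that the path is kept in a field called path_exact of type string; check the repository's schema.xml for the exact definition:

```xml
<!-- Assumed field: the full JCR path, stored so it can be returned in results -->
<field name="path_exact" type="string" indexed="true" stored="true" required="true"/>

<!-- Documents with the same path_exact value overwrite each other -->
<uniqueKey>path_exact</uniqueKey>
```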

Fields

Path*

Since you most likely want not only to query the complete index but also to restrict your queries to certain paths, some adjustments are required here. The Jackrabbit Oak Solr indexer supports multiple path fields out of the box that should be added to your schema. The documentation also provides some examples where those fields are used.

Note: Only the field path_exact is stored in our index and is therefore retrievable. All other path fields are only used for indexing.
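
To illustrate the idea, here is a hedged sketch of two such helper fields. The names path_child and path_des and the fieldtype names children_path and descendent_path are what I remember from the Oak defaults, so treat them as assumptions and verify them against the Oak documentation. path_exact is assumed to be defined elsewhere as a stored string field:

```xml
<!-- Helper fields: indexed for path restrictions, but not stored -->
<field name="path_child" type="children_path" indexed="true" stored="false"/>
<field name="path_des" type="descendent_path" indexed="true" stored="false"/>

<!-- Fill the helper fields from the stored path at index time -->
<copyField source="path_exact" dest="path_child"/>
<copyField source="path_exact" dest="path_des"/>
```

Because the helper fields are not stored, a query can filter on them, but the result documents only carry path_exact.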

JCR/Sling and DAM attributes

The schema.xml contains some interesting JCR attributes like jcr_title or jcr_lastModified that can be queried as a string or a date (e.g. everything modified before a given date). To allow queries for DAM assets, it also contains the DAM mimetype attributes.
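
A sketch of how the two JCR attributes mentioned above could be declared. The type names string and date refer to fieldtypes that are assumed to exist in the same schema, and the stored flags are my assumption, not taken from the repository:

```xml
<!-- jcr:title as an exact-match string -->
<field name="jcr_title" type="string" indexed="true" stored="true"/>

<!-- jcr:lastModified as a date, which enables range queries -->
<field name="jcr_lastModified" type="date" indexed="true" stored="true"/>

<!-- Example range query with a made-up date, everything modified before 2017: -->
<!--   jcr_lastModified:[* TO 2017-01-01T00:00:00Z] -->
```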

Content attributes

For this example I’ll use three different JCR properties that should be indexed:

Fieldname    Index as
headline     Simple string, no fulltext search
title        Simple string, no fulltext search
text         English text, indexed for fulltext search, suggestions, etc.
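
Expressed as field definitions, the three properties could look roughly like this. The stored="true" flags are an assumption on my side; the types string and text_en are the fieldtypes mentioned in the next section:

```xml
<!-- Exact-match strings: not tokenized, so no fulltext search -->
<field name="headline" type="string" indexed="true" stored="true"/>
<field name="title" type="string" indexed="true" stored="true"/>

<!-- Analyzed English text: tokenized and searchable as fulltext -->
<field name="text" type="text_en" indexed="true" stored="true"/>
```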

Fieldtypes

All fieldtypes you can find in the schema.xml are quite simple and by the book. There are primitive fieldtypes like int or string but also types that support fulltext searches like text_en.
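
For orientation, a simplified sketch of such fieldtypes for Solr 6. The analyzer chain for text_en is a reduced assumption of mine, and the one in the repository may contain more filters such as stop words or synonyms:

```xml
<!-- Primitive types -->
<fieldType name="string" class="solr.StrField" sortMissingLast="true"/>
<fieldType name="int" class="solr.TrieIntField" precisionStep="0" positionIncrementGap="0"/>
<fieldType name="date" class="solr.TrieDateField" precisionStep="0" positionIncrementGap="0"/>

<!-- Analyzed English text: tokenize, lowercase, strip possessives, stem -->
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
```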

The two *_path fieldtypes additionally define rules that replace or group parts of the path at its slashes.
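
As an example of such a rule, this is roughly how a fieldtype for descendant queries can be built with Solr's path hierarchy tokenizer. The name descendent_path is an assumption, and the definition in the repository may differ:

```xml
<fieldType name="descendent_path" class="solr.TextField">
  <!-- Index time: /content/site/en is split at the slashes into -->
  <!-- /content, /content/site and /content/site/en -->
  <analyzer type="index">
    <tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="/"/>
  </analyzer>
  <!-- Query time: the given path is kept as one token -->
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>
```

With this setup a query for /content/site on such a field matches every document whose path starts with /content/site, because all ancestor prefixes were indexed as separate tokens.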

Summary

For a simple AEM application where you want to perform fulltext searches on predefined fields (like text), the provided schema is a good starting point. You can extend it by adding additional fields or by using the copyField mechanism to index more fields into the already defined ones.

If your application uses a property named richText which you want to index, the following definition would copy it into the text field and merge the results:

<copyField source="richText" dest="text"/>

The next post will deal with a sample application you can set up to get better insight into the steps completed so far.
