Import Wikipedia Pages into AEM

Testing a search engine such as [Elasticsearch in AEM]({{ site.baseurl }}{% post_url 2017-01-18-elasticsearch-aem %}) requires a reasonable number of pages that you can index and search.

There are several great sources of content available, but (of course) none of them provides an export into AEM. For example, you can download dumps from Stack Overflow[1] or Wikipedia[2]. I think Wikipedia dumps are an especially interesting source of content because they are available in a large number of languages (e.g. German[3]), so language-specific searches can be tested too.

I created a simple tool that takes a dump and builds a structure that can be included in a content package and installed in AEM through the Package Manager[4].

For each Wikipedia page, a single AEM page is created using the wcm.io Sample Application Templates[5].
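To illustrate the idea, here is a minimal sketch of how such a conversion could work. It is not the tool's actual code: the class `WikiDumpSketch`, its methods, and the node-naming rule are hypothetical. It assumes the standard MediaWiki export layout (`<page>` elements containing a `<title>`), streams the dump with StAX so that large files do not have to fit into memory, and writes one FileVault-style `.content.xml` per page. Only the title is stored; the template reference and the article text are left out.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class WikiDumpSketch {

    /** Hypothetical naming rule: lower-case ASCII letters and digits, joined by dashes. */
    static String nodeName(String title) {
        String name = title.toLowerCase(Locale.ROOT)
                .replaceAll("[^a-z0-9]+", "-")
                .replaceAll("^-+|-+$", "");
        return name.isEmpty() ? "page" : name;
    }

    /** Escapes a value for use inside a double-quoted XML attribute. */
    static String escapeXml(String value) {
        return value.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    /** Writes <outputDir>/<node-name>/.content.xml describing a cq:Page with a title. */
    static Path writePage(Path outputDir, String title) throws IOException {
        // Note: titles that map to the same node name overwrite each other in this sketch.
        Path pageDir = outputDir.resolve(nodeName(title));
        Files.createDirectories(pageDir);
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                + "<jcr:root xmlns:jcr=\"http://www.jcp.org/jcr/1.0\""
                + " xmlns:cq=\"http://www.day.com/jcr/cq/1.0\"\n"
                + "    jcr:primaryType=\"cq:Page\">\n"
                + "  <jcr:content jcr:primaryType=\"cq:PageContent\"\n"
                + "      jcr:title=\"" + escapeXml(title) + "\"/>\n"
                + "</jcr:root>\n";
        Path file = pageDir.resolve(".content.xml");
        Files.write(file, xml.getBytes(StandardCharsets.UTF_8));
        return file;
    }

    /** Streams the dump and writes one page per <page> element; returns the page count. */
    static int convert(InputStream dump, Path outputDir)
            throws XMLStreamException, IOException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        XMLStreamReader reader = factory.createXMLStreamReader(dump);
        int count = 0;
        boolean inPage = false;
        String title = null;
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                String name = reader.getLocalName();
                if ("page".equals(name)) {
                    inPage = true;
                    title = null;
                } else if (inPage && "title".equals(name)) {
                    title = reader.getElementText();
                }
            } else if (event == XMLStreamConstants.END_ELEMENT
                    && "page".equals(reader.getLocalName())) {
                if (title != null) {
                    writePage(outputDir, title);
                    count++;
                }
                inPage = false;
            }
        }
        reader.close();
        return count;
    }

    public static void main(String[] args) throws Exception {
        // Usage: java WikiDumpSketch <dump.xml> <output-dir>
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println(convert(in, Paths.get(args[1])) + " pages written");
        }
    }
}
```

A real converter would also need to handle the article text, name collisions, and titles that require additional escaping in FileVault's property format, all of which this sketch ignores.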

Usage

After checking out the GitHub repository[6], build the project with Maven. You can then run the tool against a downloaded dump (or one of the provided examples).

java -jar target/wiki2aem-1.0-SNAPSHOT-jar-with-dependencies.jar wiki_dump_small.xml output

You’ll now find all the pages in the given output folder.

Footnotes
