Merge pull request #10887 from vera/feat/solr-field-types
feat: index numerical and date fields in Solr with appropriate types + more targeted search result highlighting
ofahimIQSS authored Jan 6, 2025
2 parents 2c471ae + 541946e commit 825ab15
Showing 7 changed files with 528 additions and 66 deletions.
78 changes: 40 additions & 38 deletions conf/solr/schema.xml

Large diffs are not rendered by default.

82 changes: 82 additions & 0 deletions doc/release-notes/10887-solr-field-types.md
@@ -0,0 +1,82 @@
This release improves how numerical and date fields are indexed in Solr. Previously, all such fields were indexed as English text (`text_en`); with this update:

* Integer fields are indexed as `plong`
* Float fields are indexed as `pdouble`
* Date fields are indexed as `date_range` (`solr.DateRangeField`)

Specifically, the following fields were updated:

- coverage.Depth
- coverage.ObjectCount
- coverage.ObjectDensity
- coverage.Redshift.MaximumValue
- coverage.Redshift.MinimumValue
- coverage.RedshiftValue
- coverage.SkyFraction
- coverage.Spectral.CentralWavelength
- coverage.Spectral.MaximumWavelength
- coverage.Spectral.MinimumWavelength
- coverage.Temporal.StartTime
- coverage.Temporal.StopTime
- dateOfCollectionEnd
- dateOfCollectionStart
- dateOfDeposit
- distributionDate
- dsDescriptionDate
- journalPubDate
- productionDate
- resolution.Redshift
- targetSampleActualSize
- timePeriodCoveredEnd
- timePeriodCoveredStart

This change enables range queries when searching from both the UI and the API, such as `dateOfDeposit:[2000-01-01 TO 2014-12-31]` or `targetSampleActualSize:[25 TO 50]`.
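
As a rough illustration of the API side, the sketch below builds a URL for the Search API (`/api/search?q=...`) from a range clause. The class name `RangeQueryUrl`, its helper methods, and the `localhost:8080` base URL are assumptions made for this example, not part of Dataverse. The clause has to be URL-encoded because of the brackets, colon, and spaces.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the Dataverse code base.
final class RangeQueryUrl {
    // Assumed base URL of a local installation; adjust for your server.
    static final String SEARCH_ENDPOINT = "http://localhost:8080/api/search";

    /** Builds an inclusive range clause of the form field:[lower TO upper]. */
    static String rangeClause(String field, String lower, String upper) {
        return field + ":[" + lower + " TO " + upper + "]";
    }

    /** URL-encodes the clause and passes it as the q parameter. */
    static String searchUrl(String clause) {
        return SEARCH_ENDPOINT + "?q=" + URLEncoder.encode(clause, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(searchUrl(rangeClause("dateOfDeposit", "2000-01-01", "2014-12-31")));
        System.out.println(searchUrl(rangeClause("targetSampleActualSize", "25", "50")));
    }
}
```

The printed URLs can be passed to `curl` or any HTTP client. Range queries on these fields only return sensible results once the schema has been updated and the datasets reindexed.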

Dataverse administrators must update their Solr schema.xml (manually or by rerunning `update-fields.sh`) and reindex all datasets.

Additionally, search result highlighting is now more accurate, ensuring that only fields relevant to the query are highlighted in search results. If the query is specifically limited to certain fields, the highlighting is now limited to those fields as well.

## Upgrade Instructions

7\. Update the Solr schema.xml file. Start with the standard v6.5 schema.xml, then, if your installation uses any custom or experimental metadata blocks, update it to include the extra fields (step 7a).

Stop Solr (usually `service solr stop`, depending on your Solr installation and OS; see the [Installation Guide](https://guides.dataverse.org/en/6.5/installation/prerequisites.html#solr-init-script)).

```shell
service solr stop
```

Replace schema.xml

```shell
wget https://raw.githubusercontent.com/IQSS/dataverse/v6.5/conf/solr/schema.xml
cp schema.xml /usr/local/solr/solr-9.4.1/server/solr/collection1/conf
```

Start Solr (but if you use any custom metadata blocks, perform the next step, 7a, first).

```shell
service solr start
```

7a\. For installations with custom or experimental metadata blocks:

Before starting Solr, update the schema to include all the extra metadata fields that your installation uses. We do this by collecting the output of the Dataverse schema API and feeding it to the `update-fields.sh` script that we supply, as in the example below (modify the command lines as needed to reflect the names of the directories, if different):

```shell
wget https://raw.githubusercontent.com/IQSS/dataverse/v6.5/conf/solr/update-fields.sh
chmod +x update-fields.sh
curl "http://localhost:8080/api/admin/index/solr/schema" | ./update-fields.sh /usr/local/solr/solr-9.4.1/server/solr/collection1/conf/schema.xml
```

Now start Solr.

8\. Reindex Solr

Below is the simplest way to reindex Solr:

```shell
curl http://localhost:8080/api/admin/index
```

The API above rebuilds the existing index "in place". If you want to be absolutely sure that your index is up-to-date and consistent, consider wiping it clean and reindexing everything from scratch (see [the guides](https://guides.dataverse.org/en/latest/admin/solr-search-index.html)). Note that, depending on the size of your database, a full reindex may take a while, and users will see incomplete search results during that window.
9 changes: 4 additions & 5 deletions src/main/java/edu/harvard/iq/dataverse/DatasetFieldType.java
@@ -531,15 +531,14 @@ public String getDisplayName() {
public SolrField getSolrField() {
SolrField.SolrType solrType = SolrField.SolrType.TEXT_EN;
if (fieldType != null) {

/**
* @todo made more decisions based on fieldType: index as dates,
* integers, and floats so we can do range queries etc.
*/
if (fieldType.equals(FieldType.DATE)) {
solrType = SolrField.SolrType.DATE;
} else if (fieldType.equals(FieldType.EMAIL)) {
solrType = SolrField.SolrType.EMAIL;
} else if (fieldType.equals(FieldType.INT)) {
solrType = SolrField.SolrType.INTEGER;
} else if (fieldType.equals(FieldType.FLOAT)) {
solrType = SolrField.SolrType.FLOAT;
}

Boolean parentAllowsMultiplesBoolean = false;
100 changes: 79 additions & 21 deletions src/main/java/edu/harvard/iq/dataverse/search/IndexServiceBean.java
@@ -27,6 +27,8 @@
import java.sql.Timestamp;
import java.text.SimpleDateFormat;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.ArrayList;
import java.util.Calendar;
import java.util.Collection;
@@ -44,6 +46,7 @@
import java.util.function.Function;
import java.util.logging.Level;
import java.util.logging.Logger;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import jakarta.annotation.PostConstruct;
import jakarta.annotation.PreDestroy;
@@ -1065,34 +1068,89 @@ public SolrInputDocuments toSolrDocs(IndexableDataset indexableDataset, Set<Long
if (dsfType.getSolrField().getSolrType().equals(SolrField.SolrType.EMAIL)) {
// no-op. we want to keep email address out of Solr per
// https://github.com/IQSS/dataverse/issues/759
} else if (dsfType.getSolrField().getSolrType().equals(SolrField.SolrType.INTEGER)) {
// we need to filter invalid integer values, because otherwise the whole document will
// fail to be indexed
Pattern intPattern = Pattern.compile("^-?\\d+$");
List<String> indexableValues = dsf.getValuesWithoutNaValues().stream()
.filter(s -> intPattern.matcher(s).find())
.collect(Collectors.toList());
solrInputDocument.addField(solrFieldSearchable, indexableValues);
if (dsfType.getSolrField().isFacetable()) {
solrInputDocument.addField(solrFieldFacetable, indexableValues);
}
} else if (dsfType.getSolrField().getSolrType().equals(SolrField.SolrType.FLOAT)) {
// same as for integer values, we need to filter invalid float values
List<String> indexableValues = dsf.getValuesWithoutNaValues().stream()
.filter(s -> {
try {
Double.parseDouble(s);
return true;
} catch (NumberFormatException e) {
return false;
}
})
.collect(Collectors.toList());
solrInputDocument.addField(solrFieldSearchable, indexableValues);
if (dsfType.getSolrField().isFacetable()) {
solrInputDocument.addField(solrFieldFacetable, indexableValues);
}
} else if (dsfType.getSolrField().getSolrType().equals(SolrField.SolrType.DATE)) {
// Solr accepts dates in the ISO-8601 format, e.g. YYYY-MM-DDThh:mm:ssZ, YYYY-MM-DD, YYYY-MM, YYYY
// See: https://solr.apache.org/guide/solr/latest/indexing-guide/date-formatting-math.html
// If dates have been entered in other formats, we need to skip or convert them
// TODO at the moment we are simply skipping, but converting them would offer more value for search
// For use in facets, we index only the year (YYYY)
String dateAsString = "";
if (!dsf.getValues_nondisplay().isEmpty()) {
dateAsString = dsf.getValues_nondisplay().get(0);
}
dateAsString = dsf.getValues_nondisplay().get(0).trim();
}

logger.fine("date as string: " + dateAsString);

if (dateAsString != null && !dateAsString.isEmpty()) {
SimpleDateFormat inputDateyyyy = new SimpleDateFormat("yyyy", Locale.ENGLISH);
try {
/**
* @todo when bean validation is working we
* won't have to convert strings into dates
*/
logger.fine("Trying to convert " + dateAsString + " to a YYYY date from dataset " + dataset.getId());
Date dateAsDate = inputDateyyyy.parse(dateAsString);
SimpleDateFormat yearOnly = new SimpleDateFormat("yyyy");
String datasetFieldFlaggedAsDate = yearOnly.format(dateAsDate);
logger.fine("YYYY only: " + datasetFieldFlaggedAsDate);
// solrInputDocument.addField(solrFieldSearchable,
// Integer.parseInt(datasetFieldFlaggedAsDate));
solrInputDocument.addField(solrFieldSearchable, datasetFieldFlaggedAsDate);
if (dsfType.getSolrField().isFacetable()) {
// solrInputDocument.addField(solrFieldFacetable,
boolean dateValid = false;

DateTimeFormatter[] possibleFormats = {
DateTimeFormatter.ISO_INSTANT,
DateTimeFormatter.ofPattern("yyyy-MM-dd"),
DateTimeFormatter.ofPattern("yyyy-MM"),
DateTimeFormatter.ofPattern("yyyy")
};
for (DateTimeFormatter format : possibleFormats){
try {
format.parse(dateAsString);
dateValid = true;
} catch (DateTimeParseException e) {
// no-op, date is invalid
}
}

if (!dateValid) {
logger.fine("couldn't index " + dsf.getDatasetFieldType().getName() + ":" + dsf.getValues() + " because it's not a valid date format according to Solr");
} else {
SimpleDateFormat inputDateyyyy = new SimpleDateFormat("yyyy", Locale.ENGLISH);
try {
/**
* @todo when bean validation is working we
* won't have to convert strings into dates
*/
logger.fine("Trying to convert " + dateAsString + " to a YYYY date from dataset " + dataset.getId());
Date dateAsDate = inputDateyyyy.parse(dateAsString);
SimpleDateFormat yearOnly = new SimpleDateFormat("yyyy");
String datasetFieldFlaggedAsDate = yearOnly.format(dateAsDate);
logger.fine("YYYY only: " + datasetFieldFlaggedAsDate);
// solrInputDocument.addField(solrFieldSearchable,
// Integer.parseInt(datasetFieldFlaggedAsDate));
solrInputDocument.addField(solrFieldFacetable, datasetFieldFlaggedAsDate);
solrInputDocument.addField(solrFieldSearchable, dateAsString);
if (dsfType.getSolrField().isFacetable()) {
// solrInputDocument.addField(solrFieldFacetable,
// Integer.parseInt(datasetFieldFlaggedAsDate));
solrInputDocument.addField(solrFieldFacetable, datasetFieldFlaggedAsDate);
}
} catch (Exception ex) {
logger.info("unable to convert " + dateAsString + " into YYYY format and couldn't index it (" + dsfType.getName() + ")");
}
} catch (Exception ex) {
logger.info("unable to convert " + dateAsString + " into YYYY format and couldn't index it (" + dsfType.getName() + ")");
}
}
} else {
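
The value filtering in the hunk above can be tried in isolation with a sketch like the following. It mirrors the regular expression, the `Double.parseDouble` check, and the four date formats from the diff. The class `IndexableValueSketch` and its method names are hypothetical and exist only for this example.

```java
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical standalone sketch; it is not part of IndexServiceBean.
final class IndexableValueSketch {
    // Same pattern as in the diff: optional minus sign followed by digits only.
    private static final Pattern INT_PATTERN = Pattern.compile("^-?\\d+$");

    // Same candidate formats as in the diff, tried in order.
    private static final List<DateTimeFormatter> DATE_FORMATS = List.of(
            DateTimeFormatter.ISO_INSTANT,
            DateTimeFormatter.ofPattern("yyyy-MM-dd"),
            DateTimeFormatter.ofPattern("yyyy-MM"),
            DateTimeFormatter.ofPattern("yyyy"));

    static boolean looksLikeInteger(String value) {
        return INT_PATTERN.matcher(value).find();
    }

    static boolean looksLikeFloat(String value) {
        try {
            Double.parseDouble(value);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    static boolean looksLikeSolrDate(String value) {
        String trimmed = value.trim();
        for (DateTimeFormatter format : DATE_FORMATS) {
            try {
                format.parse(trimmed);
                return true; // one matching format is enough
            } catch (DateTimeParseException e) {
                // try the next format
            }
        }
        return false;
    }

    public static void main(String[] args) {
        for (String v : List.of("42", "-7", "4.2", "1e-3", "n/a")) {
            System.out.println(v + " int=" + looksLikeInteger(v) + " float=" + looksLikeFloat(v));
        }
        for (String v : List.of("2020", "2020-01", "2020-01-15", "2020-01-15T10:00:00Z", "15/01/2020")) {
            System.out.println(v + " date=" + looksLikeSolrDate(v));
        }
    }
}
```

One observation from this sketch: the integer pattern checks only the shape of the string, so a digit string too long for a 64-bit value still passes. Assuming `plong` is backed by a 64-bit long field type, such a value could still be rejected by Solr at indexing time.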
@@ -280,7 +280,7 @@ public SolrQueryResponse search(
List<DatasetFieldType> datasetFields = datasetFieldService.findAllOrderedById();
Map<String, String> solrFieldsToHightlightOnMap = new HashMap<>();
if (addHighlights) {
solrQuery.setHighlight(true).setHighlightSnippets(1);
solrQuery.setHighlight(true).setHighlightSnippets(1).setHighlightRequireFieldMatch(true);
Integer fragSize = systemConfig.getSearchHighlightFragmentSize();
if (fragSize != null) {
solrQuery.setHighlightFragsize(fragSize);
@@ -63,7 +63,7 @@ public enum SolrType {
* support range queries) in
* https://github.com/IQSS/dataverse/issues/370
*/
STRING("string"), TEXT_EN("text_en"), INTEGER("int"), LONG("long"), DATE("text_en"), EMAIL("text_en");
STRING("string"), TEXT_EN("text_en"), INTEGER("plong"), FLOAT("pdouble"), DATE("date_range"), EMAIL("text_en");

private String type;
