Search does not provide accurate results and highlights for Japanese content neither on Elasticsearch nor on Solr - DM

Inactive

Description

Similar for WCM: See .

Steps to Reproduce - master/7.0.x

  1. Start Liferay

  2. Set Japanese as the portal's default language: Control Panel - Configuration - Instance Settings - Misc - Default Language

  3. Upload the attached test file(s) to Documents and Media

  4. Upload the attached test file(s) to Documents and Media with the following metadata:

    1. Title: サンプルB

    2. Description: これは東京都品川区で登録したファイルです

  5. Add the Language Selector portlet to the default page

  6. Switch to the Japanese locale

Searching & Results

a. Searched for the string English Japanese

  • SEARCH RESULT: PASS - Document3.txt is displayed

  • HIGHLIGHT: PASS - Both English and Japanese are highlighted as expected in Document3.txt.

b. Searched for the string あいうえお 日本語 (aiueo nihongo)

  • SEARCH RESULT: PASS - Both Document1.txt and Document2.txt are available in the search results.

  • HIGHLIGHT: FAIL - Partially working. 日本語 was highlighted as expected, but あいうえお is NOT fully highlighted; only あい is highlighted.

c. Searching for partial strings such as あいう (aiu)

  • SEARCH RESULT: FAIL - No search results are returned, even though あいう is present in Document1.txt

  • HIGHLIGHT: FAIL - Since there are no search results, nothing can be highlighted.

d. Search for サンプル (sampuru)

  • SEARCH RESULT: PASS - Document サンプルB is visible.

  • HIGHLIGHT: PASS - Only text that says サンプル is highlighted.

e. Search for 推進 (suishin)

  • SEARCH RESULT: PASS - Document is visible

  • HIGHLIGHT: PASS - The string is highlighted in the DM result.

f. Search for 推進部 (suishinbu)

  • SEARCH RESULT: PASS - Document is visible

  • HIGHLIGHT: FAIL - The string is highlighted in the DM result, but the last character of the string (部) is also highlighted on its own.

g. Search for 品川区 (shinagawaku)

  • SEARCH RESULT: PASS - DM result is found.

  • HIGHLIGHT: FAIL - The document contains this string in the actual content, but because the description is the only thing that's displayed, no highlights are present.


Reproduced on master@b7df384c4f71832b3afe5f67d65d3a641d1bbee3
Reproduced with Remote Elasticsearch 2.4.x
Reproduced with Solr: https://dev.liferay.com/discover/deployment/-/knowledge_base/7-0/using-solr - Tested with Liferay Solr 5 Search Engine 1.0.0

  • It also fails after changing the assigned analyzer to text_ja for the fields content, description, subtitle, and title in schema.xml and reindexing.
    Most probably it affects other assets as well.

Attachments

9
Created September 12, 2017 at 6:20 AM
Updated June 26, 2023 at 1:24 AM
Resolved June 1, 2022 at 4:46 AM

Activity

Yasuyuki Takeo, November 7, 2017 at 5:47 PM

Is there a way to use separate analyzers / tokenizers for indexing and for searching? That may also affect the results.
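For reference, Elasticsearch field mappings accept both an `analyzer` (index time) and a `search_analyzer` (query time) parameter. The following is only a sketch of what such a mapping fragment could look like; the field name `title_ja` and the analyzer names `ja_index` and `ja_search` are hypothetical and would have to be defined in the index settings. Whether Liferay's mappings can be customized this way is not confirmed here.

```python
import json


def build_field_mapping(index_analyzer: str, search_analyzer: str,
                        field_type: str = "text") -> dict:
    """Return a mapping fragment with separate index-time and search-time analyzers."""
    return {
        # "string" on Elasticsearch 2.x, "text" on 5.x and later
        "type": field_type,
        # applied when documents are indexed
        "analyzer": index_analyzer,
        # applied to the query string at search time
        "search_analyzer": search_analyzer,
    }


# Hypothetical field and analyzer names, for illustration only.
mapping = {
    "properties": {
        "title_ja": build_field_mapping("ja_index", "ja_search", field_type="string")
    }
}
print(json.dumps(mapping, ensure_ascii=False, indent=2))
```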

Yasu

Yasuyuki Takeo, November 2, 2017 at 8:02 AM

I just tested cases b, c, and f. These three cases are not bugs; they follow from how Kuromoji is designed.

I checked the response Elasticsearch returns for each query, and Liferay correctly reflects the data that Elasticsearch returns.

The behavior comes from Kuromoji's design: the morphological analysis engine tokenizes text based on its algorithms and an embedded dictionary.
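One way to see how a string is tokenized is the Elasticsearch `_analyze` API. The sketch below only builds the request (URL and JSON body) and does not send it. The host and the index name `liferay-0` are assumptions, and `kuromoji` is the analyzer registered by the analysis-kuromoji plugin.

```python
import json


def build_analyze_request(host: str, index: str, analyzer: str, text: str):
    """Build the URL and JSON body for an _analyze call (nothing is sent)."""
    url = f"{host}/{index}/_analyze"
    body = json.dumps({"analyzer": analyzer, "text": text}, ensure_ascii=False)
    return url, body


# Host and index name are hypothetical; adjust them to the actual cluster.
url, body = build_analyze_request(
    "http://localhost:9200", "liferay-0", "kuromoji", "あいうえお"
)
print("POST", url)
print(body)
```

The response lists the tokens the analyzer produced, which can then be compared with what Liferay highlights.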

Developers tuning Japanese search usually have to adjust boosts, add words to the dictionary, or index multiple fields and combine them to meet the requirements of each project.
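As an illustration of combining several index fields with boosts, here is a sketch that builds a `multi_match` query body. The field names are hypothetical: `title_ja` stands for a Kuromoji-analyzed field and `title_ja.ngram` for an assumed n-gram sub-field added to catch partial strings. Neither is taken from Liferay's actual mappings.

```python
import json


def build_boosted_query(keywords: str, fields_with_boost: dict) -> dict:
    """Build a multi_match query that searches several fields with per-field boosts."""
    # Elasticsearch expresses a per-field boost as "field^boost".
    fields = [f"{name}^{boost}" for name, boost in fields_with_boost.items()]
    return {"query": {"multi_match": {"query": keywords, "fields": fields}}}


# Hypothetical fields: matches on the analyzed title count three times as much.
query = build_boosted_query("あいう", {"title_ja": 3, "title_ja.ngram": 1})
print(json.dumps(query, ensure_ascii=False, indent=2))
```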

In terms of case b,

The string "あいうえお" is roughly the equivalent of "abcde" in English and has no meaning as a word. As a result, it is tokenized into "あい" and "うえ", which makes sense in Japanese. If you really want to highlight the whole "あいうえお" as one word, you may want to add "あいうえお" to the dictionary as a noun.
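A sketch of what adding such a word could look like with the Elasticsearch Kuromoji plugin, which reads user dictionary lines in the form `text,tokens,readings,part-of-speech`. The file name `userdict_ja.txt` and the tokenizer and analyzer names are assumptions for illustration; this has not been tested against the Liferay index settings.

```python
import json


def user_dictionary_entry(surface: str, reading: str, pos: str = "カスタム名詞") -> str:
    """Return one user dictionary line that keeps `surface` as a single token."""
    return f"{surface},{surface},{reading},{pos}"


entry = user_dictionary_entry("あいうえお", "アイウエオ")
print(entry)  # あいうえお,あいうえお,アイウエオ,カスタム名詞

# Index settings that point a kuromoji_tokenizer at the dictionary file.
# The file is resolved relative to the Elasticsearch config directory.
settings = {
    "settings": {
        "index": {
            "analysis": {
                "tokenizer": {
                    "kuromoji_user_dict": {  # hypothetical tokenizer name
                        "type": "kuromoji_tokenizer",
                        "user_dictionary": "userdict_ja.txt",  # hypothetical file name
                    }
                },
                "analyzer": {
                    "ja_with_userdict": {  # hypothetical analyzer name
                        "type": "custom",
                        "tokenizer": "kuromoji_user_dict",
                    }
                },
            }
        }
    }
}
print(json.dumps(settings, ensure_ascii=False, indent=2))
```

A reindex would be needed after changing the dictionary so that existing documents are tokenized with the new entry.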

In terms of case c,

I looked into analyzer as

then only あい and お are tokenized. (Kuromoji normalizes both Kanji and Hiragana to Katakana, so アイ and あい, and オ and お, are treated as the same.)

but あいう is tokenized as

So it doesn't match. This also makes sense.

In terms of case f,

推進部 is tokenized as follows

 

Again, the Kanji are converted into Katakana. It looks like 推進 (スイシン) and 部 (ブ) are tokenized as separate words, so it makes sense that 部 on its own also hits.
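To make the effect in case f concrete, here is a simplified model of token-based highlighting. It is not the Elasticsearch highlighter, only a sketch of the idea: every document token that equals one of the query tokens is wrapped in highlight tags. The query tokens are the ones described above; the document token list is a hypothetical example.

```python
def highlight(doc_tokens, query_tokens, pre="<em>", post="</em>"):
    """Wrap every document token that matches a query token in highlight tags."""
    wanted = set(query_tokens)
    return "".join(f"{pre}{t}{post}" if t in wanted else t for t in doc_tokens)


# 推進部 is split into two tokens, as described above.
query_tokens = ["推進", "部"]
# Hypothetical, already tokenized document text.
doc_tokens = ["営業", "推進", "部", "と", "開発", "部"]

print(highlight(doc_tokens, query_tokens))
# 営業<em>推進</em><em>部</em>と開発<em>部</em>
```

Because 部 is a token of its own, the second 部 is highlighted even though it is not part of 推進部.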

Yasuyuki Takeo, October 31, 2017 at 5:25 PM

and are currently working on manual testing with new mappings according to 's request.

Yasuyuki Takeo, October 11, 2017 at 8:41 AM

I just created a pull request: https://github.com/BryanEngler/liferay-portal/pull/220.
Please let me know if something is missing.

Jordi Rodó, September 13, 2017 at 11:12 PM

Hey Tibor,

Just playing with ES and Kuromoji suggests that the "incorrect" highlighting is caused by how text gets tokenized under the current mapping and settings:

Looks like morphological analysis is being performed. Did the same in Solr/Kuromoji and looks like we get the same results.

(see attached aiueo.png, "text" and "baseForm" rows).

This makes me wonder whether we should only check that Liferay displays and highlights what the search engine specifies, and leave accuracy to search engine tuning.

Regards,
Jordi