Search does not provide accurate results and highlights for Japanese content neither on Elasticsearch nor on Solr - DM

Inactive

Description

Similar for WCM: See .

Steps to Reproduce - master/7.0.x

Start Liferay
Set Japanese as the portal's default language: Control Panel - Configuration - Instance Settings - Misc - Default Language
Upload

to Documents and Media
Upload

to Documents and Media with the following metadata:
1. Title: サンプルＢ
2. Description: これは東京都品川区で登録したファイルです
Add Language Selector portlet to the default page
Switch to Japanese locale

Searching & Results

a. Searched for the string English Japanese

SEARCH RESULT: PASS - Document3.txt is displayed
HIGHLIGHT: PASS - Both English and Japanese are highlighted as expected in Document3.txt.

b. Searched for the string あいうえお　日本語 (aiueo nihongo)

SEARCH RESULT: PASS - Both Document1.txt and Document2.txt are available in the search results.
HIGHLIGHT: FAIL - Partially working. 日本語 was highlighted as expected, but あいうえお is NOT highlighted as a whole; only あい is highlighted.

c. Searching for partial strings such as あいう (aiu)

SEARCH RESULT: FAIL - No search results are returned, even though Document1.txt is expected to contain あいう.
HIGHLIGHT: FAIL - Since there are no search results, nothing can be highlighted.

d. Search for サンプル (sampuru)

SEARCH RESULT: PASS - Document サンプルB is visible.
HIGHLIGHT: PASS - Only text that says サンプル is highlighted.

e. Search for 推進 (suishin)

SEARCH RESULT: PASS - Document is visible
HIGHLIGHT: PASS - The string is highlighted in the DM result.

f. Search for 推進部 (suishinbu)

SEARCH RESULT: PASS - Document is visible
HIGHLIGHT: FAIL - The string is highlighted in the DM result, but, strangely, the last character of the string, 部, is also highlighted on its own.

g. Search for 品川区 (shinagawaku)

SEARCH RESULT: PASS - DM result is found.
HIGHLIGHT: FAIL - The document contains this string in the actual content, but because the description is the only thing that's displayed, no highlights are present.

Reproduced on master@b7df384c4f71832b3afe5f67d65d3a641d1bbee3
Reproduced with Remote Elasticsearch 2.4.x
Reproduced with Solr: https://dev.liferay.com/discover/deployment/-/knowledge_base/7-0/using-solr - Tested with Liferay Solr 5 Search Engine 1.0.0

It also does not work if you change the assigned analyzer to text_ja for the fields content, description, subtitle, and title in schema.xml and then reindex.
Most probably it affects other assets as well.

Attachments

Linked work items

depends on

LPSA-53497: When searching, text is not highlighted if you search for multiple words that appeared consecutively in the same order in the document
LPSA-49761: Only the last character of Japanese text is highlighted
LPD-13801: Improved Japanese Localized Search: Tweak Kuromoji analyzer for search needs to match and highlight properly and Synonyms support
LPSA-73421: Multi Language Search for DM

fixes

LPSA-56182: Japanese text in the DM content field is being sent to the English analyzer in Elasticsearch

Details
Assignee: SE Support
Reporter: Tibor Lipusz
Labels:
7.2-won't7.3-known-issues7.4-known-issuesReleaseValveDXPSolr2017SEPee-tsliferay-ga1-dxp-7413liferay-ga10-ce-743liferay-ga11-ce-743liferay-ga12-ce-743liferay-ga13-ce-743-known-issueliferay-ga14-ce-743-known-issuesliferay-ga15-ce-743-known-issuesliferay-ga16-ce-743-known-issuesliferay-ga17-ce-743-known-issuesliferay-ga18-ce-743-known-issuesliferay-ga19-ce-743-known-issuesliferay-ga2-ce-741liferay-ga20-ce-743-known-issuesliferay-ga21-ce-743-known-issuesliferay-ga22-ce-743-known-issuesliferay-ga23-ce-743-known-issuesliferay-ga24-ce-743-known-issuesliferay-ga25-ce-743-known-issuesliferay-ga26-ce-743-known-issuesliferay-ga27-ce-743-known-issuesliferay-ga4-ce-733liferay-ga4-ce-743liferay-ga5-ce-734liferay-ga5-ce-743liferay-ga6-ce-735liferay-ga6-ce-743liferay-ga7-ce-736liferay-ga7-ce-743liferay-ga8-ce-737liferay-ga8-ce-743liferay-ga9-ce-743liferay-u1-dxp-7413liferay-u2-dxp-7413lima-board-outreviewed-by-pm
Epic/Theme: Search_Tech_Debt
Fix Priority: 4
Development End Date: Jun 01, 2022, 4:46 AM
Components:
Affects versions: 7.0.0 DXP FP30, 7.0.X, 7.1.1 CE GA2, 7.1.10 DXP FP4
Priority: Medium

Created September 12, 2017 at 6:20 AM

Updated June 26, 2023 at 1:24 AM

Resolved June 1, 2022 at 4:46 AM

Activity

Yasuyuki Takeo, November 7, 2017 at 5:47 PM (edited)

Is there a way to use a separate analyzer/tokenizer for indexing and for searching? That may also affect the results.

Yasu
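
For reference, Elasticsearch field mappings accept both an `analyzer` (used at index time) and a `search_analyzer` (used at query time). The sketch below only builds such a mapping as a Python dict; the field name `title_ja` and the analyzer names `kuromoji_index` and `kuromoji_search` are hypothetical and would have to be defined in the index settings. It is an illustration under those assumptions, not Liferay's actual mapping.

```python
import json


def build_field_mapping(index_analyzer: str, search_analyzer: str,
                        field_type: str = "text") -> dict:
    """Build a field mapping that analyzes text differently at index and query time.

    field_type is "text" on Elasticsearch 5.x and later; on 2.x the
    equivalent analyzed type is "string".
    """
    return {
        "type": field_type,
        "analyzer": index_analyzer,          # applied when documents are indexed
        "search_analyzer": search_analyzer,  # applied to the query string
    }


# Hypothetical field and analyzer names, for illustration only.
mapping = {
    "properties": {
        "title_ja": build_field_mapping("kuromoji_index", "kuromoji_search"),
    }
}

print(json.dumps(mapping, ensure_ascii=False, indent=2))
```

Whether splitting the analyzers helps here depends on how each one tokenizes; if the two produce incompatible tokens, matching can get worse rather than better.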

Yasuyuki Takeo, November 2, 2017 at 8:02 AM

I just tested cases b, c, and f. I found that these three cases are not bugs; they follow from the design of Kuromoji.

I checked the response Elasticsearch returns for each query, and Liferay correctly reflects the data that Elasticsearch returns.

The behavior comes from how Kuromoji is designed: the morphological engine tokenizes words based on its algorithms and an embedded dictionary.

Developers tuning Japanese search usually have to adjust boosts, add certain words to the dictionary, or use multiple index fields and combine them to meet the requirements of each project.

In terms of case b,

The string "あいうえお" is roughly the equivalent of "abcde" in English; it does not mean anything as a word. As a result, it is tokenized into "あい" and "うえ", which makes sense for Japanese. If you really want to highlight the whole "あいうえお" as one word, you may want to add "あいうえお" to the dictionary as a noun.
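
As a rough illustration of adding a word as a noun: Kuromoji user dictionaries are CSV files with one entry per line in the form `text,tokens,readings,part-of-speech`. The helper below is a hypothetical sketch that formats a single-token entry; the function name and the default tag `カスタム名詞` ("custom noun") are choices made for this example. How the file is then wired in (for example, through the `user_dictionary` setting of the Elasticsearch `kuromoji_tokenizer`) depends on the deployment.

```python
def user_dictionary_entry(surface: str, reading: str,
                          pos: str = "カスタム名詞") -> str:
    """Format one Kuromoji user-dictionary line that keeps `surface` as a single token.

    Line format: text,tokens,readings,part-of-speech
    """
    for value in (surface, reading, pos):
        # Commas separate columns and spaces separate tokens, so neither
        # may appear inside a single-token entry.
        if "," in value or any(ch.isspace() for ch in value):
            raise ValueError(f"unsupported character in {value!r}")
    return f"{surface},{surface},{reading},{pos}"


entry = user_dictionary_entry("あいうえお", "アイウエオ")
print(entry)  # あいうえお,あいうえお,アイウエオ,カスタム名詞
```

After the dictionary changes, existing documents need to be reindexed before the new tokenization takes effect.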

In terms of case c,

I looked into analyzer as

Then only あい and お are tokenized. (Kuromoji normalizes kanji and hiragana to katakana, so アイ and あい, and オ and お, are the same.)

but あいう is tokenized as

So it doesn't match. This also makes sense.
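
Token output like this can be inspected with the Elasticsearch `_analyze` API. The sketch below only constructs the HTTP request and does not send it; the host `http://localhost:9200` and the index name `my-index` are assumptions, and `kuromoji` is the analyzer name registered by the Kuromoji analysis plugin.

```python
import json
import urllib.request


def build_analyze_request(base_url: str, index: str, analyzer: str,
                          text: str) -> urllib.request.Request:
    """Build (but do not send) an _analyze request for the given text."""
    body = json.dumps({"analyzer": analyzer, "text": text},
                      ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical host and index name.
request = build_analyze_request("http://localhost:9200", "my-index",
                                "kuromoji", "あいう")
print(request.full_url)
print(request.data.decode("utf-8"))

# Against a running cluster, the tokens could then be read with:
# with urllib.request.urlopen(request) as response:
#     for token in json.load(response)["tokens"]:
#         print(token["token"])
```

Comparing the tokens returned for the query string with those returned for the document text shows whether a match is possible at all.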

In terms of case f,

推進部 is tokenized as follows

Again, the kanji are converted to katakana. It looks like 推進 (スイシン) and 部 (ブ) are tokenized as separate words, so it makes sense that 部 on its own also hits.
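
The effect can be mimicked with a toy model in which every query token is highlighted wherever it occurs. This is not how the Elasticsearch or Solr highlighters are implemented; the function and the sample sentence are made up for this example, and only the two tokens 推進 and 部 come from the analysis above.

```python
def highlight(text: str, tokens: list[str],
              pre: str = "<em>", post: str = "</em>") -> str:
    """Wrap every occurrence of any token in `text` with highlight tags."""
    ordered = sorted(set(tokens), key=len, reverse=True)  # prefer longer tokens
    out = []
    i = 0
    while i < len(text):
        for token in ordered:
            if token and text.startswith(token, i):
                out.append(f"{pre}{token}{post}")
                i += len(token)
                break
        else:
            # No token starts here; copy the character through unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


query_tokens = ["推進", "部"]   # tokens reported for the query 推進部
sample = "営業推進部と開発部"    # hypothetical document text
print(highlight(sample, query_tokens))
# 営業<em>推進</em><em>部</em>と開発<em>部</em>
```

Because 部 is a token of its own, it is marked in 開発部 as well, even though that word is unrelated to the query.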

Yasuyuki Takeo, October 31, 2017 at 5:25 PM

and are currently working on manual testing with new mappings according to 's request.

Yasuyuki Takeo, October 11, 2017 at 8:41 AM

,
I just created a pull request to , https://github.com/BryanEngler/liferay-portal/pull/220.
Please let me know if something is missing.

Jordi Rodó, September 13, 2017 at 11:12 PM

Hey Tibor,

Playing with ES and Kuromoji suggests that the "incorrect" highlighting is caused by how the text gets tokenized under the current mapping and settings:

It looks like morphological analysis is being performed. I did the same in Solr/Kuromoji, and it looks like we get the same results.

(See the attached aiueo.png, "text" and "baseForm" rows.)

This makes me wonder whether we should only verify that Liferay displays and highlights what the search engine returns, and leave accuracy to search engine tuning.

Regards,
Jordi
