#34201 MCP server update Search #34267
base: main
Conversation
…t will list the contents of a folder when queried. It will then load the raw json into the context (optional) to be used by the MCP server for further sorting and filtering.
…use the new Content Drive API. Updated the LLM search description with new directives.
…drive API instead of having the MCP server create lucene queries. REF: #34201
| this.serviceLogger.error('Invalid drive search parameters', validated.error); | ||
| throw new Error( | ||
| 'Invalid search parameters: ' + JSON.stringify(validated.error.format()) | ||
| 'Invalid drive search parameters: ' + JSON.stringify(validated.error.format()) |
Are we sure we want to call this "drive search", or should it be just "search" or "asset search"?
I called it drive because that is the name of the endpoint we are hitting for the search. I can change it to whatever you'd like.
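The pattern in the diff above (validate the parameters, then throw with the serialized details) can be sketched without any dependencies. This is only an illustration: `validateDriveSearchParams`, the three fields it checks, and the fallback values are assumptions made for the example, and the PR itself validates with a schema and `validated.error.format()`.

```typescript
// Hypothetical subset of the search parameters, chosen for illustration only.
interface DriveSearchParams {
  assetPath: string;
  offset: number;
  maxResults: number;
}

function isNonNegativeInt(v: unknown): boolean {
  return typeof v === "number" && isFinite(v) && Math.floor(v) === v && v >= 0;
}

// Collects every problem first, then throws a single error whose message
// carries the prefix used in the PR plus the serialized issue map.
function validateDriveSearchParams(input: Record<string, unknown>): DriveSearchParams {
  const issues: Record<string, string> = {};

  const assetPath = input["assetPath"];
  // Fallback values are assumptions taken from the sample request, not API defaults.
  const offset = input["offset"] === undefined ? 0 : input["offset"];
  const maxResults = input["maxResults"] === undefined ? 20 : input["maxResults"];

  if (typeof assetPath !== "string" || assetPath.length === 0) {
    issues["assetPath"] = "must be a non-empty string";
  }
  if (!isNonNegativeInt(offset)) {
    issues["offset"] = "must be a non-negative integer";
  }
  if (!isNonNegativeInt(maxResults)) {
    issues["maxResults"] = "must be a non-negative integer";
  }

  if (Object.keys(issues).length > 0) {
    throw new Error("Invalid drive search parameters: " + JSON.stringify(issues));
  }
  // Safe casts: each value was checked above and we only get here with no issues.
  return {
    assetPath: assetPath as string,
    offset: offset as number,
    maxResults: maxResults as number,
  };
}
```

Whatever name the team settles on ("drive", "asset", or plain "search"), keeping the prefix in one place makes the rename a one-line change.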
| - filters.filterFolders (boolean) | ||
| If true, excludes folders from results. | ||
| "+Blog.body:("business" && "Apple")" | ||
| - showFolders (boolean) | ||
| If true, explicitly includes folders in results. |
WHAT! If this endpoint works like this, we need to fix it. We can't have two params for including/excluding folders...
It is actually how the endpoint works -- Sample search from the interface below:
{
"assetPath": "//BankTech/",
"includeSystemHost": true,
"filters": {
"text": "",
"filterFolders": true
},
"contentTypes": [
"FileAsset",
"htmlpageasset"
],
"baseTypes": [
"FILEASSET",
"HTMLPAGE"
],
"offset": 0,
"maxResults": 20,
"sortBy": "modDate:desc",
"archived": false,
"showFolders": false
}
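One way the MCP server could hide the two folder flags from the LLM is a small request builder that derives both from a single switch. The sketch below is hypothetical: the interface is inferred from the sample request above rather than from an official schema, and `buildDriveSearchRequest` and its `includeFolders` option are names invented for this example.

```typescript
// Request body shape inferred from the sample above; not an official schema.
interface DriveSearchRequest {
  assetPath: string;
  includeSystemHost: boolean;
  filters: { text: string; filterFolders: boolean };
  contentTypes: string[];
  baseTypes: string[];
  offset: number;
  maxResults: number;
  sortBy: string;
  archived: boolean;
  showFolders: boolean;
}

// Hypothetical options object; `includeFolders` is the single switch.
interface DriveSearchOptions {
  text?: string;
  contentTypes?: string[];
  baseTypes?: string[];
  offset?: number;
  maxResults?: number;
  includeFolders?: boolean;
}

function buildDriveSearchRequest(
  assetPath: string,
  opts: DriveSearchOptions = {}
): DriveSearchRequest {
  const includeFolders = opts.includeFolders === true;
  return {
    assetPath,
    includeSystemHost: true, // value copied from the sample request
    filters: {
      text: opts.text === undefined ? "" : opts.text,
      filterFolders: !includeFolders, // true excludes folders
    },
    // Empty lists are an assumption; the sample always sends explicit types.
    contentTypes: opts.contentTypes === undefined ? [] : opts.contentTypes,
    baseTypes: opts.baseTypes === undefined ? [] : opts.baseTypes,
    offset: opts.offset === undefined ? 0 : opts.offset,
    maxResults: opts.maxResults === undefined ? 20 : opts.maxResults,
    sortBy: "modDate:desc", // value copied from the sample request
    archived: false,
    showFolders: includeFolders, // true explicitly includes folders
  };
}
```

With this shape the two flags can never contradict each other, which addresses the reviewer's concern on the client side even if the endpoint keeps both parameters.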
| ---------------------------------------- | ||
| Use square brackets with TO for a value range. Use the strict date format "yyyyMMddHHmmss" for dates. | ||
| Returns the raw Drive Search API response. |
I think it is confusing that we call this "Drive" search. We need to be consistent: call it asset search or something similar, and explicitly tell the LLM what an "asset" is.
Whhaaattttt, continuity? Crazy talk.
| - If results include multiple baseTypes (HTMLPAGE, FILEASSET, CONTENT, etc.), | ||
| group them by baseType unless the user asks otherwise. |
Why the grouping?
Group them for visual recognition. Without grouping, if you ask the LLM to list the assets/info it will throw them out willy-nilly.
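If the grouping were ever moved out of the prompt and done in code, it could look roughly like this. It is a sketch only: it assumes each result item carries a string `baseType` field, and `groupByBaseType` is a hypothetical helper, not something in the PR.

```typescript
// Items are treated as loose string-keyed records, like the raw response.
type DriveItem = Record<string, unknown>;

// Buckets items by their `baseType` field (assumed field name).
// Items without a string baseType land in an "UNKNOWN" bucket.
function groupByBaseType(items: DriveItem[]): Record<string, DriveItem[]> {
  const groups: Record<string, DriveItem[]> = {};
  for (const item of items) {
    const raw = item["baseType"];
    const key = typeof raw === "string" ? raw : "UNKNOWN";
    const bucket = groups[key];
    if (bucket === undefined) {
      groups[key] = [item];
    } else {
      bucket.push(item);
    }
  }
  return groups;
}
```

Order within each bucket follows the order of the input, so the `sortBy` applied by the search is preserved inside every group.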
| When writing dotCMS Lucene queries, always use the correct content type and field format, explicit operators, escape special characters, stick to the strict date format, and do not generate raw OpenSearch JSON unless you really need advanced features. | ||
| MENTAL MODEL: | ||
| The MCP server is the source of truth. | ||
| The agent's role is to faithfully expose its results, not reinterpret them. |
Are we sure we want this? A human might ask for an interpretation of the results, like "how many contents do I have about XYZ topic" or something similar.
The question you pose will still work because it parses the text to find the results. Without being told not to reinterpret them, it hallucinates results or gives the responses it 'thinks' you want rather than using the data presented to it.
I was trying to find the example that got me to this instruction, but it is no longer available. Basically I told the search to tell me every file that contained the word "Adobe" in it. The LLM interpreted that to mean I just wanted to know the 'Pages' that had the word "Adobe" and so only returned the list of 5 or so pages, and none of the assets even though it had them in the temp file.
| * Drive Search response schema | ||
| * Matches the structure returned by /api/v1/drive/search | ||
| */ | ||
| export const DriveSearchItemSchema = z.record(z.string(), z.unknown()); |
Big no... drive/search returns contentlets, and we had a contentlet schema. Don't delete that.
It returns a slightly different schema than the contentlet schema (it differs by about 5 fields). I generalized it because the LLM doesn't need an exact mapping to work; it reads directly from the response JSON. If you want an exact schema returned, I can map the one returned by the content drive, or go back to the contentlet schema. The new drive schema does have some nice features, like all the workflows, permissions, and metadata available to that contentlet/asset.
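The trade-off of the generalized schema can be seen in plain TypeScript. The sketch below roughly mirrors what a loose string-keyed record gives you: any object passes, but every field has to be narrowed at the point of use. `isDriveSearchItem` and `readString` are hypothetical helpers, and the `baseType` and `title` keys in the usage are assumed field names.

```typescript
// Loose item type: any string-keyed fields are accepted, none are guaranteed.
type DriveSearchItem = Record<string, unknown>;

// Accepts any non-null, non-array object.
function isDriveSearchItem(value: unknown): value is DriveSearchItem {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Because values are `unknown`, callers must narrow before using a field.
function readString(item: DriveSearchItem, key: string): string | undefined {
  const value = item[key];
  return typeof value === "string" ? value : undefined;
}

// Keeps only the entries that look like items; everything else is dropped.
function toDriveSearchItems(raw: unknown[]): DriveSearchItem[] {
  const items: DriveSearchItem[] = [];
  for (const entry of raw) {
    if (isDriveSearchItem(entry)) {
      items.push(entry);
    }
  }
  return items;
}

// Usage: a missing or mistyped field yields undefined, not a validation error.
const sampleItems = toDriveSearchItems([{ baseType: "HTMLPAGE", title: 42 }, "not an item"]);
const sampleBaseType = readString(sampleItems[0], "baseType");
const sampleTitle = readString(sampleItems[0], "title");
```

A stricter contentlet-style schema would instead reject the mistyped `title` at parse time, which is the behavior the reviewer is asking to keep.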
| "strict": true, | ||
| "esModuleInterop": true | ||
| "esModuleInterop": true, | ||
| "importHelpers": false |
What does this do?
esModuleInterop lets you import older CommonJS libraries with ES module import syntax, just to make life a little easier.
importHelpers is set to false so you don't need the extra tslib dependency: the helpers stay inline in the compiled files instead of being imported from somewhere else.
I can remove them if you'd rather.
Proposed Changes
Updated the MCP server to use the new Content Drive API instead of having the LLM craft Lucene queries.
Closes: #34201
Checklist
Additional Info
Updates create a more structured input/output for the LLM to consume.
Screenshots