diff --git a/crowdsec-docs/docs/expr/file_helpers.md b/crowdsec-docs/docs/expr/file_helpers.md
index e4b4b7d22..aef65a7b0 100644
--- a/crowdsec-docs/docs/expr/file_helpers.md
+++ b/crowdsec-docs/docs/expr/file_helpers.md
@@ -20,4 +20,43 @@ Returns the content of `FileName` as an array of string, while providing cache m
 
 Returns `true` if the `StringToMatch` is matched by one of the expressions contained in `FileName` (uses RE2 regexp engine).
 
-> `RegexpInFile( evt.Enriched.reverse_dns, 'my_legit_seo_whitelists.txt')`
\ No newline at end of file
+> `RegexpInFile( evt.Enriched.reverse_dns, 'my_legit_seo_whitelists.txt')`
+
+## Map file helpers
+
+Map file helpers work with JSON-lines files loaded with `type: map` in the [data property](/log_processor/scenarios/format.md#data). Each line in the file is a JSON object with three required fields:
+
+- `pattern`: the value to match against
+- `tag`: the label returned on match
+- `type`: one of `equals` (exact match), `contains` (substring match), or `regex` (RE2 regular expression)
+
+Example map file:
+
+```json
+{"pattern": "/wp-admin/", "tag": "WordPress", "type": "contains"}
+{"pattern": "/specific/endpoint.php", "tag": "SpecificApp", "type": "equals"}
+{"pattern": "/wp-content/plugins/[^/]+/readme\\.txt", "tag": "WordPress-Plugin", "type": "regex"}
+```
+
+Comments (lines starting with `#`) and blank lines are ignored.
+
+### `FileMap(FileName) []map[string]string`
+
+Returns the content of `FileName` as an array of maps. Each element is a map with the keys from the JSON object (`pattern`, `tag`, `type`).
+
+> `FileMap('app_signatures.json')`
+
+> `any(FileMap('app_signatures.json'), { #.tag == 'WordPress' })`
+
+### `LookupFile(StringToMatch, FileName) string`
+
+Searches the map file `FileName` for a match against `StringToMatch`. Returns the `tag` of the first matching entry, or an empty string if no match is found.
+
+Matching is performed in priority order:
+1. **Exact match** (`equals` entries) — O(1) hash map lookup
+2. **Substring match** (`contains` entries) — using [Aho-Corasick](https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm) automaton
+3. **Regex match** (`regex` entries) — RE2 regular expressions
+
+> `LookupFile(evt.Parsed.request, 'app_signatures.json')`
+
+> `LookupFile(evt.Parsed.request, 'app_signatures.json') != ''`
\ No newline at end of file
diff --git a/crowdsec-docs/docs/log_processor/parsers/format.md b/crowdsec-docs/docs/log_processor/parsers/format.md
index f81ace5e5..110aef98f 100644
--- a/crowdsec-docs/docs/log_processor/parsers/format.md
+++ b/crowdsec-docs/docs/log_processor/parsers/format.md
@@ -393,17 +393,18 @@ statics:
 data:
   - source_url: https://URL/TO/FILE
     dest_file: LOCAL_FILENAME
-    type: (regexp|string)
+    type: (regexp|string|map)
 ```
 
 `data` allows user to specify an external source of data.
 This section is only relevant when `cscli` is used to install parser from hub, as it will download the `source_url` and store it to `dest_file`. When the parser is not installed from the hub, CrowdSec won't download the URL, but the file must exist for the parser to be loaded correctly.
 
-The `type` is mandatory if you want to evaluate the data in the file, and should be `regex` for valid (re2) regular expression per line or `string` for string per line.
+The `type` is mandatory if you want to evaluate the data in the file, and should be `regex` for valid (re2) regular expression per line, `string` for string per line, or `map` for JSON-lines map files (see [map file helpers](/expr/file_helpers.md#map-file-helpers)).
+
 The regexps will be compiled, the strings will be loaded into a list and both will be kept in memory.
 Without specifying a `type`, the file will be downloaded and stored as file and not in memory.
 
-You can refer to the content of the downloaded file(s) by using either the `File()` or `RegexpInFile()` function in an expression:
+You can refer to the content of the downloaded file(s) by using `File()`, `RegexpInFile()`, `FileMap()`, or `LookupFile()` in an expression:
 
 ```yaml
 filter: 'evt.Meta.log_type in ["http_access-log", "http_error-log"] and any(File("backdoors.txt"), { evt.Parsed.request contains #})'
diff --git a/crowdsec-docs/docs/log_processor/scenarios/format.md b/crowdsec-docs/docs/log_processor/scenarios/format.md
index d7e8ff165..5e42120db 100644
--- a/crowdsec-docs/docs/log_processor/scenarios/format.md
+++ b/crowdsec-docs/docs/log_processor/scenarios/format.md
@@ -612,7 +612,7 @@ If the `cancel_on` expression returns true, the bucket is immediately destroyed
 data:
   - source_url: https://URL/TO/FILE
     dest_file: LOCAL_FILENAME
-    [type: (regexp|string)]
+    [type: (regexp|string|map)]
 ```
 
 :::info
@@ -628,12 +628,12 @@ The `source_url` section is only relevant when `cscli` is used to install scenar
 
 When the scenario is not installed from the hub, CrowdSec will not download the `source_url`, however, if the file exists at `dest_file` within the data directory it will still be loaded into memory.
 
-The `type` is mandatory if you want to evaluate the data in the file, and should be `regex` for valid (re2) regular expression per line or `string` for string per line.
+The `type` is mandatory if you want to evaluate the data in the file, and should be `regex` for valid (re2) regular expression per line, `string` for string per line, or `map` for JSON-lines map files (see [map file helpers](/expr/file_helpers.md#map-file-helpers)).
 
 The regexps will be compiled, the strings will be loaded into a list and both will be kept in memory.
 Without specifying a `type`, the file will be downloaded and stored as file and not in memory.
 
-You can refer to the content of the downloaded file(s) by using either the `File()` or `RegexpInFile()` function in an expression:
+You can refer to the content of the downloaded file(s) by using `File()`, `RegexpInFile()`, `FileMap()`, or `LookupFile()` in an expression:
 
 ```yaml
 filter: 'evt.Meta.log_type in ["http_access-log", "http_error-log"] and any(File("backdoors.txt"), { evt.Parsed.request contains #})'
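
The `LookupFile` priority order documented in this patch (exact match, then substring, then regex) can be illustrated with a small Python sketch. This is not CrowdSec's implementation, only a model of the documented behaviour under several assumptions: `load_map` and `lookup` are hypothetical names, a plain substring scan stands in for the Aho-Corasick automaton, Python's `re` stands in for RE2, regex entries are assumed to match anywhere in the string, and ties within one tier are resolved by file order.

```python
import json
import re

# Sample map content taken from the documentation example, plus a comment
# line and a blank line to show that both are skipped.
SAMPLE = r'''
# application signatures
{"pattern": "/wp-admin/", "tag": "WordPress", "type": "contains"}
{"pattern": "/specific/endpoint.php", "tag": "SpecificApp", "type": "equals"}

{"pattern": "/wp-content/plugins/[^/]+/readme\\.txt", "tag": "WordPress-Plugin", "type": "regex"}
'''


def load_map(text):
    """Parse JSON-lines map content into three tiers (hypothetical helper)."""
    exact, contains, regexes = {}, [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments are ignored
        entry = json.loads(line)
        pattern, tag, kind = entry["pattern"], entry["tag"], entry["type"]
        if kind == "equals":
            exact.setdefault(pattern, tag)  # assumption: first entry wins
        elif kind == "contains":
            contains.append((pattern, tag))
        elif kind == "regex":
            regexes.append((re.compile(pattern), tag))
        else:
            raise ValueError(f"unknown entry type: {kind!r}")
    return exact, contains, regexes


def lookup(value, table):
    """Return the tag of the first match in priority order, or ''."""
    exact, contains, regexes = table
    if value in exact:                 # 1. exact match: dictionary lookup
        return exact[value]
    for pattern, tag in contains:      # 2. substring match (linear scan here)
        if pattern in value:
            return tag
    for rx, tag in regexes:            # 3. regex match, unanchored by assumption
        if rx.search(value):
            return tag
    return ""


TABLE = load_map(SAMPLE)
```

With this table, `lookup("/specific/endpoint.php", TABLE)` is resolved by the exact tier, while a request path that only contains `/wp-admin/` falls through to the substring tier. An empty string signals no match, mirroring the `!= ''` check in the documentation example.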