medspacy.common.regex_matcher
RegexMatcher
The RegexMatcher is an alternative to spaCy's native Matcher and PhraseMatcher classes. It matches regular expressions against the underlying doc text rather than against spaCy token attributes.
This allows more traditional text matching, but it can cause problems when a matched span
in the text does not line up with spaCy token boundaries. In that case, the RegexMatcher by default resolves the match to
the nearest token boundaries by expanding to the left and right. This behavior can be configured using
resolve_start and resolve_end. To avoid the problem altogether, consider using token-based patterns (a list of dicts), as in a spaCy Matcher.
For more information, see: https://spacy.io/usage/rule-based-matching
Examples of resolve_start/resolve_end: In the string 'SERVICE: Radiology', the pattern 'ICE: Rad' matches from the middle of the token 'SERVICE' to the middle of the token 'Radiology'. spaCy would normally return None for such a span. The RegexMatcher will expand it in the following ways:

- resolve_start='left': The resulting span will start at 'SERVICE' -> 'SERVICE: Radiology'
- resolve_start='right': The resulting span will start at ':' -> ': Radiology'
- resolve_end='left': The resulting span will end at ':' -> 'SERVICE:'
- resolve_end='right': The resulting span will end at 'Radiology' -> 'SERVICE: Radiology'
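
The boundary resolution described above can be illustrated with a minimal, standard-library-only sketch. It is not medspaCy's implementation: the `tokenize` helper is a toy stand-in for spaCy's tokenizer, and `resolve_match` is a hypothetical function that works on character offsets and returns text rather than token indices.

```python
import re

def tokenize(text):
    """Toy tokenizer (an assumption, not spaCy's): words and single punctuation marks."""
    return [(m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", text)]

def resolve_match(text, pattern, resolve_start="left", resolve_end="right"):
    """Snap the first regex match in `text` to the surrounding token boundaries."""
    match = re.search(pattern, text, flags=re.IGNORECASE)
    if match is None:
        return None
    tokens = tokenize(text)
    starts = [s for s, _ in tokens]
    ends = [e for _, e in tokens]
    # Start: nearest token start at or to the left/right of the match start.
    if resolve_start == "left":
        start = max((s for s in starts if s <= match.start()), default=None)
    else:
        start = min((s for s in starts if s >= match.start()), default=None)
    # End: nearest token end at or to the left/right of the match end.
    if resolve_end == "left":
        end = max((e for e in ends if e <= match.end()), default=None)
    else:
        end = min((e for e in ends if e >= match.end()), default=None)
    if start is None or end is None or start >= end:
        return None
    return text[start:end]

text = "SERVICE: Radiology"
print(resolve_match(text, r"ICE: Rad"))                         # SERVICE: Radiology
print(resolve_match(text, r"ICE: Rad", resolve_start="right"))  # : Radiology
print(resolve_match(text, r"ICE: Rad", resolve_end="left"))     # SERVICE:
```

Each call changes only one of the two settings and leaves the other at its default, which mirrors how the four cases above are described.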
Source code in medspacy/common/regex_matcher.py
__call__(doc)
Call the RegexMatcher on a spaCy Doc.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| doc | Doc | The spaCy doc to process. | required |
Returns:
| Type | Description |
|---|---|
| List[Tuple[int, int, int]] | The list of match tuples (match_id, start, end). |
__init__(vocab, flags=re.IGNORECASE, resolve_start='left', resolve_end='right')
Creates a new RegexMatcher.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| vocab | Vocab | A spaCy model vocabulary. | required |
| flags | RegexFlag | Regular expression flag. Default re.IGNORECASE. | re.IGNORECASE |
| resolve_start | str | How to resolve if the start character index of a match does not align with spaCy token boundaries. If 'left', will find the nearest token boundary to the left of the unmatched character index, leading to a longer than expected span. If 'right', will find the nearest token boundary to the right of the unmatched character index, leading to a shorter than expected span. Default 'left'. | 'left' |
| resolve_end | str | How to resolve if the end character index of a match does not align with spaCy token boundaries. If 'left', will find the nearest token boundary to the left of the unmatched character index, leading to a shorter than expected span. If 'right', will find the nearest token boundary to the right of the unmatched character index, leading to a longer than expected span. Default 'right'. | 'right' |
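
To make the effect of the flags default concrete, here is a small sketch using only Python's `re` module. It assumes the matcher compiles each rule with the given flags, which this page does not state explicitly; the rule and text are illustrative.

```python
import re

text = "SERVICE: Radiology"
rule = r"radiology"  # lowercase rule, capitalized text

# With the documented default (re.IGNORECASE), case differences are ignored.
case_insensitive = re.compile(rule, flags=re.IGNORECASE)
# With flags=0, matching is case-sensitive.
case_sensitive = re.compile(rule, flags=0)

ci_match = case_insensitive.search(text)
cs_match = case_sensitive.search(text)

ci_text = ci_match.group(0) if ci_match else None
print(ci_text)   # Radiology
print(cs_match)  # None
```

If case matters for a rule set, passing a different flags value at construction time is presumably the place to change it.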
add(match_id, regex_rules, on_match=None)
Add a rule with one or more regex patterns to one match id.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| match_id | str | The name of the pattern. | required |
| regex_rules | Iterable[str] | The list of regex strings to associate with match_id. | required |
| on_match | Optional[Callable[[Matcher, Doc, int, List[Tuple[int, int, int]]], Any]] | An optional callback function or other callable which takes 4 arguments: (matcher, doc, i, matches). | None |
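
The relationship between add, the rules registered under one match id, and the four-argument on_match callback can be sketched with a toy class. `ToyRegexMatcher` and `record` are hypothetical names, not part of medspaCy. The toy also operates on a plain string, keeps string ids, and returns character offsets, whereas the table above documents a Doc input and integer tuples.

```python
import re

class ToyRegexMatcher:
    """Hypothetical stand-in that mimics the add/on_match flow on plain strings."""

    def __init__(self, flags=re.IGNORECASE):
        self.flags = flags
        self._patterns = {}   # match_id -> list of compiled patterns
        self._callbacks = {}  # match_id -> callback or None

    def add(self, match_id, regex_rules, on_match=None):
        # Several regex rules can share one match id.
        compiled = [re.compile(rule, self.flags) for rule in regex_rules]
        self._patterns.setdefault(match_id, []).extend(compiled)
        self._callbacks[match_id] = on_match

    def __call__(self, text):
        matches = [
            (match_id, m.start(), m.end())
            for match_id, patterns in self._patterns.items()
            for pattern in patterns
            for m in pattern.finditer(text)
        ]
        # Each callback receives (matcher, text, i, matches), where i indexes
        # the match that triggered it.
        for i, (match_id, _, _) in enumerate(matches):
            on_match = self._callbacks.get(match_id)
            if on_match is not None:
                on_match(self, text, i, matches)
        return matches

seen = []

def record(matcher, text, i, matches):
    match_id, start, end = matches[i]
    seen.append((match_id, text[start:end]))

matcher = ToyRegexMatcher()
matcher.add("SERVICE", [r"radiology", r"cardiology"], on_match=record)
result = matcher("SERVICE: Radiology")
print(result)  # [('SERVICE', 9, 18)]
print(seen)    # [('SERVICE', 'Radiology')]
```

Passing the full list of matches together with the index, rather than a single match, lets a callback inspect or modify neighbouring matches, which is the same convention spaCy's rule-based matching documentation describes.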