medspacy.common.util
This module contains helper functions and classes for common clinical processing tasks that are used by medspaCy's matcher objects.
get_token_for_char(doc, char_idx, resolve='left')
Get the token index that best matches a particular character index. Because regex find returns a character index and spaCy matches must align with token boundaries, each character index must be converted into a token index.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `doc` | `Doc` | The spaCy Doc to search in. | required |
| `char_idx` | `int` | The character index to find the corresponding token for. | required |
| `resolve` | `str` | The resolution type. "left" snaps the character index to the token on its left, i.e. the token that precedes the character. | `'left'` |
Returns:
| Type | Description |
|---|---|
| `Union[Token, None]` | The token that best fits the character index based on the resolution type. |
Source code in medspacy/common/util.py
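The character-to-token conversion described above can be illustrated with a standalone sketch. It does not call medspaCy or spaCy: `whitespace_token_offsets` and `token_index_for_char_left` are hypothetical names, tokens are approximated by whitespace splitting, and the "left" behavior is interpreted (as an assumption) as "the last token that starts at or before the character index".

```python
import re
from bisect import bisect_right
from typing import List, Optional, Tuple


def whitespace_token_offsets(text: str) -> List[Tuple[int, int]]:
    """Hypothetical stand-in for a tokenizer: (start_char, end_char) per token."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def token_index_for_char_left(
    offsets: List[Tuple[int, int]], char_idx: int
) -> Optional[int]:
    """Return the index of the last token starting at or before char_idx.

    Assumption: this mirrors a "left" resolution; it returns None when no
    token starts at or before the character index.
    """
    starts = [start for start, _ in offsets]
    i = bisect_right(starts, char_idx) - 1
    return i if i >= 0 else None


text = "chest pain  noted"
offsets = whitespace_token_offsets(text)

# A regex search yields character positions, which must be mapped to tokens.
hit = re.search(r"pain\s+noted", text)
start_token = token_index_for_char_left(offsets, hit.start())
```

Here the regex hit starts inside the second token, so `start_token` is the index of that token; a character index that falls in the whitespace between two tokens snaps to the token on its left.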
matches_to_spans(doc, matches, set_label=True)
Converts all identified matches to spans.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `doc` | `Doc` | The spaCy Doc corresponding to the matches. | required |
| `matches` | `List[Tuple[int, int, int]]` | The list of match tuples (match_id, start, end). | required |
| `set_label` | `bool` | Whether to assign a label to the span based on the source rule. Default is True. | `True` |
Returns:
| Type | Description |
|---|---|
| `List[Span]` | A list of spaCy spans corresponding to the input matches. |
Source code in medspacy/common/util.py
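As a rough illustration of the conversion, the sketch below turns match tuples into labeled token slices without using spaCy. Everything in it is a stand-in: `SketchSpan` and `matches_to_spans_sketch` are hypothetical names, the token list replaces a `Doc`, and the `id_to_label` dictionary plays the role that spaCy's string store plays when a match ID is resolved to a rule name. How medspaCy actually derives the label is not shown in this page.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence, Tuple


@dataclass
class SketchSpan:
    """Hypothetical stand-in for a spaCy Span: a token slice plus a label."""
    start: int
    end: int
    text: str
    label: Optional[str] = None


def matches_to_spans_sketch(
    tokens: Sequence[str],
    matches: List[Tuple[int, int, int]],
    id_to_label: Dict[int, str],
    set_label: bool = True,
) -> List[SketchSpan]:
    """Build one span per (match_id, start, end) tuple; end is exclusive."""
    spans = []
    for match_id, start, end in matches:
        # Assumption: the label comes from looking up the match ID.
        label = id_to_label.get(match_id) if set_label else None
        spans.append(SketchSpan(start, end, " ".join(tokens[start:end]), label))
    return spans


tokens = ["no", "evidence", "of", "pneumonia"]
matches = [(101, 3, 4)]
labeled = matches_to_spans_sketch(tokens, matches, {101: "CONDITION"})
unlabeled = matches_to_spans_sketch(tokens, matches, {101: "CONDITION"}, set_label=False)
```

With `set_label=False` the same slices are produced, but no label is attached.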
overlaps(a, b)
Checks whether two match tuples produced by spaCy matchers overlap.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `a` | `Tuple[int, int, int]` | A match tuple (match_id, start, end). | required |
| `b` | `Tuple[int, int, int]` | A match tuple (match_id, start, end). | required |
Returns:
| Type | Description |
|---|---|
| `bool` | Whether the tuples overlap. |
Source code in medspacy/common/util.py
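A minimal sketch of such an overlap check follows. It is not medspaCy's implementation: `overlaps_sketch` is a hypothetical name, and the code assumes that `end` is exclusive (as in spaCy matcher output), so two matches that merely touch are not treated as overlapping.

```python
from typing import Tuple

Match = Tuple[int, int, int]  # (match_id, start, end)


def overlaps_sketch(a: Match, b: Match) -> bool:
    """Hypothetical helper: True if the token ranges of a and b intersect."""
    _, a_start, a_end = a
    _, b_start, b_end = b
    # Assumes end is exclusive: two half-open ranges intersect exactly
    # when each one starts before the other one ends.
    return a_start < b_end and b_start < a_end


print(overlaps_sketch((1, 0, 3), (2, 2, 5)))  # True: both cover token 2
print(overlaps_sketch((1, 0, 3), (2, 3, 5)))  # False: ranges only touch
```

The match ID is ignored, so two matches from different rules overlap whenever their token ranges do.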
prune_overlapping_matches(matches, strategy='longest')
Prunes overlapping matches from a list of spaCy match tuples (match_id, start, end).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `matches` | `List[Tuple[int, int, int]]` | A list of match tuples of the form (match_id, start, end). | required |
| `strategy` | `str` | The pruning strategy to use. Currently the only available option is "longest", which keeps the longer of any two overlapping spans. Other behavior will be added in a future update. | `'longest'` |
Returns:
| Type | Description |
|---|---|
| `List[Tuple[int, int, int]]` | The pruned list of matches. |
Source code in medspacy/common/util.py
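One way to realize a "longest" strategy is a greedy pass over the matches, sketched below. This is an illustration under assumptions, not the library's code: `prune_longest_sketch` is a hypothetical name, `end` is taken to be exclusive, and ties between equally long matches are broken in favor of the earlier start, which may differ from medspaCy's behavior.

```python
from typing import List, Tuple

Match = Tuple[int, int, int]  # (match_id, start, end), end exclusive


def prune_longest_sketch(matches: List[Match]) -> List[Match]:
    """Greedily keep the longest matches that do not overlap each other."""
    # Visit longer matches first; ties go to the earlier start (assumption).
    by_length = sorted(matches, key=lambda m: (-(m[2] - m[1]), m[1]))
    kept: List[Match] = []
    for match in by_length:
        _, start, end = match
        # Keep the match only if it is disjoint from everything kept so far.
        if all(end <= k_start or k_end <= start for _, k_start, k_end in kept):
            kept.append(match)
    # Return the survivors in document order.
    return sorted(kept, key=lambda m: m[1])


pruned = prune_longest_sketch([(1, 0, 2), (2, 1, 5), (3, 6, 7)])
print(pruned)  # [(2, 1, 5), (3, 6, 7)]
```

The first match is dropped because it overlaps the longer second match, while the third match is untouched because it overlaps nothing.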
span_contains(span, target, regex=True, case_insensitive=True)
Return True if a Span object contains a target phrase.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `span` | `Union[Doc, Span]` | A spaCy Doc or Span, such as an entity in doc.ents. | required |
| `target` | `str` | A target phrase or iterable of phrases to check in span.text.lower(). | required |
| `regex` | `bool` | Whether to search the span using a regular expression rather than a literal string. Default is True. | `True` |
| `case_insensitive` | `bool` | Whether the matching is case-insensitive. Default is True. | `True` |
Source code in medspacy/common/util.py
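The interaction of the `regex` and `case_insensitive` flags can be sketched on plain strings. The function below is a hypothetical stand-in (`text_contains_sketch` is not part of medspaCy) that operates on a string where the real function receives a Doc or Span, and it assumes that an iterable of targets means "match if any phrase is found".

```python
import re
from typing import Iterable, Union


def text_contains_sketch(
    text: str,
    target: Union[str, Iterable[str]],
    regex: bool = True,
    case_insensitive: bool = True,
) -> bool:
    """Return True if any target phrase is found in text."""
    targets = [target] if isinstance(target, str) else list(target)
    flags = re.IGNORECASE if case_insensitive else 0
    for phrase in targets:
        # With regex=False the phrase is escaped so it matches literally.
        pattern = phrase if regex else re.escape(phrase)
        if re.search(pattern, text, flags):
            return True
    return False


print(text_contains_sketch("History of Type 2 DM", r"type \d"))      # True
print(text_contains_sketch("temp 9806", "98.6"))                      # True: "." is a wildcard
print(text_contains_sketch("temp 9806", "98.6", regex=False))         # False: literal match
print(text_contains_sketch("Diabetes", "diabetes", case_insensitive=False))  # False
```

The second and third calls show why the `regex` flag matters for targets that contain regular expression metacharacters such as `.`.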