-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add coordination ruler #13337
base: master
Are you sure you want to change the base?
add coordination ruler #13337
Conversation
Thanks! Really excited to have something like this in the library.
I think one construction that will be especially useful to people is coordination of modifiers in noun phrases. This could be coordination of adjectives, or nouns themselves. Section 2.2 of this thesis has a nice background on one type of construction that will be important to think about, compound nouns: https://www.researchgate.net/profile/Mark-Lauer-2/publication/2784243_Designing_Statistical_Language_Learners_Experiments_on_Noun_Compounds/links/53f9ccf60cf2e3cbf5604ec4/Designing-Statistical-Language-Learners-Experiments-on-Noun-Compounds.pdf In general we'd like to detect and process stuff like "green and red apples" into "green apples" and "red apples". But we can have deeper nesting than that: stuff like "hot and cold chicken soup", which ends up as "hot chicken soup" and "cold chicken soup". Ultimately we're going to trust the tree structure in the parser (which isn't always fantastic on these things, due to limitations in the training data annotation) but we still want to have some concept of the range of tree shapes so we can make the test cases for them. I would suggest first focussing on the cases where we have coordination inside a noun phrase. These will be the ones most useful for entity recognition. If we can enumerate the main construction cases we want to cover, we can then put together the target trees for them, and then test for those. For the tests, we definitely want to specify the dependency parse as part of the test case rather than letting it be predicted by the model. This way the test describes the tree, and also if we have different versions of the model the test doesn't break because it predicted something unexpected.
Yes the extensibility is definitely good. Arguably we also want to support matcher or dependency matcher patterns directly, but this could be done via a function that takes the patterns as an argument. |
spacy/pipeline/coordinationruler.py
Outdated
from typing import List, Callable, Optional, Union | ||
from pydantic import BaseModel, validator | ||
import re | ||
import en_core_web_sm |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'll want to find another solution for this, because we don't want to enforce all users to have exactly this model in their environment
from ..tokens import Doc | ||
from ..language import Language | ||
from ..vocab import Vocab | ||
from .pipe import Pipe |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you run isort
on all files? (the test suite will fail otherwise)
spacy/pipeline/coordinationruler.py
Outdated
def split_noun_coordination(doc: Doc) -> Union[List[str], None]: | ||
"""Identifies and splits phrases with multiple nouns, a modifier | ||
and a conjunction. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI @honnibal
return spacy.blank("en") | ||
|
||
|
||
### CONSTRUCTION CASES ### |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doesn't account for i.e. the "water and power meters and electrical sockets"
Description
This PR adds two files:
spacy/pipeline/coordinationruler.py
: This file contains 3 simple coordination splitting rules and acoordination_splitter
factory that allows user to add this as a pipe to use the default splitting rules or add their own.spacy/tests/pipeline/test_coordinationruler.py
: This file contains tests associated to each method for theCoordinationSplitter
class.It does NOT include anything for documentation as this will be added after the PR is more finalised.
A few questions:
skill spans
. Should I add additional generalisable splitting rules? There is also a very specific skill splitting function i.e. the tokenskill
must be at the end of phrase.Checklist