
add coordination ruler #13337

Draft · wants to merge 13 commits into master
Conversation

@india-kerle india-kerle commented Feb 19, 2024

Description

This PR adds two files:

  • spacy/pipeline/coordinationruler.py: This file contains 3 simple coordination splitting rules and a coordination_splitter factory that lets users add this as a pipe, using either the default splitting rules or their own.
  • spacy/tests/pipeline/test_coordinationruler.py: This file contains tests for each method of the CoordinationSplitter class.

It does NOT include any documentation, as this will be added once the PR is more finalised.

A few questions:

  • I've expanded the initial splitting rules slightly to generalise to full sentences rather than the original skill spans. Should I add additional generalisable splitting rules? There is also a very specific skill-splitting function, i.e. one where the skill token must be at the end of the phrase.
  • I made this a factory as opposed to a function component because I thought it would be nice for users to be able to add their own custom rules - thoughts?
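As a rough illustration of the factory-with-custom-rules idea, here is a minimal sketch. The component name, class name, and method signatures below are hypothetical, not the exact API in this PR:

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc


class CoordSplitterSketch:
    """Hypothetical pipe: holds a list of splitting rules that users
    can extend. Each rule is a callable that inspects a Doc."""

    def __init__(self):
        self.rules = []  # the PR would seed this with default rules

    def add_rule(self, rule):
        # user-supplied rules are appended after the defaults
        self.rules.append(rule)

    def __call__(self, doc: Doc) -> Doc:
        for rule in self.rules:
            rule(doc)
        return doc


@Language.factory("coord_splitter_sketch")
def make_coord_splitter(nlp, name):
    return CoordSplitterSketch()


nlp = spacy.blank("en")
nlp.add_pipe("coord_splitter_sketch")
```

A function component would also work, but the factory keeps per-pipe state (the rule list), which is what makes user extension straightforward.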

Checklist

  • I confirm that I have the right to submit this contribution under the project's MIT license.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@india-kerle india-kerle marked this pull request as ready for review February 19, 2024 12:49
@honnibal (Member)

Thanks! Really excited to have something like this in the library.

I've expanded the initial splitting rules very slightly to be more generalisable to full sentences and not the original skill spans. Should I add additional generalisable splitting rules? There is also a very specific skill splitting function i.e. the token skill must be at the end of phrase.

I think one construction that will be especially useful to people is coordination of modifiers in noun phrases. This could be coordination of adjectives, or nouns themselves. Section 2.2 of this thesis has a nice background on one type of construction that will be important to think about, compound nouns: https://www.researchgate.net/profile/Mark-Lauer-2/publication/2784243_Designing_Statistical_Language_Learners_Experiments_on_Noun_Compounds/links/53f9ccf60cf2e3cbf5604ec4/Designing-Statistical-Language-Learners-Experiments-on-Noun-Compounds.pdf

In general we'd like to detect and process stuff like "green and red apples" into "green apples" and "red apples". But we can have deeper nesting than that: stuff like "hot and cold chicken soup", which ends up as "hot chicken soup" and "cold chicken soup". Ultimately we're going to trust the tree structure in the parser (which isn't always fantastic on these things, due to limitations in the training data annotation) but we still want to have some concept of the range of tree shapes so we can make the test cases for them.
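The "hot and cold chicken soup" case can be sketched with a hand-built parse. The dependency labels follow spaCy's English scheme as an assumption, and the function is illustrative, not the PR's implementation:

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Hand-built parse for "hot and cold chicken soup":
# hot --amod--> soup, and --cc--> hot, cold --conj--> hot,
# chicken --compound--> soup
words = ["hot", "and", "cold", "chicken", "soup"]
heads = [4, 0, 0, 4, 4]
deps = ["amod", "cc", "conj", "compound", "ROOT"]
doc = Doc(nlp.vocab, words=words, heads=heads, deps=deps)


def expand_coordinated_modifiers(doc):
    """Sketch: emit one phrase per coordinated modifier, keeping any
    compound nouns attached to the head noun."""
    phrases = []
    for tok in doc:
        if tok.dep_ == "amod":
            noun = tok.head
            compounds = [c.text for c in noun.children if c.dep_ == "compound"]
            mods = [tok] + [c for c in tok.children if c.dep_ == "conj"]
            for mod in mods:
                phrases.append(" ".join([mod.text] + compounds + [noun.text]))
    return phrases


print(expand_coordinated_modifiers(doc))
# → ['hot chicken soup', 'cold chicken soup']
```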

I would suggest first focussing on the cases where we have coordination inside a noun phrase. These will be the ones most useful for entity recognition. If we can enumerate the main construction cases we want to cover, we can then put together the target trees for them, and then test for those. For the tests, we definitely want to specify the dependency parse as part of the test case rather than letting it be predicted by the model. This way the test describes the tree, and also if we have different versions of the model the test doesn't break because it predicted something unexpected.
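A test along these lines might look like the following sketch, where the parse is supplied by hand via the Doc constructor rather than predicted (the test name and assertions are illustrative):

```python
import pytest
import spacy
from spacy.tokens import Doc


@pytest.fixture
def nlp():
    # no trained model needed: the parse is specified by hand
    return spacy.blank("en")


def test_coordinated_amod_tree(nlp):
    # "green and red apples": the test pins down the tree shape, so
    # it can't break because a model predicted something unexpected.
    words = ["green", "and", "red", "apples"]
    heads = [3, 0, 0, 3]
    deps = ["amod", "cc", "conj", "ROOT"]
    doc = Doc(nlp.vocab, words=words, heads=heads, deps=deps)
    assert doc[0].head.i == 3     # green --amod--> apples
    assert doc[2].dep_ == "conj"  # red conjoined with green
    assert doc[2].head.i == 0
```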

I made this a factory as opposed to a function component because I thought it would be nice for users to be able to add their own custom rules - thoughts?

Yes the extensibility is definitely good. Arguably we also want to support matcher or dependency matcher patterns directly, but this could be done via a function that takes the patterns as an argument.
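A rule driven by DependencyMatcher patterns could look roughly like this sketch (the pattern and its node names are assumptions, shown for the "green and red apples" tree):

```python
import spacy
from spacy.matcher import DependencyMatcher
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Pattern: a noun with an amod child, where that modifier has a
# conjoined modifier hanging off it.
pattern = [
    {"RIGHT_ID": "noun", "RIGHT_ATTRS": {"DEP": "ROOT"}},
    {"LEFT_ID": "noun", "REL_OP": ">", "RIGHT_ID": "mod",
     "RIGHT_ATTRS": {"DEP": "amod"}},
    {"LEFT_ID": "mod", "REL_OP": ">", "RIGHT_ID": "conj_mod",
     "RIGHT_ATTRS": {"DEP": "conj"}},
]

matcher = DependencyMatcher(nlp.vocab)
matcher.add("COORD_MOD", [pattern])

words = ["green", "and", "red", "apples"]
doc = Doc(nlp.vocab, words=words, heads=[3, 0, 0, 3],
          deps=["amod", "cc", "conj", "ROOT"])
matches = matcher(doc)
for _, token_ids in matches:
    # token_ids correspond to the pattern's nodes: noun, mod, conj_mod
    print([doc[i].text for i in token_ids])
```

A splitting function could then take such patterns as an argument and build the expanded phrases from the matched token ids.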

@svlandeg added the labels enhancement (Feature requests and improvements) and feat / pipeline (Feature: Processing pipeline and components) on Feb 22, 2024
from typing import List, Callable, Optional, Union
from pydantic import BaseModel, validator
import re
import en_core_web_sm
Member:
We'll want to find another solution for this, because we don't want to force all users to have exactly this model in their environment.
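One way to avoid the hard model dependency, sketched here as a suggestion rather than the PR's chosen fix:

```python
import spacy

# Rather than requiring a specific packaged model via
#   import en_core_web_sm
#   nlp = en_core_web_sm.load()
# tests and defaults can build a blank English pipeline, which
# needs no model download:
nlp = spacy.blank("en")

# Where a dependency parse is needed, it can be supplied by hand
# on a Doc (words/heads/deps) instead of being predicted.
```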

@svlandeg svlandeg marked this pull request as draft February 26, 2024 09:37
from ..tokens import Doc
from ..language import Language
from ..vocab import Vocab
from .pipe import Pipe
Member:
Could you run isort on all files? (the test suite will fail otherwise)

Comment on lines 12 to 14
def split_noun_coordination(doc: Doc) -> Union[List[str], None]:
"""Identifies and splits phrases with multiple nouns, a modifier
and a conjunction.
Author:
FYI @honnibal

return spacy.blank("en")


### CONSTRUCTION CASES ###
Author:
doesn't account for cases like "water and power meters and electrical sockets"
