LFX Proposal: Multimodal Large Model Joint Learning Algorithm: Reproduction Based on KubeEdge-Ianvs #123 #163

Open
wants to merge 1 commit into
base: main

aryan0931

What type of PR is this?
/kind design

What this PR does / why we need it:

Proposal for LFX Project CNCF - Multimodal Large Model Joint Learning Algorithm: Reproduction Based on KubeEdge-Ianvs

Which issue(s) this PR fixes:

Fixes #123

@kubeedge-bot added the kind/design label (categorizes issue or PR as related to design) on Nov 12, 2024
@kubeedge-bot
Collaborator

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign moorezheng after the PR has been reviewed.
You can assign the PR to them by writing /assign @moorezheng in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kubeedge-bot
Collaborator

Welcome @aryan0931! It looks like this is your first PR to kubeedge/ianvs 🎉

@kubeedge-bot added the size/L label (denotes a PR that changes 100-499 lines, ignoring generated files) on Nov 12, 2024

**Implementation Detail**
```plaintext
├── testcasecontroller
Collaborator

Note that revisions to the core of ianvs, including controllers, are usually about adding new algorithm schemes, e.g., creating a scheme for lifelong learning.

In this proposal, since single-task learning already exists for large models, my suggestion is to prioritize adding examples, i.e., a new example of single-task learning, instead of changing the core of ianvs. Leaving the ianvs core unchanged avoids any impact on other examples, which also reduces the burden of implementation and review.

│ │ ├── base.py # Base class for algorithms
│ │ └── single_task_learning.py # Single-task learning algorithms
│ │ └── clip_model.py # Implementation of the CLIP model
│ ├── data_collection
Collaborator

@MooreZheng MooreZheng Nov 14, 2024

Multimodal data types are now mostly supported. We might make better use of the current data types, especially given the limited development time. Please refer to the detailed comments on dataset handling below.

│ │ ├── __init__.py
│ │ ├── multimodal_interface.py # Interface for multimodal data collection
│ │ └── preprocess.py # Preprocessing for text, audio, and images
│ ├── benchmark
Collaborator

Benchmark components such as metrics should live in examples instead of controllers. There are cases where metrics with the same name have different implementations in different scenarios, e.g., F1-score, BWT, etc.

│ ├── tests
│ │ ├── __init__.py
│ │ └── test_benchmark.py # Unit tests for benchmarking
│ └── main.py # Entry point for running the benchmark suite
Collaborator

"main.py" is forbidden in the core of ianvs and is not safe enough even as an example of ianvs.

As in the ianvs quick start, we recommend launching ianvs from its command line:
ianvs -f ./examples/pcb-aoi/singletask_learning_bench/benchmarkingjob.yaml


**Adding New Enums in `DatasetFormat`:**
```python
class DatasetFormat(Enum):
Collaborator

@MooreZheng MooreZheng Nov 14, 2024

Structured datasets are constructed using .csv. For unstructured data:

  1. Image and audio datasets are constructed using a data index, i.e., a .txt file of URLs.
  2. Natural-language datasets are constructed using .jsonl.

At the current stage, it is not a good idea to add more data types that would require code changes in sedna before ianvs. My suggestion is to make better use of the current implementation.

For your reference,

  1. Unstructured Data implementation using .txt:
  2. Unstructured Data implementation using .jsonl:

When necessary, @aryan0931 might refer to @IcyFeather233 for more usage information on data types of ianvs LLM benchmarks. The implementation from @IcyFeather233 has already been successfully used in several members' projects merged in ianvs recently.
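
To make the two existing unstructured formats concrete, here is a minimal sketch of storing image-text pairs either as a .jsonl file or as a .txt data index, using only the Python standard library. The field names (`image`, `text`), the file names, and the helper functions are hypothetical and chosen for illustration; they are not the ianvs or sedna schema, so the implementations referenced above remain the authority on the expected layout.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(path, records):
    """Write one JSON object per line (the .jsonl convention)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a .jsonl file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def write_index(path, pairs):
    """Write a .txt data index: one '<sample location> <annotation>' per line."""
    with open(path, "w", encoding="utf-8") as f:
        for location, annotation in pairs:
            f.write(f"{location} {annotation}\n")


def read_index(path):
    """Read a .txt data index back into (location, annotation) tuples."""
    with open(path, encoding="utf-8") as f:
        lines = f.read().splitlines()
    # Split only on the first whitespace run so annotations may contain spaces.
    return [tuple(line.split(maxsplit=1)) for line in lines if line.strip()]


# Hypothetical samples: the keys "image" and "text" are assumptions.
samples = [
    {"image": "images/0001.jpg", "text": "a cat on a sofa"},
    {"image": "images/0002.jpg", "text": "a dog in a park"},
]

with tempfile.TemporaryDirectory() as tmp:
    jsonl_path = Path(tmp) / "train_data.jsonl"   # hypothetical file name
    index_path = Path(tmp) / "train_index.txt"    # hypothetical file name
    write_jsonl(jsonl_path, samples)
    write_index(index_path, [(s["image"], s["text"]) for s in samples])
    loaded_records = read_jsonl(jsonl_path)
    loaded_index = read_index(index_path)
```

Both files carry the same pairs, which suggests a multimodal example could be expressed with the existing formats without adding a new `DatasetFormat` member.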

paradigms: [ "all" ] # Selects all paradigms
modules: [ "all" ] # Selects all modules
hyperparameters: [ "all" ] # Selects all hyperparameters
metrics:
Collaborator

As mentioned above, metrics should be implemented in examples to avoid impacts on other examples.

Metrics are also configured in ianvs examples through testenv.yaml. The ianvs documentation gives the following example:

# testenv.yaml
testenv:
...

# metric used for model evaluation
model_metric:
  # metric name; string type;
  name: "f1_score"
  # the url address of python file
  url: "./examples/pcb-aoi/incremental_learning_bench/testenv/f1_score.py"
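
As an illustration of why same-named metrics are better kept per example, the sketch below shows two plain-Python functions that could both reasonably be called "F1-score" yet return different values on the same predictions. The function names and the sample data are hypothetical, and the registration step that an ianvs metric file needs is deliberately omitted; the pcb-aoi metric file referenced above shows the actual form.

```python
def binary_f1(y_true, y_pred, positive=1):
    """F1 for a single positive class: 2*TP / (2*TP + FP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    # Convention chosen here: return 0.0 when the class never appears.
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 over all observed labels."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(binary_f1(y_true, y_pred, positive=c) for c in labels) / len(labels)


# Hypothetical labels and predictions, for illustration only.
y_true = [1, 1, 1, 0]
y_pred = [1, 1, 0, 0]

binary_score = binary_f1(y_true, y_pred)  # treats class 1 as the positive class
macro_score = macro_f1(y_true, y_pred)    # averages the F1 of class 0 and class 1
```

On this data the binary variant gives 0.8 while the macro-averaged variant gives roughly 0.733, so an example that ships its own metric file and points `url` at it keeps its choice explicit and cannot silently change the results of other examples.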

Collaborator

@MooreZheng MooreZheng left a comment

We see a DCO issue, which means the author of this commit failed to include a Signed-off-by line in the commit message.

A rebase is needed to fix this issue; see this link for more information.

@MooreZheng requested review from MooreZheng and hsj576 and removed request for jaypume on November 14, 2024 12:25