LFX Proposal: Multimodal Large Model Joint Learning Algorithm: Reproduction Based on KubeEdge-Ianvs #123 #163
**Implementation Detail**
```plaintext
├── testcasecontroller
Note that revisions to the core of ianvs, including controllers, are usually made to add new algorithm schemes, e.g., creating a scheme for lifelong learning.
In this proposal, since single-task learning already exists for large models, my suggestion is to prioritize adding examples, i.e., a new single-task learning example, instead of changing the core of ianvs. Leaving the ianvs core unrevised also avoids any impact on other examples, which eases the burden of implementation and review.
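As a rough illustration of the suggestion above, the following sketch shows the kind of self-contained algorithm module that could live inside a new example directory rather than in the ianvs core. The class name `KeywordMatchModel`, its methods, and the toy keyword matcher standing in for a real multimodal model are all hypothetical; the interface an actual ianvs example must expose should be checked against the existing single-task learning examples.

```python
class KeywordMatchModel:
    """Hypothetical example-level model wrapper (a toy stand-in, not CLIP)."""

    def __init__(self, **hyperparameters):
        # Hyperparameters would come from the example's own config file.
        self.labels = list(hyperparameters.get("labels", []))

    def train(self, train_data, valid_data=None, **kwargs):
        # train_data is assumed to be a list of (text, label) pairs.
        for _text, label in train_data:
            if label not in self.labels:
                self.labels.append(label)
        return self

    def predict(self, data, **kwargs):
        # Return the first known label mentioned in each text, else None.
        results = []
        for text in data:
            match = next((lb for lb in self.labels if lb in text), None)
            results.append(match)
        return results


model = KeywordMatchModel()
model.train([("a photo of a cat", "cat"), ("a dog running", "dog")])
predictions = model.predict(["the cat sleeps", "an empty street"])
```

Because everything here sits in one example-level file, other examples and the controllers are unaffected by it.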
│ │ ├── base.py # Base class for algorithms
│ │ ├── single_task_learning.py # Single-task learning algorithms
│ │ └── clip_model.py # Implementation of the CLIP model
│ ├── data_collection
Most multimodal data types are already supported. We might make better use of the current data types, especially under limited development time. Please refer to the detailed comments on dataset handling below.
│ │ ├── __init__.py
│ │ ├── multimodal_interface.py # Interface for multimodal data collection
│ │ └── preprocess.py # Preprocessing for text, audio, and images
│ ├── benchmark
Benchmark components such as metrics should live in examples instead of controllers. There are cases where metrics of the same name have different implementations in different scenarios, e.g., F1-score, BWT, etc.
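To illustrate why a metric name alone does not pin down an implementation, here is a minimal plain-Python sketch of two functions that could both reasonably be called "F1-score": a binary variant and a macro-averaged variant. The function names and sample labels are hypothetical and are not taken from ianvs.

```python
def f1_binary(y_true, y_pred, positive=1):
    """F1 computed for a single positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_macro(y_true, y_pred):
    """Unweighted mean of the per-class F1 over every class seen."""
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0
    return sum(f1_binary(y_true, y_pred, positive=c) for c in labels) / len(labels)


y_true = [1, 1, 0, 0]
y_pred = [1, 0, 0, 0]
binary_score = f1_binary(y_true, y_pred)  # considers class 1 only
macro_score = f1_macro(y_true, y_pred)    # averages class 0 and class 1
```

On the same predictions the two functions return different values, which is why keeping each metric next to the example that uses it avoids silent clashes.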
│ ├── tests
│ │ ├── __init__.py
│ │ └── test_benchmark.py # Unit tests for benchmarking
│ └── main.py # Entry point for running the benchmark suite
"main.py" is not allowed in the core of ianvs, and it is not safe enough even in an ianvs example.
As in the ianvs quick start, we recommend launching ianvs from its command line:
ianvs -f ./examples/pcb-aoi/singletask_learning_bench/benchmarkingjob.yaml
**Adding New Enums in `DatasetFormat`:**
```python
class DatasetFormat(Enum):
Structured datasets are constructed using .csv. For unstructured data:
- Image and audio datasets are constructed using a data index, i.e., a .txt file of URLs.
- Natural-language datasets are constructed using .jsonl.
At the current stage, it is not a good idea to add more data types, because that would require code changes in sedna before ianvs. My suggestion is to make better use of the current implementation.
For your reference:
- Unstructured data implementation using .txt:
  - An image example is ready in Sedna federated learning.
  - Video-input examples are ready in Sedna incremental learning and joint inference.
- Unstructured data implementation using .jsonl:
  - An NLP example from @IcyFeather233 is available in ianvs single-task learning for LLM, with proposal and implementation.
When necessary, @aryan0931 might refer to @IcyFeather233 for more usage information on the data types of ianvs LLM benchmarks. The implementation from @IcyFeather233 has already been used successfully in several members' projects recently merged into ianvs.
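The two unstructured formats mentioned above can be read with a few lines of standard-library Python, as in the sketch below. The index layout (one `path label` pair per line), the JSON field names `question` and `answer`, and the file names are assumptions made for illustration only; the layout actually expected by ianvs and sedna should be taken from the referenced examples.

```python
import json
import os
import tempfile


def read_txt_index(path):
    """Read a data index with one sample per line: '<file path or URL> <label>'."""
    samples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            # Assumed layout: the label is the last whitespace-separated token.
            file_path, label = line.rsplit(maxsplit=1)
            samples.append((file_path, label))
    return samples


def read_jsonl(path):
    """Read a .jsonl file: one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    index_path = os.path.join(tmp, "train_index.txt")
    with open(index_path, "w", encoding="utf-8") as f:
        f.write("images/0001.jpg 0\nimages/0002.jpg 1\n")

    jsonl_path = os.path.join(tmp, "train.jsonl")
    with open(jsonl_path, "w", encoding="utf-8") as f:
        f.write(json.dumps({"question": "What is shown?", "answer": "a cat"}) + "\n")

    index_samples = read_txt_index(index_path)
    text_samples = read_jsonl(jsonl_path)
```

A multimodal sample could plausibly combine both, for instance a .jsonl record whose fields point at image paths, without introducing a new dataset format.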
paradigms: [ "all" ] # Selects all paradigms
modules: [ "all" ] # Selects all modules
hyperparameters: [ "all" ] # Selects all hyperparameters
metrics:
As mentioned above, metrics should be implemented in examples to avoid impacts on other examples.
Metrics are also configured in ianvs examples, through testenv.yaml. An example from the ianvs documentation is as follows.
# testenv.yaml
testenv:
  ...
  # metric used for model evaluation
  model_metric:
    # metric name; string type;
    name: "f1_score"
    # the url address of python file
    url: "./examples/pcb-aoi/incremental_learning_bench/testenv/f1_score.py"
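For orientation, the Python file referenced by `url` might look roughly like the sketch below. It assumes that metrics are registered under their name through sedna's `ClassFactory` (as I understand the existing ianvs examples to do), and falls back to a no-op decorator when sedna is not installed so that the snippet runs on its own. The metric name `accuracy` and the function signature are hypothetical.

```python
try:
    # Assumption: sedna exposes ClassFactory.register(type, alias=...) as a decorator factory.
    from sedna.common.class_factory import ClassFactory, ClassType

    register = ClassFactory.register(ClassType.GENERAL, alias="accuracy")
except ImportError:
    # Fallback so the sketch runs without sedna: leave the function unchanged.
    def register(func):
        return func


@register
def accuracy(y_true, y_pred, **kwargs):
    """Fraction of predictions equal to the reference labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        return 0.0
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


score = accuracy([1, 0, 1], [1, 1, 1])
```

Under this assumption, the `name` in testenv.yaml would match the registered alias and the `url` would point at this file inside the example directory.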
We see a DCO issue, which means the author of this commit did not include a Signed-off-by line in the commit message.
A rebase is needed to fix this issue; see this link for more information.
What type of PR is this?
/kind design
What this PR does / why we need it:
Proposal for LFX Project CNCF - Multimodal Large Model Joint Learning Algorithm: Reproduction Based on KubeEdge-Ianvs
Which issue(s) this PR fixes:
Fixes #123