Commit 2368540

Merge v0.6 (#24)
1 parent 734295c commit 2368540

File tree

83 files changed: +3931 -988 lines changed


.gitignore

Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
+__pycache__
+*.egg-info
+*.so
+
+.vs/
+.vscode/
+
+.tox/
+.coverage
+.coverage.*
+htmlcov/
+
+benchmark/megatron/Megatron-LM
+benchmark/deepspeed/Megatron-DeepSpeed
+
+gencode*.py
+fullmodel.pt
+fullmodel.pt.*
+dist_param_map.pt
+
+docs/build/
+
+## autodist ##
+
+# Python cache
+*.pyc
+dist
+.cache
+*env
+
+# Generated by Cube
+gencode*
+*.pt
+
+# Other
+shelf
+*.iml
+*.xml
+
+# cppimport generated file
+.rendered.*.cpp
+.nnscaler/

README.md

Lines changed: 5 additions & 4 deletions
@@ -13,7 +13,8 @@ nnScaler is a parallelization engine that compiles a Deep neural network (DNN) m
 
 # Latest News
 nnScaler (also known as CUBE as code name) has been adopted by multiple product and research projects, this section includes some of the latest news from the team and partner projects.
-* **2024-11-26** nnScaler 0.5 released: https://github.com/microsoft/nnscaler/releases/tag/0.5
+* **2025-01-08** nnScaler 0.6 released: https://github.com/microsoft/nnscaler/releases/tag/0.6
+* **2024-10-07** Diff-Transformer utilizes nnScaler for differential attention mechanism: [DIFFERENTIAL TRANSFORMER](https://arxiv.org/abs/2410.05258)
 * **2024-05-09** YOCO utilizes nnScaler for long-sequence training: [(YOCO)You only cache once: Decoder-decoder architectures for language models](https://arxiv.org/abs/2405.05254)
 * **2024-04-22** Post training for the long context version of [Phi-3 series](https://arxiv.org/abs/2404.14219)
 * **2024-02-21** LongRoPE utilizes nnScaler to reduce both the training and inference costs: [LongRoPE: Extending LLM context window beyond 2 million tokens](https://arxiv.org/abs/2402.13753)
@@ -41,7 +42,7 @@ For **_DNN system experts_**, they can leverage nnScaler to explore new DNN para
 
 Install the following packages before the installation of nnScaler:
 
-Python >= 3.8, < 3.11 (3.10 is recommended)
+Python >= 3.9, < 3.11 (3.10 is recommended)
 
 PyTorch >= 2.0, < 2.4 (2.2.0 is recommended)
 
@@ -75,7 +76,7 @@ Obtain access of Llama-3 model from [HuggingFace](https://huggingface.co/meta-ll
 
 ### Code Changes for Parallelization
 
-You can find all the example code at `examples/llama3_8B_128K`. As shown below, a user needs to:
+You can find all the example code at `examples/llama`. As shown below, a user needs to:
 * Wrap the Model: Include loss computation and other necessary components.
 * Configure Components: Set up the model, optimizer, and dataloader.
 * Initialize and Start: In the main function, create an nnScaler trainer with the above configurations and start the training process.
@@ -135,7 +136,7 @@ def main(args):
 Then we can start the example, and all the parallelization tasks will be finished by nnScaler automatically.
 
 ```shell
-cd examples/llama3_8B_128K
+cd examples/llama
 
 # prepare training data:
 python bookcorpus.py --data_path_or_name bookcorpus/bookcorpus --tokenizer_path_or_name meta-llama/Meta-Llama-3-8B-Instruct --save_path ./bookcorpus_llama3_4K --sequence_length 4096
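The README's three steps (wrap the model, configure components, initialize and start) can be sketched schematically. Everything below is a hypothetical stand-in written for illustration only; `WrappedModel`, `TrainerConfig`, and `Trainer` are not the actual nnScaler API.

```python
# Schematic of the wrap / configure / initialize-and-start flow described above.
# Every class here is an illustrative stand-in, NOT the real nnScaler API.
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple


@dataclass
class WrappedModel:
    # Step 1: wrap the model so the forward pass also yields a loss.
    forward: Callable[[float], float]

    def loss(self, x: float, target: float) -> float:
        return (self.forward(x) - target) ** 2


@dataclass
class TrainerConfig:
    # Step 2: bundle model, optimizer settings, and dataloader in one config.
    model: WrappedModel
    lr: float
    data: Iterable[Tuple[float, float]]


class Trainer:
    # Step 3: the main function builds a trainer from the config and starts it.
    def __init__(self, cfg: TrainerConfig):
        self.cfg = cfg
        self.steps = 0

    def train(self) -> int:
        for x, y in self.cfg.data:
            self.cfg.model.loss(x, y)  # a real trainer would backprop here
            self.steps += 1
        return self.steps


cfg = TrainerConfig(model=WrappedModel(forward=lambda x: 2 * x), lr=0.1,
                    data=[(1.0, 2.0), (2.0, 4.0)])
print(Trainer(cfg).train())  # one step per sample
```

The point of the pattern is that the trainer only sees opaque configured components, which is what lets a parallelization engine rewrite the model and dataloader behind the same interface.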

dev.md

Lines changed: 95 additions & 0 deletions
@@ -0,0 +1,95 @@
+# Development Guide
+
+## Code style
+
+We follow [Google Style Python Docstring](https://google.github.io/styleguide/pyguide.html) for development.
+
+The following is a typical example:
+
+```python
+class SampleClass:
+    """Summary of class here.
+
+    Longer class information...
+    Longer class information...
+
+    """
+
+    def __init__(self, likes_spam: bool = False):
+        """Initializes the instance based on spam preference.
+
+        Args:
+            likes_spam: Defines if instance exhibits this preference.
+        """
+        self.likes_spam = likes_spam
+        self.eggs = 0
+
+    def public_method(self, a, b):
+        """Performs operation blah.
+
+        Long description here.
+
+        Args:
+            a (int): xxx
+            b (int/str): xxx
+
+        Returns:
+            t (bool): xxx
+            k (int): xxx
+        """
+        # function implementation goes here
+```
+
+## Run unit tests
+
+We use `tox` to run unit tests. You should install `tox` in your development environment:
+```
+pip install tox
+```
+Currently we only use python3.10 to run tests. If you don't have python3.10 in your system, you can use conda. After conda is installed, you should install the tox conda plugin by running
+```
+pip install tox-conda
+```
+After tox is ready, you can run all the unit tests by running
+```
+tox
+```
+Please note that tox reuses the same virtual environment, which is initialized by installing all packages listed in `requirements.txt` and `requirements-dev.txt`. If either of those files is modified, you should re-create the virtual environment by running
+```
+tox -r
+```
+
+To run a single unit test task during development, you can run
+
+```
+pytest tests/your_test_file.py
+```
+
+### Unit tests in AzureDevops pipeline
+
+We use AzureDevops to run unit tests before you can merge your PR to the main branch. You can find the pipeline definition in `azure-pipelines.yml`.
+
+Please note that no GPU is available in the AzureDevops pipeline agent, so your unit tests must be able to run on CPU to pass the CI. Two options are available:
+1. Use the `@replace_all_device_with('cpu')` decorator to replace all devices with cpu. Please refer to other tests for examples.
+2. Mark your test case as GPU-only using the `@pytest.mark.skipif(not torch.cuda.is_available(), reason='lack of gpu devices')` decorator. Please refer to existing tests for examples.
+
+Before you push your code, please run the tests at least on GPU machines to make sure all tests pass, since GPU test cases can't run in the AzureDevops pipeline. Of course, it is even better if you can run all tests on both GPU and CPU machines.
+
+### Run unit tests in vscode
+
+VS Code has great support for unit tests: you can run/debug every test easily. Please refer to this document to set up your environment: https://code.visualstudio.com/docs/python/testing
+
+Another trick: if you want to step into package source code, you can add the following config to your .vscode/launch.json:
+```
+{
+    "name": "Debug Unit Test",
+    "type": "python",
+    "request": "test",
+    "justMyCode": false,
+},
+```
+
+### Write Unit Tests
+1. If you need to use torchrun, please refer to `unit_test/launch_torchrun.py`; you can find examples in `unit_tests/runtime/test_runtime_collectives.py`. Please note that `torchrun` is very slow, so you should reduce its usage as much as possible.
+2. If you want to mock any functions/methods, please use pytest-mock.
+3. **NOTE**: The names of test files and test functions must start with `test_`
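To make the CPU/GPU conventions from dev.md concrete, here is a minimal hypothetical test file. It uses only the standard library so it is self-contained: `unittest.skipIf` plays the role of the `@pytest.mark.skipif` decorator the repo actually uses, and `gpu_available()` is a stand-in for `torch.cuda.is_available()`.

```python
# test_sample.py -- hypothetical sketch of the dev.md conventions.
# File name and test methods start with "test_". Uses stdlib unittest;
# the repo itself uses pytest and @pytest.mark.skipif instead.
import shutil
import unittest


def gpu_available() -> bool:
    # Stand-in for torch.cuda.is_available(): true when nvidia-smi is on PATH.
    return shutil.which("nvidia-smi") is not None


class TestSample(unittest.TestCase):
    def test_runs_on_cpu(self):
        # CPU-only tests like this one can pass in the AzureDevops agent.
        self.assertEqual(2 + 3, 5)

    @unittest.skipIf(not gpu_available(), "lack of gpu devices")
    def test_needs_gpu(self):
        # Skipped automatically on machines without a GPU, so CI stays green.
        self.assertTrue(gpu_available())
```

Run it with `pytest test_sample.py` (pytest collects unittest-style classes too) or `python -m unittest test_sample`.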

docs/source/dimops.md

Lines changed: 1 addition & 1 deletion
@@ -96,7 +96,7 @@ During policy decision, user can see the operator and its name is 'matmul_custom'
 def PAS(graph: IRGraph, resource):
     for node in graph.nodes():
         if node.name == 'matmul_custom':
-            algo = node.algorithms('dim')
+            algo = node.algorithm('dim')
             # partition k
             config = dict(idx=0, dim=1, num=resource.ngpus)
             subnodes = graph.partition(node, algo, **config)
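For intuition about what partitioning `dim=1` (the k dimension) of a matmul means, here is a plain-Python sketch with no nnScaler involved: splitting along k gives each "device" a partial product, and the partials must be summed, which is why a k-partition implies a reduction across devices.

```python
# Pure-Python illustration of k-dimension partitioning of C = A @ B.
# Splitting A's columns / B's rows into chunks yields partial products
# that sum to the full result -- the reduction a k-partition introduces.

def matmul(A, B):
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(A[i][t] * B[t][j] for t in range(k)) for j in range(m)]
            for i in range(n)]

def add(C1, C2):
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(C1, C2)]

A = [[1, 2, 3, 4],
     [5, 6, 7, 8]]
B = [[1, 0], [0, 1], [1, 1], [2, 2]]

# "Device 0" takes the first half of k, "device 1" the second half.
A0, A1 = [row[:2] for row in A], [row[2:] for row in A]
B0, B1 = B[:2], B[2:]
partial = add(matmul(A0, B0), matmul(A1, B1))  # the all-reduce of partials
assert partial == matmul(A, B)
print(partial)  # [[12, 13], [28, 29]]
```

Partitioning a non-reduction dimension (rows of A, columns of B) would instead concatenate the sub-results with no reduction needed.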

docs/source/installation.rst

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@ The wheel package is hosted on `GitHub release <https://github.com/microsoft/nns
 
 .. code-block:: bash
 
-    pip install https://github.com/microsoft/nnscaler/releases/download/0.5/nnscaler-0.5-py3-none-any.whl
+    pip install https://github.com/microsoft/nnscaler/releases/download/0.6/nnscaler-0.6-py3-none-any.whl
 
 ************************
 Install from Source Code

docs/source/quickstart.rst

Lines changed: 1 addition & 1 deletion
@@ -10,7 +10,7 @@ nnScaler can be installed from GitHub:
 
 .. code-block:: bash
 
-    pip install https://github.com/microsoft/nnscaler/releases/download/0.5/nnscaler-0.5-py3-none-any.whl
+    pip install https://github.com/microsoft/nnscaler/releases/download/0.6/nnscaler-0.6-py3-none-any.whl
 
     # You may also want to clone the repo to try out the examples
     git clone --recursive https://github.com/microsoft/nnscaler

docs/source/troubleshooting.rst

Lines changed: 21 additions & 0 deletions
@@ -114,6 +114,8 @@ Run the following command: ::
 
     python -c 'import os,sys,nnscaler,cppimport.import_hook ; sys.path.append(os.path.dirname(nnscaler.__path__[0])) ; import nnscaler.autodist.dp_solver'
 
+If it complains ``GLIBCXX_x.y.z`` not found, check the next issue.
+
 Example stacktrace: ::
 
     Traceback (most recent call last):
@@ -141,6 +143,25 @@ Example stacktrace: ::
         import nnscaler.autodist.dp_solver as dp_solver
     ModuleNotFoundError: No module named 'nnscaler.autodist.dp_solver'
 
+"ImportError: ...... libstdc++.so.6: version \`GLIBCXX_x.y.z' not found"
+-------------------------------------------------------------------------
+
+This is caused by a gcc and glibc version mismatch.
+Typically it means it's using the system gcc and conda's glibc.
+
+You can remove conda's glibc to force it to use the system glibc: ::
+
+    rm <PATH_TO_CONDA_ENV>/lib/libstdc++.so.6
+
+The path is shown in the error message.
+
+Example stacktrace: ::
+
+    $ python -c 'import nnscaler,cppimport.import_hook ; import nnscaler.autodist.dp_solver'
+    Traceback (most recent call last):
+      File "<string>", line 1, in <module>
+    ImportError: /home/user/miniconda3/envs/user/bin/../lib/libstdc++.so.6: version `GLIBCXX_3.4.30' not found (required by .../nnscaler/autodist/dp_solver.cpython-310-x86_64-linux-gnu.so)
+
 Incorrect Usages
 ================
 
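A quick way to see whether the active environment ships its own `libstdc++.so.6` (the usual culprit behind the GLIBCXX error above) is a small check like this. The `<env>/lib` layout is an assumption about conda-style environments, matching the path in the example stacktrace.

```python
# Check whether the current Python environment bundles its own libstdc++.so.6,
# which can shadow the newer system copy and trigger the GLIBCXX error.
import os
import sys
from typing import Optional


def env_libstdcxx() -> Optional[str]:
    # Conda-style environments keep shared libraries under <env>/lib
    # (an assumption about the layout, not a guarantee).
    candidate = os.path.join(sys.prefix, "lib", "libstdc++.so.6")
    return candidate if os.path.exists(candidate) else None


hit = env_libstdcxx()
print(hit or "no environment-local libstdc++; the system copy will be used")
```

If the check prints a path, that is the file the troubleshooting entry suggests removing so the loader falls back to the system library.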

examples/customized_ops/ring_attention/test_ring_attn.py

Lines changed: 1 addition & 1 deletion
@@ -62,7 +62,7 @@ def policy(graph: IRGraph, resource: ComputeConfig) -> IRGraph:
         if not partitioned and node.signature == 'ring_attn.wrap_ring_attn_func':
             print('Partitioned node: ', node)
             sub_nodes = graph.partition(
-                node, node.algorithms('dim'), idx=0, dim=1, num=ngpus)
+                node, node.algorithm('dim'), idx=0, dim=1, num=ngpus)
             partitioned = True
         else:
             sub_nodes = graph.replicate(node, times=ngpus)

examples/customized_ops/ring_attention/test_zigzag_attn.py

Lines changed: 1 addition & 1 deletion
@@ -62,7 +62,7 @@ def policy(graph: IRGraph, resource: ComputeConfig) -> IRGraph:
         if not partitioned and node.signature == 'zigzag_attn.wrap_zigzag_attn_func':
             print('Partitioned node: ', node)
             sub_nodes = graph.partition(
-                node, node.algorithms('dim'), idx=0, dim=1, num=ngpus)
+                node, node.algorithm('dim'), idx=0, dim=1, num=ngpus)
             partitioned = True
         else:
             sub_nodes = graph.replicate(node, times=ngpus)

examples/llama/README.rst

Lines changed: 3 additions & 3 deletions
@@ -162,7 +162,7 @@ If the profiling is skipped, the system will use MI250's data by default. You ca
 
 .. code-block:: bash
 
-    cd nnscaler && python utility/prim_profiler.py
+    torchrun --nnodes=<X> --nproc_per_node=<Y> -m nnscaler.profiler.benchmark_comm
 
 Checkpoint
 ==========
@@ -288,9 +288,9 @@ For example, you can use the following command to prepare data and train a small
 
     # prepare data
     python bookcorpus.py --data_path_or_name bookcorpus/bookcorpus --tokenizer_path_or_name meta-llama/Meta-Llama-3-8B-Instruct --save_path ./bookcorpus_llama3_4K --sequence_length 4096
-
+
     # build the mini model
     python create_mini_model.py --model_id meta-llama/Meta-Llama-3-8B-Instruct --output_id ./llama3_mini
-
+
     # compile and run using data parallelism + zero1
     torchrun --nproc_per_node=2 train.py --plan_ngpus 1 --runtime_ngpus 2 --name llama3_debug --model_id ./llama3_mini --dataset_path ./bookcorpus_llama3_4K
