Commit 2368540

Merge v0.6 (#24)
1 parent 734295c commit 2368540

File tree

83 files changed: +3931 -988 lines changed


.gitignore

Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
+__pycache__
+*.egg-info
+*.so
+
+.vs/
+.vscode/
+
+.tox/
+.coverage
+.coverage.*
+htmlcov/
+
+benchmark/megatron/Megatron-LM
+benchmark/deepspeed/Megatron-DeepSpeed
+
+gencode*.py
+fullmodel.pt
+fullmodel.pt.*
+dist_param_map.pt
+
+docs/build/
+
+## autodist ##
+
+# Python cache
+*.pyc
+dist
+.cache
+*env
+
+# Generated by Cube
+gencode*
+*.pt
+
+# Other
+shelf
+*.iml
+*.xml
+
+# cppimport generated file
+.rendered.*.cpp
+.nnscaler/

README.md

Lines changed: 5 additions & 4 deletions
@@ -13,7 +13,8 @@ nnScaler is a parallelization engine that compiles a Deep neural network (DNN) m
 
 # Latest News
 nnScaler (also known as CUBE as code name) has been adopted by multiple product and research projects, this section includes some of the latest news from the team and partner projects.
-* **2024-11-26** nnScaler 0.5 released: https://github.com/microsoft/nnscaler/releases/tag/0.5
+* **2025-01-08** nnScaler 0.6 released: https://github.com/microsoft/nnscaler/releases/tag/0.6
+* **2024-10-07** Diff-Transformer utilizes nnScaler for differential attention mechanism: [DIFFERENTIAL TRANSFORMER](https://arxiv.org/abs/2410.05258)
 * **2024-05-09** YOCO utilizes nnScaler for long-sequence training: [(YOCO)You only cache once: Decoder-decoder architectures for language models](https://arxiv.org/abs/2405.05254)
 * **2024-04-22** Post training for the long context version of [Phi-3 series](https://arxiv.org/abs/2404.14219)
 * **2024-02-21** LongRoPE utilizes nnScaler to reduce both the training and inference costs: [LongRoPE: Extending LLM context window beyond 2 million tokens](https://arxiv.org/abs/2402.13753)
@@ -41,7 +42,7 @@ For **_DNN system experts_**, they can leverage nnScaler to explore new DNN para
 
 Install the following packages before the installation of nnScaler:
 
-Python >= 3.8, < 3.11 (3.10 is recommended)
+Python >= 3.9, < 3.11 (3.10 is recommended)
 
 PyTorch >= 2.0, < 2.4 (2.2.0 is recommended)
 
@@ -75,7 +76,7 @@ Obtain access of Llama-3 model from [HuggingFace](https://huggingface.co/meta-ll
 
 ### Code Changes for Parallelization
 
-You can find all the example code at `examples/llama3_8B_128K`. As shown below, a user needs to:
+You can find all the example code at `examples/llama`. As shown below, a user needs to:
 * Wrap the Model: Include loss computation and other necessary components.
 * Configure Components: Set up the model, optimizer, and dataloader.
 * Initialize and Start: In the main function, create an nnScaler trainer with the above configurations and start the training process.
@@ -135,7 +136,7 @@ def main(args):
 Then we can start the example, and all the parallelization tasks will be finished by nnScaler automatically.
 
 ```shell
-cd examples/llama3_8B_128K
+cd examples/llama
 
 # prepare training data:
 python bookcorpus.py --data_path_or_name bookcorpus/bookcorpus --tokenizer_path_or_name meta-llama/Meta-Llama-3-8B-Instruct --save_path ./bookcorpus_llama3_4K --sequence_length 4096
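The README's three steps (wrap the model, configure components, initialize and start) can be sketched schematically. Everything below is a hypothetical stand-in written for illustration only; `WrappedModel`, `TrainerConfig`, and `Trainer` are not the actual nnScaler API.

```python
# Schematic of the wrap / configure / initialize-and-start flow described above.
# Every class here is an illustrative stand-in, NOT the real nnScaler API.
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple


@dataclass
class WrappedModel:
    # Step 1: wrap the model so the forward pass also yields a loss.
    forward: Callable[[float], float]

    def loss(self, x: float, target: float) -> float:
        return (self.forward(x) - target) ** 2


@dataclass
class TrainerConfig:
    # Step 2: bundle model, optimizer settings, and dataloader in one config.
    model: WrappedModel
    lr: float
    data: Iterable[Tuple[float, float]]


class Trainer:
    # Step 3: the main function builds a trainer from the config and starts it.
    def __init__(self, cfg: TrainerConfig):
        self.cfg = cfg
        self.steps = 0

    def train(self) -> int:
        for x, y in self.cfg.data:
            self.cfg.model.loss(x, y)  # a real trainer would backprop here
            self.steps += 1
        return self.steps


cfg = TrainerConfig(model=WrappedModel(forward=lambda x: 2 * x), lr=0.1,
                    data=[(1.0, 2.0), (2.0, 4.0)])
print(Trainer(cfg).train())  # one step per sample
```

The point of the pattern is that the trainer only sees opaque configured components, which is what lets a parallelization engine rewrite the model and dataloader behind the same interface.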

dev.md

Lines changed: 95 additions & 0 deletions
@@ -0,0 +1,95 @@
+# Development Guide
+
+## Code style
+
+We follow [Google Style Python Docstring](https://google.github.io/styleguide/pyguide.html) for development.
+
+The following is a typical example:
+
+```python
+class SampleClass:
+    """Summary of class here.
+
+    Longer class information...
+    Longer class information...
+
+    """
+
+    def __init__(self, likes_spam: bool = False):
+        """Initializes the instance based on spam preference.
+
+        Args:
+            likes_spam: Defines if instance exhibits this preference.
+        """
+        self.likes_spam = likes_spam
+        self.eggs = 0
+
+    def public_method(self, a, b):
+        """Performs operation blah.
+
+        Long description here.
+
+        Args:
+            a (int): xxx
+            b (int/str): xxx
+
+        Returns:
+            t (bool): xxx
+            k (int): xxx
+        """
+        # function implementation goes here
+```
+
+## Run unit tests
+
+We use `tox` to run unit tests. You should install `tox` in your development environment:
+```
+pip install tox
+```
+Currently we only use python3.10 to run tests. If you don't have python3.10 in your system, you can use conda. After conda is installed, you should install the tox conda plugin by running
+```
+pip install tox-conda
+```
+After tox is ready, you can run all the unit tests by running
+```
+tox
+```
+Please note that tox reuses the same virtual environment, which is initialized by installing all packages listed in `requirements.txt` and `requirements-dev.txt`. If either of those files is modified, you should re-create the virtual environment by running
+```
+tox -r
+```
+
+To run a single unit test task during development, you can run
+
+```
+pytest tests/your_test_file.py
+```
+
+### Unit tests in AzureDevops pipeline
+
+We use AzureDevops to run unit tests before you can merge your PR to the main branch. You can find the pipeline definition in `azure-pipelines.yml`.
+
+Please note that no GPU is available in the AzureDevops pipeline agent, so your unit tests must be able to run on CPU to pass the CI. Two options are available:
+1. Use the `@replace_all_device_with('cpu')` decorator to replace all devices with cpu. Please refer to other tests for examples.
+2. Mark your test case as GPU-only using the `@pytest.mark.skipif(not torch.cuda.is_available(), reason='lack of gpu devices')` decorator. Please refer to existing tests for examples.
+
+Before you push your code, please run the tests at least on GPU machines to make sure all tests pass, since GPU test cases can't run in the AzureDevops pipeline. Of course, it is even better if you can run all tests on both GPU and CPU machines.
+
+### Run unit tests in vscode
+
+VS Code has great support for unit tests: you can run/debug every test easily. Please refer to this document to set up your environment: https://code.visualstudio.com/docs/python/testing
+
+Another trick: if you want to step into package source code, you can add the following config to your .vscode/launch.json:
+```
+{
+    "name": "Debug Unit Test",
+    "type": "python",
+    "request": "test",
+    "justMyCode": false,
+},
+```
+
+### Write Unit Tests
+1. If you need to use torchrun, please refer to `unit_test/launch_torchrun.py`; you can find examples in `unit_tests/runtime/test_runtime_collectives.py`. Please note that `torchrun` is very slow, so you should reduce its usage as much as possible.
+2. If you want to mock any functions/methods, please use pytest-mock.
+3. **NOTE**: The names of test files and test functions must start with `test_`
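To make the CPU/GPU conventions from dev.md concrete, here is a minimal hypothetical test file. It uses only the standard library so it is self-contained: `unittest.skipIf` plays the role of the `@pytest.mark.skipif` decorator the repo actually uses, and `gpu_available()` is a stand-in for `torch.cuda.is_available()`.

```python
# test_sample.py -- hypothetical sketch of the dev.md conventions.
# File name and test methods start with "test_". Uses stdlib unittest;
# the repo itself uses pytest and @pytest.mark.skipif instead.
import shutil
import unittest


def gpu_available() -> bool:
    # Stand-in for torch.cuda.is_available(): true when nvidia-smi is on PATH.
    return shutil.which("nvidia-smi") is not None


class TestSample(unittest.TestCase):
    def test_runs_on_cpu(self):
        # CPU-only tests like this one can pass in the AzureDevops agent.
        self.assertEqual(2 + 3, 5)

    @unittest.skipIf(not gpu_available(), "lack of gpu devices")
    def test_needs_gpu(self):
        # Skipped automatically on machines without a GPU, so CI stays green.
        self.assertTrue(gpu_available())
```

Run it with `pytest test_sample.py` (pytest collects unittest-style classes too) or `python -m unittest test_sample`.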

docs/source/dimops.md

Lines changed: 1 addition & 1 deletion
@@ -96,7 +96,7 @@ During policy decision, user can see the operator and its name is 'matmul_custom'
 def PAS(graph: IRGraph, resource):
     for node in graph.nodes():
         if node.name == 'matmul_custom':
-            algo = node.algorithms('dim')
+            algo = node.algorithm('dim')
             # partition k
             config = dict(idx=0, dim=1, num=resource.ngpus)
             subnodes = graph.partition(node, algo, **config)
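For intuition about what partitioning `dim=1` (the k dimension) of a matmul means, here is a plain-Python sketch with no nnScaler involved: splitting along k gives each "device" a partial product, and the partials must be summed, which is why a k-partition implies a reduction across devices.

```python
# Pure-Python illustration of k-dimension partitioning of C = A @ B.
# Splitting A's columns / B's rows into chunks yields partial products
# that sum to the full result -- the reduction a k-partition introduces.

def matmul(A, B):
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(A[i][t] * B[t][j] for t in range(k)) for j in range(m)]
            for i in range(n)]

def add(C1, C2):
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(C1, C2)]

A = [[1, 2, 3, 4],
     [5, 6, 7, 8]]
B = [[1, 0], [0, 1], [1, 1], [2, 2]]

# "Device 0" takes the first half of k, "device 1" the second half.
A0, A1 = [row[:2] for row in A], [row[2:] for row in A]
B0, B1 = B[:2], B[2:]
partial = add(matmul(A0, B0), matmul(A1, B1))  # the all-reduce of partials
assert partial == matmul(A, B)
print(partial)  # [[12, 13], [28, 29]]
```

Partitioning a non-reduction dimension (rows of A, columns of B) would instead concatenate the sub-results with no reduction needed.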

docs/source/installation.rst

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@ The wheel package is hosted on `GitHub release <https://github.com/microsoft/nns
 
 .. code-block:: bash
 
-    pip install https://github.com/microsoft/nnscaler/releases/download/0.5/nnscaler-0.5-py3-none-any.whl
+    pip install https://github.com/microsoft/nnscaler/releases/download/0.6/nnscaler-0.6-py3-none-any.whl
 
 ************************
 Install from Source Code

docs/source/quickstart.rst

Lines changed: 1 addition & 1 deletion
@@ -10,7 +10,7 @@ nnScaler can be installed from GitHub:
 
 .. code-block:: bash
 
-    pip install https://github.com/microsoft/nnscaler/releases/download/0.5/nnscaler-0.5-py3-none-any.whl
+    pip install https://github.com/microsoft/nnscaler/releases/download/0.6/nnscaler-0.6-py3-none-any.whl
 
     # You may also want to clone the repo to try out the examples
     git clone --recursive https://github.com/microsoft/nnscaler

docs/source/troubleshooting.rst

Lines changed: 21 additions & 0 deletions
@@ -114,6 +114,8 @@ Run the following command: ::
 
     python -c 'import os,sys,nnscaler,cppimport.import_hook ; sys.path.append(os.path.dirname(nnscaler.__path__[0])) ; import nnscaler.autodist.dp_solver'
 
+If it complains ``GLIBCXX_x.y.z`` not found, check the next issue.
+
 Example stacktrace: ::
 
     Traceback (most recent call last):
@@ -141,6 +143,25 @@ Example stacktrace: ::
         import nnscaler.autodist.dp_solver as dp_solver
     ModuleNotFoundError: No module named 'nnscaler.autodist.dp_solver'
 
+"ImportError: ...... libstdc++.so.6: version \`GLIBCXX_x.y.z' not found"
+-------------------------------------------------------------------------
+
+This is caused by a gcc and glibc version mismatch.
+Typically it means it's using the system gcc and conda's glibc.
+
+You can remove conda's glibc to force it to use the system glibc: ::
+
+    rm <PATH_TO_CONDA_ENV>/lib/libstdc++.so.6
+
+The path is shown in the error message.
+
+Example stacktrace: ::
+
+    $ python -c 'import nnscaler,cppimport.import_hook ; import nnscaler.autodist.dp_solver'
+    Traceback (most recent call last):
+      File "<string>", line 1, in <module>
+    ImportError: /home/user/miniconda3/envs/user/bin/../lib/libstdc++.so.6: version `GLIBCXX_3.4.30' not found (required by .../nnscaler/autodist/dp_solver.cpython-310-x86_64-linux-gnu.so)
+
 Incorrect Usages
 ================
 
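A quick way to see whether the active environment ships its own `libstdc++.so.6` (the usual culprit behind the GLIBCXX error above) is a small check like this. The `<env>/lib` layout is an assumption about conda-style environments, matching the path in the example stacktrace.

```python
# Check whether the current Python environment bundles its own libstdc++.so.6,
# which can shadow the newer system copy and trigger the GLIBCXX error.
import os
import sys
from typing import Optional


def env_libstdcxx() -> Optional[str]:
    # Conda-style environments keep shared libraries under <env>/lib
    # (an assumption about the layout, not a guarantee).
    candidate = os.path.join(sys.prefix, "lib", "libstdc++.so.6")
    return candidate if os.path.exists(candidate) else None


hit = env_libstdcxx()
print(hit or "no environment-local libstdc++; the system copy will be used")
```

If the check prints a path, that is the file the troubleshooting entry suggests removing so the loader falls back to the system library.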

examples/customized_ops/ring_attention/test_ring_attn.py

Lines changed: 1 addition & 1 deletion
@@ -62,7 +62,7 @@ def policy(graph: IRGraph, resource: ComputeConfig) -> IRGraph:
         if not partitioned and node.signature == 'ring_attn.wrap_ring_attn_func':
             print('Partitioned node: ', node)
             sub_nodes = graph.partition(
-                node, node.algorithms('dim'), idx=0, dim=1, num=ngpus)
+                node, node.algorithm('dim'), idx=0, dim=1, num=ngpus)
             partitioned = True
         else:
             sub_nodes = graph.replicate(node, times=ngpus)

examples/customized_ops/ring_attention/test_zigzag_attn.py

Lines changed: 1 addition & 1 deletion
@@ -62,7 +62,7 @@ def policy(graph: IRGraph, resource: ComputeConfig) -> IRGraph:
         if not partitioned and node.signature == 'zigzag_attn.wrap_zigzag_attn_func':
             print('Partitioned node: ', node)
             sub_nodes = graph.partition(
-                node, node.algorithms('dim'), idx=0, dim=1, num=ngpus)
+                node, node.algorithm('dim'), idx=0, dim=1, num=ngpus)
             partitioned = True
         else:
             sub_nodes = graph.replicate(node, times=ngpus)

examples/llama/README.rst

Lines changed: 3 additions & 3 deletions
@@ -162,7 +162,7 @@ If the profiling is skipped, the system will use MI250's data by default. You ca
 
 .. code-block:: bash
 
-    cd nnscaler && python utility/prim_profiler.py
+    torchrun --nnodes=<X> --nproc_per_node=<Y> -m nnscaler.profiler.benchmark_comm
 
 Checkpoint
 ==========
@@ -288,9 +288,9 @@ For example, you can use the following command to prepare data and train a small
 
     # prepare data
     python bookcorpus.py --data_path_or_name bookcorpus/bookcorpus --tokenizer_path_or_name meta-llama/Meta-Llama-3-8B-Instruct --save_path ./bookcorpus_llama3_4K --sequence_length 4096
-
+
     # build the mini model
     python create_mini_model.py --model_id meta-llama/Meta-Llama-3-8B-Instruct --output_id ./llama3_mini
-
+
     # compile and run using data parallelism + zero1
     torchrun --nproc_per_node=2 train.py --plan_ngpus 1 --runtime_ngpus 2 --name llama3_debug --model_id ./llama3_mini --dataset_path ./bookcorpus_llama3_4K
