Metadata-Version: 2.4
Name: neuronx-distributed-inference
Version: 0.9.17334+ced6ae4e
Summary: NeuronxDistributedInference
Keywords: aws neuron
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Education
Classifier: Intended Audience :: Science/Research
Classifier: License :: Other/Proprietary License
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Scientific/Engineering :: Mathematics
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development
Classifier: Topic :: Software Development :: Libraries
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.7
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: neuronx_distributed
Requires-Dist: torch_neuronx==2.9.*
Requires-Dist: transformers==4.57.*
Requires-Dist: huggingface-hub
Requires-Dist: sentencepiece
Requires-Dist: torchvision
Requires-Dist: pillow
Requires-Dist: blobfile
Requires-Dist: hf-xet<2.0.0,>=1.1.10
Provides-Extra: test
Requires-Dist: pytest; extra == "test"
Requires-Dist: pytest-forked; extra == "test"
Requires-Dist: pytest-cov; extra == "test"
Requires-Dist: pytest-xdist; extra == "test"
Requires-Dist: pytest-rerunfailures==15.1; extra == "test"
Requires-Dist: accelerate; extra == "test"
Requires-Dist: diffusers==0.32.0; extra == "test"
Requires-Dist: openai-whisper==20250625; extra == "test"
Provides-Extra: flux
Requires-Dist: accelerate; extra == "flux"
Requires-Dist: diffusers==0.32.0; extra == "flux"
Provides-Extra: whisper
Requires-Dist: openai-whisper==20250625; extra == "whisper"
Provides-Extra: experimental
Requires-Dist: omegaconf; extra == "experimental"
Dynamic: classifier
Dynamic: keywords
Dynamic: license-file
Dynamic: provides-extra
Dynamic: requires-dist
Dynamic: requires-python

# NeuronxDistributedInference

This package provides a model hub for running inference on Neuronx Distributed (NxD).

## Examples
This package includes examples that you can reference when you implement code that uses NxD Inference.
* `generation_demo.py` - A basic generation example for Llama.

## Run inference with the inference demo
This package includes an inference demo console script that you can use to run inference. The script includes benchmarking and accuracy-checking features that help developers verify that their models and modules work correctly.

After you install this package, you can run the inference demo with the `inference_demo` console script. The examples below show how to run it. You can also run `inference_demo --help` to view all available arguments.
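
Before running the examples, it can help to confirm that the console script is visible in the active environment. The following is a minimal sketch that uses only the Python standard library. The helper name `find_console_script` is hypothetical and is not part of this package.

```python
import shutil


def find_console_script(name="inference_demo"):
    """Return the absolute path of a console script, or None if it is not on PATH."""
    return shutil.which(name)


path = find_console_script()
if path is None:
    print("inference_demo not found; check that the package is installed in the active environment")
else:
    print(f"inference_demo found at {path}")
```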

### Example 1. Llama inference with token matching accuracy check

```
inference_demo \
  --model-type llama \
  --task-type causal-lm \
  run \
    --model-path /home/ubuntu/model_hf/Llama-3.1-8B-Instruct/ \
    --compiled-model-path /home/ubuntu/traced_model/Llama-3.1-8B-Instruct/ \
    --torch-dtype bfloat16 \
    --tp-degree 32 \
    --batch-size 2 \
    --max-context-length 32 \
    --seq-len 64 \
    --on-device-sampling \
    --enable-bucketing \
    --top-k 1 \
    --pad-token-id 2 \
    --prompt "I believe the meaning of life is" \
    --prompt "The color of the sky is" \
    --check-accuracy-mode token-matching \
    --benchmark
```
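
If you want to launch the same command from a Python script, for example to sweep over prompts, one option is to build the argument list programmatically. The sketch below reuses only the flags and values from Example 1. The helper `build_demo_command` is hypothetical, and setting `--batch-size` to the number of prompts is an assumption based on the examples in this README, not a documented requirement.

```python
import shlex


def build_demo_command(model_path, compiled_model_path, prompts):
    """Build the Example 1 command line as an argument list (hypothetical helper)."""
    cmd = [
        "inference_demo",
        "--model-type", "llama",
        "--task-type", "causal-lm",
        "run",
        "--model-path", model_path,
        "--compiled-model-path", compiled_model_path,
        "--torch-dtype", "bfloat16",
        "--tp-degree", "32",
        # Assumption: one prompt per batch slot, as in the README examples.
        "--batch-size", str(len(prompts)),
        "--max-context-length", "32",
        "--seq-len", "64",
        "--on-device-sampling",
        "--enable-bucketing",
        "--top-k", "1",
        "--pad-token-id", "2",
    ]
    # The demo accepts --prompt more than once, so repeat the flag per prompt.
    for prompt in prompts:
        cmd += ["--prompt", prompt]
    cmd += ["--check-accuracy-mode", "token-matching", "--benchmark"]
    return cmd


cmd = build_demo_command(
    "/home/ubuntu/model_hf/Llama-3.1-8B-Instruct/",
    "/home/ubuntu/traced_model/Llama-3.1-8B-Instruct/",
    ["I believe the meaning of life is", "The color of the sky is"],
)
# Print a shell-quoted version for inspection. To execute it, pass `cmd`
# to subprocess.run(cmd, check=True) on a machine with the package installed.
print(shlex.join(cmd))
```

Passing the arguments as a list avoids shell quoting problems with prompts that contain spaces or quotes.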

### Example 2. DBRX inference with logit matching accuracy check

```
inference_demo \
  --model-type dbrx \
  --task-type causal-lm \
  run \
    --model-path /home/ubuntu/model_hf/dbrx-1layer/ \
    --compiled-model-path /home/ubuntu/traced_model/dbrx-1layer-demo/ \
    --torch-dtype bfloat16 \
    --tp-degree 32 \
    --batch-size 2 \
    --max-context-length 1024 \
    --seq-len 1152 \
    --enable-bucketing \
    --top-k 1 \
    --pad-token-id 0 \
    --prompt "I believe the meaning of life is" \
    --prompt "The color of the sky is" \
    --check-accuracy-mode logit-matching
```

### Example 3. Llama with speculation

```
inference_demo \
  --model-type llama \
  --task-type causal-lm \
  run \
    --model-path /home/ubuntu/model_hf/open_llama_7b/ \
    --compiled-model-path /home/ubuntu/traced_model/open_llama_7b/ \
    --draft-model-path /home/ubuntu/model_hf/open_llama_3b/ \
    --compiled-draft-model-path /home/ubuntu/traced_model/open_llama_3b/ \
    --torch-dtype bfloat16 \
    --tp-degree 32 \
    --batch-size 1 \
    --max-context-length 32 \
    --seq-len 64 \
    --enable-bucketing \
    --speculation-length 5 \
    --top-k 1 \
    --pad-token-id 2 \
    --prompt "I believe the meaning of life is" \
    --check-accuracy-mode token-matching \
    --benchmark
```

### Example 4. Llama with quantization

```
inference_demo \
  --model-type llama \
  --task-type causal-lm \
  run \
    --model-path /home/ubuntu/model_hf/Llama-2-7b/ \
    --compiled-model-path /home/ubuntu/traced_model/Llama-2-7b-demo/ \
    --torch-dtype bfloat16 \
    --tp-degree 32 \
    --batch-size 2 \
    --max-context-length 32 \
    --seq-len 64 \
    --on-device-sampling \
    --enable-bucketing \
    --quantized \
    --quantized-checkpoints-path /home/ubuntu/model_hf/Llama-2-7b/model_quant.pt \
    --quantization-type per_channel_symmetric \
    --top-k 1 \
    --pad-token-id 2 \
    --prompt "I believe the meaning of life is" \
    --prompt "The color of the sky is"
```

### Example 5. Llama inference with logit matching accuracy check using custom error tolerances

```
inference_demo \
  --model-type llama \
  --task-type causal-lm \
  run \
    --model-path /home/ubuntu/model_hf/Llama-2-7b/ \
    --compiled-model-path /home/ubuntu/traced_model/Llama-2-7b-demo/ \
    --torch-dtype bfloat16 \
    --tp-degree 32 \
    --batch-size 2 \
    --max-context-length 32 \
    --seq-len 64 \
    --check-accuracy-mode logit-matching \
    --divergence-difference-tol 0.005 \
    --tol-map "{5: (1e-5, 0.02)}" \
    --enable-bucketing \
    --top-k 1 \
    --pad-token-id 2 \
    --prompt "I believe the meaning of life is" \
    --prompt "The color of the sky is"
```
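
The `--tol-map` value in Example 5 is a Python dictionary literal passed as a string. The sketch below shows one way such a string could be parsed and applied. It assumes that each key is a top-k value and each value is an `(atol, rtol)` pair used in the usual `|actual - expected| <= atol + rtol * |expected|` check. Both that interpretation and the helper names `parse_tol_map` and `within_tolerance` are assumptions made for illustration; the demo's own parsing and comparison logic may differ.

```python
import ast


def parse_tol_map(text):
    """Parse a tolerance-map string such as "{5: (1e-5, 0.02)}" into a dict."""
    # ast.literal_eval only evaluates Python literals, so it is safer than eval.
    tol_map = ast.literal_eval(text)
    if not isinstance(tol_map, dict):
        raise ValueError("tolerance map must be a dict literal")
    for key, value in tol_map.items():
        if not (isinstance(value, tuple) and len(value) == 2):
            raise ValueError(f"entry for {key!r} must be a 2-tuple")
    return tol_map


def within_tolerance(actual, expected, atol, rtol):
    """Assumed check: absolute error bounded by atol plus rtol times |expected|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


tol_map = parse_tol_map("{5: (1e-5, 0.02)}")
atol, rtol = tol_map[5]  # assumed meaning: (atol, rtol) for top-k = 5
print(within_tolerance(1.01, 1.0, atol, rtol))
print(within_tolerance(1.05, 1.0, atol, rtol))
```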
