3 changes: 2 additions & 1 deletion mint.json
"mode": "auto"
},
"navigation": [
{
{
"group": "Getting Started",
"pages": ["about", "installation", "quick-start"]
},
{
"group": "Tutorials",
"pages": [
"tutorials/deploying-llama-3-to-aws",
"tutorials/deploying-llama-3-to-aws-using-query-flag",
"tutorials/deploying-llama-3-to-gcp",
"tutorials/deploying-llama-3-to-azure"
]
60 changes: 36 additions & 24 deletions quick-start.mdx
From the dropdown, select `Delete a Model Endpoint` to see the list of model endpoints.
![Delete Endpoints](../Images/delete-1.png)


### Querying Models (Interactive)

From the dropdown, select `Query a Model Endpoint` to see the list of model endpoints. Press space to select the endpoints you want to query. Enter the query in the text box and press enter to get the response.

![Query Endpoints](../Images/query-1.png)


---

## YAML-based Deployment (Recommended)

For reproducible deployments, use YAML configuration:

```sh
magemaker --deploy .magemaker_config/llama3-deploy.yaml
```

<Note>
For gated models like Llama from Meta, you must (1) accept the model license on Hugging Face **and** (2) provide a valid `HUGGING_FACE_HUB_KEY` in your environment for the deployment to succeed.
</Note>


---

## YAML-based Querying (New)

Once an endpoint is deployed, you can issue batch or ad hoc queries directly from a YAML file without opening the interactive menu.

1. Create a query YAML (e.g. `.magemaker_config/llama3-query.yaml`):

```yaml
deployment: !Deployment
  destination: aws
  endpoint_name: llama3-endpoint

query: !Query
  input: "What are the key differences between Llama 2 and Llama 3?"
```

2. Execute the query with the new `--query` flag:

```sh
magemaker --query .magemaker_config/llama3-query.yaml
```

This flag is now available for all three cloud providers (AWS, GCP, Azure) and mirrors the request/response you would get when using the SDKs directly.
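For comparison, here is a minimal sketch of the same request made directly against the endpoint with boto3 (assuming an AWS endpoint named `llama3-endpoint` as above; the payload shape follows the standard Hugging Face `inputs` format):

```python
import json
import boto3

# Low-level equivalent of the --query flag: call the SageMaker runtime API directly
runtime = boto3.client("sagemaker-runtime")
response = runtime.invoke_endpoint(
    EndpointName="llama3-endpoint",
    ContentType="application/json",
    Body=json.dumps({"inputs": "What are the key differences between Llama 2 and Llama 3?"}),
)
print(json.loads(response["Body"].read()))
```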

<Warning>
Endpoints continue to accrue costs while running, so remember to delete them when you're done!
</Warning>


---

## Model Fine-tuning

Example training YAML:
```yaml
training: !Training
  per_device_train_batch_size: 32
  learning_rate: 2e-5
```
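Assuming training runs follow the same flag pattern as deployment and querying (hypothetical; confirm against `magemaker --help`), you would launch a fine-tuning job like so:

```sh
# Hypothetical flag, mirroring --deploy and --query; verify with magemaker --help
magemaker --train .magemaker_config/llama3-train.yaml
```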
{/*
### Recommended Models

<CardGroup>
<Card
title="google-bert/bert-base-uncased"
href="https://huggingface.co/google-bert/bert-base-uncased"
>
Fill Mask: tries to complete your sentence like Madlibs. Query format: text
string with [MASK] somewhere in it.
</Card>

<Card
title="sentence-transformers/all-MiniLM-L6-v2"
href="https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2"
>
Feature extraction: turns text into a 384d vector embedding for semantic
search / clustering. Query format: "type out a sentence like this one."
</Card>
</CardGroup> */}

<Warning>
Remember to deactivate unused endpoints to avoid unnecessary charges!
</Warning>

You can reach us, faizan & jneid, at [support@slashml.com](mailto:support@slashml.com).


If anything doesn't make sense or you have suggestions, do point them out at [magemaker.featurebase.app](https://magemaker.featurebase.app/).

We'd love to hear from you and learn how we can make this more valuable for the community.
89 changes: 89 additions & 0 deletions tutorials/deploying-llama-3-to-aws-using-query-flag.mdx
---
title: Deploying Llama 3 to SageMaker using the Query Flag
---

## Introduction
This tutorial guides you through deploying Llama 3 to AWS SageMaker using Magemaker **and** shows how to query it using the new `--query` flag. Ensure you have followed the [installation](installation) steps before proceeding.

## Step 1: Setting Up Magemaker for AWS
Run the following command to configure Magemaker for AWS SageMaker deployment:
```sh
magemaker --cloud aws
```
This initializes Magemaker with the necessary configurations for deploying models to SageMaker.
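Magemaker inherits the standard AWS credentials configured on your machine. As a quick sanity check (a plain boto3 sketch, not part of Magemaker itself), you can confirm which account those credentials resolve to:

```python
import boto3

# Verify the AWS identity Magemaker will pick up from your environment
identity = boto3.client("sts").get_caller_identity()
print(f"Account: {identity['Account']}")
print(f"Caller ARN: {identity['Arn']}")
```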

## Step 2: YAML-based Deployment
For reproducible deployments, use YAML configuration:
```sh
magemaker --deploy .magemaker_config/llama3-deploy.yaml
```

Example deployment YAML:
```yaml
deployment: !Deployment
  destination: aws
  endpoint_name: llama3-endpoint
  instance_count: 1
  instance_type: ml.g5.2xlarge
  num_gpus: 1
  quantization: null
models:
- !Model
  id: meta-llama/Meta-Llama-3-8B-Instruct
  location: null
  predict: null
  source: huggingface
  task: text-generation
  version: null
```

<Note>
For gated models like Llama from Meta, you must (1) accept the model license on Hugging Face **and** (2) provide a valid `HUGGING_FACE_HUB_KEY` in your environment for the deployment to succeed.
</Note>

<Warning>
You may need to request a quota increase for specific machine types and GPUs in the region where you plan to deploy the model. Check your AWS quotas before proceeding.
</Warning>
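If you want to inspect your current limits programmatically, here is a sketch using the AWS Service Quotas API via boto3 (the quota-name filter is illustrative; match the exact names shown in your console):

```python
import boto3

# List SageMaker quotas and surface g5 endpoint-usage limits
client = boto3.client("service-quotas", region_name="us-east-1")
paginator = client.get_paginator("list_service_quotas")
for page in paginator.paginate(ServiceCode="sagemaker"):
    for quota in page["Quotas"]:
        name = quota["QuotaName"].lower()
        if "g5" in name and "endpoint" in name:
            print(quota["QuotaName"], "->", quota["Value"])
```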

## Step 3: Querying with the `--query` Flag
After the deployment finishes, you can issue requests directly from the CLI without the interactive dropdown.

### 3.1 Create a Query YAML
Create `.magemaker_config/llama3-query.yaml`:
```yaml
deployment: !Deployment
  destination: aws
  endpoint_name: llama3-endpoint

query: !Query
  input: "Explain the concept of quantum entanglement in simple terms."
```

### 3.2 Execute the Query
```sh
magemaker --query .magemaker_config/llama3-query.yaml
```

Sample Response:
```json
{
"generated_text": "Quantum entanglement is like having two magical coins…",
"model": "meta-llama/Meta-Llama-3-8B-Instruct",
"total_tokens": 95,
"generation_time": 1.3
}
```

## Step 4: Programmatic Query (Python)
You can also call the endpoint via the SageMaker SDK:
```python
from sagemaker.huggingface.model import HuggingFacePredictor
import sagemaker

# Attach to the running endpoint by name
predictor = HuggingFacePredictor(
    endpoint_name="llama3-endpoint",
    sagemaker_session=sagemaker.Session(),
)

# Send a prompt in the standard Hugging Face inference payload format
print(predictor.predict({"inputs": "What are you?"}))
```

## Conclusion
You have successfully deployed and queried Llama 3 on AWS SageMaker using Magemaker’s new `--query` workflow. For any questions or feedback, feel free to contact us at [support@slashml.com](mailto:support@slashml.com).