2 changes: 1 addition & 1 deletion docs-style-guide.md
@@ -209,7 +209,7 @@ Help readers find information quickly by organizing content into clear levels of

2. **Section heading (H2)** -- key steps or concepts, e.g., *Realtime processing*.

3. **Subheading (H3+)** -- finer details within a section, e.g., *Operating points*.
3. **Subheading (H3+)** -- finer details within a section, e.g., *Models*.

4. **Paragraph** -- up to 3 sentences per paragraph.

12 changes: 6 additions & 6 deletions docs/deployments/container/accessing-images.mdx
@@ -64,21 +64,21 @@ See [how to run the Core Speech CPU container here.](/deployments/container/cpu-

The Transcription GPU images are required to use the most accurate models.

### Standard operating point
### Standard model

There is a single image available that supports all languages for the Standard Operating Point. There are language specific images available that support the Enhanced and Standard Operating Point.
There is a single image available that supports all languages for the Standard model. There are language specific images available that support the Enhanced and Standard models.

<CodeBlock>
{`# pulling the Standard operating point Transcription GPU inference server which supports all languages with the ${smVariables.latestContainerVersion} tag:
{`# pulling the Standard model Transcription GPU inference server which supports all languages with the ${smVariables.latestContainerVersion} tag:
docker pull speechmaticspublic.azurecr.io/sm-gpu-inference-server-standard-all:${smVariables.latestContainerVersion}
\0
# pulling language specific Transcription GPU inference servers available for en, es, de, fr. Supports both Enhanced and Standard operating points with the ${smVariables.latestContainerVersion} tag:
# pulling language specific Transcription GPU inference servers available for en, es, de, fr. Supports both Enhanced and Standard models with the ${smVariables.latestContainerVersion} tag:
docker pull speechmaticspublic.azurecr.io/sm-gpu-inference-server-en:${smVariables.latestContainerVersion}`}
</CodeBlock>

### Enhanced operating point
### Enhanced model

Depending on which Enhanced Operating Point languages are required, you can pull specific images.
Depending on which Enhanced model languages are required, you can pull specific images.

<details open>
<summary>Language Pack 1</summary>
4 changes: 2 additions & 2 deletions docs/deployments/container/batch-persistent-worker.mdx
@@ -87,7 +87,7 @@ curl -X POST address.of.container:PORT/v2/jobs \
"transcription_config": {
"language": "en",
"diarization": "speaker",
"operating_point": "enhanced"
"model": "enhanced"
}
}' \
-F 'data_file=@~/audio_file.mp3'
@@ -201,7 +201,7 @@ curl -X POST address.of.container:PORT/v2/jobs \
"transcription_config": {
"language": "en",
"diarization": "speaker",
"operating_point": "enhanced"
"model": "enhanced"
}
}' \
-F 'data_file=@~/audio_file.mp3'
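
The rename in the request body above can be applied to existing job configs mechanically. The following is a minimal Python sketch, assuming configs are held as dictionaries; the helper name `rename_operating_point` is hypothetical and not part of any Speechmatics tooling, and this diff does not state whether the API continues to accept `operating_point`.

```python
import copy
import json


def rename_operating_point(config: dict) -> dict:
    """Return a copy of a job config with `operating_point` renamed to `model`.

    Hypothetical helper for illustration; the input config is left untouched.
    """
    migrated = copy.deepcopy(config)
    transcription_config = migrated.get("transcription_config", {})
    # Only rename when the old key is present and the new key is not already set.
    if "operating_point" in transcription_config and "model" not in transcription_config:
        transcription_config["model"] = transcription_config.pop("operating_point")
    return migrated


old_config = {
    "transcription_config": {
        "language": "en",
        "diarization": "speaker",
        "operating_point": "enhanced",
    }
}

new_config = rename_operating_point(old_config)
print(json.dumps(new_config, indent=2))
```

The printed JSON can then be passed as the job config in the `curl` request in place of the old one.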
4 changes: 2 additions & 2 deletions docs/deployments/container/cpu-speech-to-text.mdx
@@ -257,7 +257,7 @@ The first-session loading time can be reduced down to several hundred millisecon
You can enable this feature by setting the `SM_PREWARM_ENGINE_MODES` environment variable, with a semicolon separated list describing the required engine modes. For example, to prewarm 1 English GPU Standard and 2 English GPU Enhanced:
`SM_PREWARM_ENGINE_MODES='en_general_gpu_standard:1;en_general_gpu_enhanced:2'`

In general, the format is: `{language}_{domain}_{processor}_{operating_point}:{prewarm_connections}`.
In general, the format is: `{language}_{domain}_{processor}_{model}:{prewarm_connections}`.

The parameters are:
- `language` - One of the supported [language codes](/speech-to-text/languages)
@@ -266,7 +266,7 @@ The parameters are:

- `processor` - One of `cpu` or `gpu`. Note that selecting `gpu` requires a [GPU Inference Container](/deployments/container/gpu-speech-to-text)

- `operating_point` - One of `standard` or `enhanced`. The [operating point](/speech-to-text/models) you want to prewarm
- `model` - One of `standard` or `enhanced`. The [model](/speech-to-text/models) you want to prewarm

- `prewarm_connections` - Integer. The number of engine instances of the specific mode you want to pre-warm. The total number of `prewarm_connections` cannot be greater than `SM_MAX_CONCURRENT_CONNECTIONS`. After the pre-warming is complete, this parameter does not limit the types of connections the engine can start.
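
As an illustration of the format described above, here is a rough Python sketch that assembles a `SM_PREWARM_ENGINE_MODES` value and checks the total against `SM_MAX_CONCURRENT_CONNECTIONS`. The function names are hypothetical, the `general` domain value is taken from the example string above, and the checks only mirror the constraints stated in this section; the container may validate differently.

```python
VALID_PROCESSORS = ("cpu", "gpu")
VALID_MODELS = ("standard", "enhanced")


def prewarm_entry(language, domain, processor, model, prewarm_connections):
    """Format one engine mode as {language}_{domain}_{processor}_{model}:{prewarm_connections}."""
    if processor not in VALID_PROCESSORS:
        raise ValueError(f"processor must be one of {VALID_PROCESSORS}")
    if model not in VALID_MODELS:
        raise ValueError(f"model must be one of {VALID_MODELS}")
    return f"{language}_{domain}_{processor}_{model}:{prewarm_connections}"


def build_prewarm_modes(entries, max_concurrent_connections):
    """Join entries with semicolons, rejecting totals above the connection limit."""
    total = sum(entry[4] for entry in entries)
    if total > max_concurrent_connections:
        raise ValueError(
            f"total prewarm_connections ({total}) exceeds "
            f"SM_MAX_CONCURRENT_CONNECTIONS ({max_concurrent_connections})"
        )
    return ";".join(prewarm_entry(*entry) for entry in entries)


# 1 English GPU Standard and 2 English GPU Enhanced, as in the example above.
value = build_prewarm_modes(
    [("en", "general", "gpu", "standard", 1), ("en", "general", "gpu", "enhanced", 2)],
    max_concurrent_connections=3,
)
print(value)  # en_general_gpu_standard:1;en_general_gpu_enhanced:2
```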

14 changes: 8 additions & 6 deletions docs/deployments/container/gpu-speech-to-text.mdx
@@ -105,14 +105,16 @@ The server can only support one of these modes at once.

Once the GPU Server is running, follow the [Instructions for Linking a CPU Container](/deployments/container/cpu-speech-to-text#linking-to-a-gpu-inference-container).

### Running only one operating point
### Running only one model

[Operating Points](/speech-to-text/models) represent different levels of model complexity.
To save GPU memory for throughput, you can run the server with only one Operating Point loaded. To do this, pass the
`SM_OPERATING_POINT` environment variable to the container and set it to either `standard` or `enhanced`.
[Models](/speech-to-text/models) (previously called Operating Points) represent different levels of model complexity.
To save GPU memory for throughput, you can run the server with only one model loaded. To do this, pass the
`SM_MODEL` environment variable to the container and set it to either `standard` or `enhanced`.

`SM_MODEL` replaces the older `SM_OPERATING_POINT` environment variable. `SM_OPERATING_POINT` is deprecated but still works and accepts the same `standard` and `enhanced` values; use `SM_MODEL` going forward.

:::info
When running the all language standard Operating Point GPU inference server you must set the `SM_OPERATING_POINT` environment variable to `standard`
When running the all language standard model GPU inference server you must set the `SM_MODEL` environment variable to `standard`
:::
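
For deployment scripts that need to handle both variable names during the transition, a small wrapper could resolve the value before starting the container. This is a sketch only: `resolve_model` is a hypothetical helper, and preferring `SM_MODEL` when both variables are set is an assumption, since the text above does not say which one the server uses in that case.

```python
import os

VALID_MODELS = ("standard", "enhanced")


def resolve_model(env=None):
    """Return the configured model, or None if neither variable is set.

    Assumption: SM_MODEL wins over the deprecated SM_OPERATING_POINT when both
    are present. Empty strings are treated as unset.
    """
    env = os.environ if env is None else env
    value = env.get("SM_MODEL") or env.get("SM_OPERATING_POINT")
    if not value:
        return None
    if value not in VALID_MODELS:
        raise ValueError(f"expected one of {VALID_MODELS}, got {value!r}")
    return value


print(resolve_model({"SM_MODEL": "standard"}))
print(resolve_model({"SM_OPERATING_POINT": "enhanced"}))
```

Passing a plain dictionary instead of `os.environ` keeps the helper easy to test without touching the real environment.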

### Monitoring the server
@@ -121,7 +123,7 @@ The inference server is based on [Nvidia's Triton architecture](https://develope
can be monitored using Triton's inbuilt Prometheus metrics, or the GRPC/HTTP APIs. To expose these, configure an external mapping for port
8002(Prometheus) or 8000(HTTP).

### Operating points in GPU inference
### Models in GPU inference

When inference is outsourced to a GPU server, alternative GPU-specific models are used, so you should not expect to see identical results compared to CPU-based inference. For convenience, the GPU models are also designated as 'standard' and 'enhanced'.

2 changes: 1 addition & 1 deletion docs/deployments/container/gpu-translation.mdx
@@ -103,7 +103,7 @@ Assuming the following config file:
{
"type": "transcription",
"transcription_config": {
"operating_point": "enhanced",
"model": "enhanced",
"language": "en"
},
"translation_config": {
8 changes: 4 additions & 4 deletions docs/deployments/container/performance-and-cost.mdx
@@ -11,7 +11,7 @@ This is a comparison of the performance and estimated running costs of transcrip
### Batch transcription


| Operating Point | [CPU Standard](./cpu-speech-to-text) | [CPU Enhanced](./cpu-speech-to-text) | [GPU Standard](./gpu-speech-to-text) | [GPU Enhanced](./gpu-speech-to-text) |
| Models | [CPU Standard](./cpu-speech-to-text) | [CPU Enhanced](./cpu-speech-to-text) | [GPU Standard](./gpu-speech-to-text) | [GPU Enhanced](./gpu-speech-to-text) |
|--------------------------------------------|--------------|--------------|--------------|--------------|
| Lowest Processing Cost (US ¢ per hour) | 1.7 | 3.8 | 0.34 | 1.67 |
| Cost vs CPU Standard (%) | - | 224% | 20% | 98% |
@@ -31,13 +31,13 @@ The benchmark uses the following configuration:
| Price Basis | Azure PAYG East US, Linux, Standard |

:::note
For GPU Operating Points, transcribers and inference servers were all run on a single VM node.
For GPU Models, transcribers and inference servers were all run on a single VM node.
:::
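
The "Cost vs CPU Standard (%)" row of the batch table can be reproduced from the "Lowest Processing Cost" row. The short Python sketch below does that arithmetic; the dictionary and function names are illustrative only, and the figures are copied from the table.

```python
# Lowest processing cost in US cents per hour, from the batch transcription table.
batch_cost = {
    "CPU Standard": 1.7,
    "CPU Enhanced": 3.8,
    "GPU Standard": 0.34,
    "GPU Enhanced": 1.67,
}


def relative_cost_percent(costs, baseline="CPU Standard"):
    """Express each cost as a rounded percentage of the baseline cost."""
    base = costs[baseline]
    return {
        name: round(100 * cost / base)
        for name, cost in costs.items()
        if name != baseline
    }


print(relative_cost_percent(batch_cost))
# {'CPU Enhanced': 224, 'GPU Standard': 20, 'GPU Enhanced': 98}
```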


### Realtime transcription

| Operating Point | [CPU Standard](./cpu-speech-to-text#realtime-transcription) | [CPU Enhanced](./cpu-speech-to-text#realtime-transcription) | [GPU Standard](./gpu-speech-to-text#batch-and-real-time-inference) | [GPU Enhanced](./gpu-speech-to-text#batch-and-real-time-inference) |
| Models | [CPU Standard](./cpu-speech-to-text#realtime-transcription) | [CPU Enhanced](./cpu-speech-to-text#realtime-transcription) | [GPU Standard](./gpu-speech-to-text#batch-and-real-time-inference) | [GPU Enhanced](./gpu-speech-to-text#batch-and-real-time-inference) |
|--------------------------------------------|--------------|--------------|--------------|--------------|
| Lowest Processing Cost (US ¢ per hour) | 1.97 | 2.95 | 0.86 | 2.51 |
| Cost vs. CPU Standard (%) | - | 150% | 44% | 127% |
@@ -55,7 +55,7 @@ This benchmark uses the following configuration[^4]:
| Price Basis | Azure PAYG East US, Linux, Standard |

:::note
For GPU Operating Points, the transcribers and inference servers were run on a single VM node.
For GPU Models, the transcribers and inference servers were run on a single VM node.

Each first session, transcriber requires 0.25 cores for both OPs, with 1.2 GB memory (Standard OP) or 3 GB memory (Enhanced OP). Every additional session consumes 0.1 cores and 100 MB of memory.
:::
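
The per-session figures in the note above can be turned into a rough sizing estimate. This Python sketch reads the note as: the first session on a transcriber costs 0.25 cores plus the model's base memory, and each further session adds 0.1 cores and 100 MB. That reading, the function name, and treating 100 MB as 0.1 GB are assumptions.

```python
# Memory for the first session, in GB, per the note above.
FIRST_SESSION_MEMORY_GB = {"standard": 1.2, "enhanced": 3.0}


def transcriber_resources(model, sessions):
    """Estimate (cores, memory_gb) for one transcriber running `sessions` sessions."""
    if sessions < 1:
        raise ValueError("sessions must be at least 1")
    additional = sessions - 1
    cores = 0.25 + 0.1 * additional
    memory_gb = FIRST_SESSION_MEMORY_GB[model] + 0.1 * additional  # 100 MB taken as 0.1 GB
    return round(cores, 2), round(memory_gb, 2)


print(transcriber_resources("standard", 1))  # (0.25, 1.2)
print(transcriber_resources("enhanced", 5))  # (0.65, 3.4)
```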
@@ -20,7 +20,7 @@ async function transcribeFile() {
{
transcription_config: {
language: "en",
operating_point: "enhanced",
model: "enhanced",
},
},
"json-v2",
2 changes: 1 addition & 1 deletion docs/speech-to-text/features/audio-events.mdx
@@ -293,4 +293,4 @@ An example of a request only for `applause` and `music`
- Audio Events is supported only in the JSON type API response
- While the occurrence of music can be detected, richer metadata about the music such as title, artist, genre, etc cannot be identified
- Only one instance of an event type can be tracked at a point in time. e.g. seamlessly switching consecutive songs will be detected as one single music event
- For On-Prem Containers, Audio Events is available only for GPU Operating Points
- For On-Prem Containers, Audio Events is available only for GPU Models