# Local TGI Deployment for SQLCoder
This Docker Compose file runs Hugging Face Text Generation Inference (TGI) to serve the `defog/sqlcoder-7b-2` model.
## Prerequisites
- An NVIDIA GPU with recent drivers and a CUDA runtime that matches the `nvidia-container-toolkit` installation.
- Docker and `docker compose` v2.
- A Hugging Face access token with permission to download the model (`HUGGING_FACE_HUB_TOKEN`).
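A quick way to sanity-check these prerequisites before launching (the CUDA image tag below is only an example; any recent CUDA base image works):

```powershell
docker --version
docker compose version
nvidia-smi

# Confirm containers can see the GPU
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```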
## Usage
1. Export your Hugging Face token in the shell where you will run Compose:
```powershell
$env:HUGGING_FACE_HUB_TOKEN = "hf_..."
```
2. Launch the stack:
```powershell
docker compose -f db_agent/deployment/docker-compose.yml up -d
```
3. Follow the logs to watch the model download and load:
```powershell
docker compose -f db_agent/deployment/docker-compose.yml logs -f
```
4. The TGI OpenAI-compatible endpoint will be available at `http://localhost:8080/v1`. Use it with `openai`-compatible SDKs or direct HTTP calls.
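A minimal smoke test with direct HTTP calls, assuming the served model's tokenizer ships a chat template (the `/v1` chat route requires one; TGI's native `/generate` endpoint works either way). The prompt and `max_tokens` value are placeholders:

```powershell
# Returns 200 once the model is loaded; throws while TGI is still starting up
Invoke-RestMethod -Uri "http://localhost:8080/health"

# One OpenAI-style chat completion against the local endpoint
$body = @{
    model      = "tgi"   # TGI serves a single model, so this field is not used for routing
    messages   = @(@{ role = "user"; content = "Write a SQL query that counts the rows in the orders table." })
    max_tokens = 256
} | ConvertTo-Json -Depth 5

Invoke-RestMethod -Uri "http://localhost:8080/v1/chat/completions" `
  -Method Post -ContentType "application/json" -Body $body
```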
## Notes
- The compose file pins `CUDA_VISIBLE_DEVICES=2` to target the 24 GB RTX 3090; update it if your GPU indices differ. To list the indices on your machine:
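```powershell
nvidia-smi --query-gpu=index,name,memory.total --format=csv
```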
- Token limits are tightened (`--max-total-tokens=3072`, `--max-input-length=2048`) so the KV cache fits alongside the weights on 16–24 GB cards.
- Models are cached in the `model-cache` volume to avoid re-downloading. To locate and inspect the volume:
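```powershell
# Compose prefixes the volume name with the project name; the prefix below is an example
docker volume ls --filter name=model-cache
docker volume inspect deployment_model-cache
```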
- To shut down:
```powershell
docker compose -f db_agent/deployment/docker-compose.yml down
```
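Append `--volumes` to the `down` command if you also want to delete the cached model weights.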
- For CPU-only testing, remove the `deploy.resources` block and expect very slow inference.