docs: improve README with install guide, feature matrix, and troubleshooting #61

Open
Pendu wants to merge 3 commits into traceopt-ai:main from Pendu:docs/improve-readme

Pendu commented Mar 21, 2026

Summary

  • Added prerequisites table, virtual env setup, and install extras ([torch], [hf], [lightning]) to prevent ModuleNotFoundError on fresh instances
  • Added feature matrix showing capabilities across watch/run/deep modes
  • Added comparison table (TraceML vs W&B / Neptune / TensorBoard)
  • Added examples table with required extras, troubleshooting section, and link to CONTRIBUTING.md

Motivation

Running pip install traceml-ai on a fresh instance and then running examples fails with missing torch/transformers. The README needed clearer prerequisites and dependency guidance.
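
As a rough illustration of the failure mode (not part of this PR), a small preflight check like the one below could report which example dependencies are missing before an example is run. The helper name `missing_modules` is hypothetical, and the module list assumes the examples import `torch` and `transformers`, as the error described above suggests.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    # Assumption: the examples need these two packages; adjust the list as needed.
    missing = missing_modules(["torch", "transformers"])
    if missing:
        print("Missing optional dependencies:", ", ".join(missing))
    else:
        print("All example dependencies found.")
```

Running this on a fresh instance before the examples would surface the gap up front instead of as a `ModuleNotFoundError` partway through a script.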

abhinavsriva left a comment

Please have a look at the requested changes. The goal is to keep the README simple and quick to scan. Everything else goes into the docs.


## Quick start

### Prerequisites

I think this should go into the quickstart rather than here.


For local review and comparison, TraceML also includes a local UI. See [`docs/quickstart.md`](docs/quickstart.md) for setup details.

![TraceML local UI](docs/assets/local_ui.png)

I removed it as there were too many images in the README.

## TraceML vs alternatives

TraceML is for lightweight diagnosis during real PyTorch training runs.
| Capability | TraceML | W&B / Neptune | TensorBoard Profiler |

I would remove W&B / Neptune, since they are not runtime systems, and compare only with TensorBoard or the PyTorch profiler.

I would also remove these rows:

| Experiment tracking & comparison | ❌ | ✅ | ✅ |
| Hyperparameter sweeps | ❌ | ✅ | ❌ |
| Team collaboration | ❌ | ✅ | ❌ |

They are not in the scope of TraceML. These are more for on-platform systems.

| Hyperparameter sweeps | ❌ | ✅ | ❌ |
| Team collaboration | ❌ | ✅ | ❌ |

It is **not**:

I will keep the original. Again, W&B / Neptune are not part of the comparison. In fact, another PR tried to put this into WandB.

- basic example
- input / dataloader stall
- DDP straggler / rank skew
| Example | Requires | Description |

This is good, but please remove the Requires column. TraceML is supposed to complement an existing setup, so the expectation is that HF/torch are already installed. Usual workflows already have torch, and reinstalling it can easily break prod environments.


---

## Troubleshooting

Please move this to quick start. This is good.
