71 changes: 56 additions & 15 deletions README.md
@@ -4,9 +4,9 @@
[![Python Versions](https://img.shields.io/pypi/pyversions/medium2md-cli.svg)](https://pypi.org/project/medium2md-cli/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

> Convert a Medium export ZIP into clean Markdown with localized images, optimized for Hugo and compatible with Obsidian knowledge bases.
> Convert a Medium export ZIP into clean Markdown with localized images, optimized for Hugo and Obsidian.

**medium2md** is a CLI tool that transforms Medium's HTML export into properly structured Markdown with localized assets. Today, output is optimized for [Hugo](https://gohugo.io/) page bundles and is also readable in [Obsidian](https://obsidian.md/) vaults; planned roadmap work adds stronger Obsidian-specific formatting conventions.
**medium2md** is a CLI tool that transforms Medium's HTML export into properly structured Markdown with localized assets. Output can be generated as [Hugo](https://gohugo.io/) page bundles (default) or as flat [Obsidian](https://obsidian.md/) vault notes, selectable with the `--format` flag.

---

@@ -42,7 +42,7 @@ Medium allows you to export your account data as a ZIP archive, but the raw expo
| Canonical URL | Preserves the original Medium URL |
| Conversion reports | Summarizes what was converted and what was skipped |
| Incremental re-runs | *(planned)* Re-run only changed posts |
| Obsidian compatibility | Current output is Obsidian-readable; dedicated Obsidian formatting profile is planned |
| Obsidian compatibility | Flat `.md` notes with Obsidian-style front matter (`title`, `source`); assets in a shared `assets/` folder |

This tool is designed to be **deterministic**, **reproducible**, and **CI-friendly**.

@@ -59,7 +59,8 @@ Generate correctly formatted Markdown files from Medium posts, with images local
- Convert Medium export ZIP (posts under `posts/` in the export)
- Extract title and canonical URL; generate slug
- Convert HTML to Markdown
- Create Hugo page bundles with `index.md` and optional `images/`
- **Hugo format** (default): Hugo page bundles with `index.md` and optional `images/`
- **Obsidian format**: flat `.md` notes with Obsidian-style front matter (`title`, `source`); images in shared `assets/<slug>/`
- Image localization: download remote images into the bundle; copy local images when present in the export
- Basic slug collision handling (`slug-2`, `slug-3`, …)
- Terminal progress and summary; per-post image count; prompt to create missing output dir
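
The slug collision handling listed above (`slug-2`, `slug-3`, …) can be sketched roughly as follows. This is a minimal illustration under assumptions: `unique_slug` is a hypothetical helper name, and the project's own loop in `cli.py` may differ in detail.

```python
def unique_slug(base_slug: str, used_slugs: set) -> str:
    """Return base_slug, or base_slug-2, base_slug-3, ... if already taken.

    Hypothetical helper for illustration; records the chosen slug in used_slugs.
    """
    slug = base_slug
    suffix = 2
    while slug in used_slugs:
        slug = f"{base_slug}-{suffix}"
        suffix += 1
    used_slugs.add(slug)
    return slug


used = set()
first = unique_slug("my-post", used)
second = unique_slug("my-post", used)
third = unique_slug("my-post", used)
```

Counting up from the base slug keeps the result deterministic for a given processing order, which fits the tool's stated goal of reproducible output.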
@@ -73,14 +74,12 @@ Generate correctly formatted Markdown files from Medium posts, with images local
- Verification command
- Theme-specific front matter mapping
- Conversion report (e.g. JSON/file)
- Obsidian-friendly output profile (e.g., front matter + file layout conventions for vault workflows)

### Known limitations (current)

- Front matter currently includes `title`, `slug`, `draft`, and optional `medium.canonical`; date/tags are not extracted yet.
- Front matter currently includes `title`, `slug`, `draft`, and optional `medium.canonical` (Hugo), or `title` and optional `source` (Obsidian); date/tags are not extracted yet.
- Embedded content is not converted to Hugo shortcodes yet.
- Incremental conversion/state tracking is not implemented yet.
- Output structure is Hugo-first (`content/posts/<slug>/index.md`); a dedicated Obsidian output mode is not implemented yet.

---

@@ -117,9 +116,24 @@ uv run medium2md input/medium-export.zip --out ../blog/content/posts

> **Note:** The `input/` directory is tracked by git (via `.gitkeep`) so it exists after a fresh clone, but its contents are ignored — your ZIP files will never be accidentally committed.

### Front Matter Example
### Choosing an output format

Each converted post produces an `index.md` with Hugo-compatible YAML front matter. Current output:
Use `--format` (or `-f`) to select the output format:

- **`hugo`** (default): each post becomes a Hugo page bundle at `<out>/<slug>/index.md` with images at `<out>/<slug>/images/`.
- **`obsidian`**: each post becomes a flat note at `<out>/<slug>.md` with images at `<out>/assets/<slug>/`. Obsidian-style front matter (`title`, `source`) is used instead of Hugo keys.

```bash
# Hugo format (default)
uv run medium2md input/medium-export.zip --out content/posts

# Obsidian format
uv run medium2md input/medium-export.zip --out my-vault/posts --format obsidian
```
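
The two layouts described above can be sketched as a small path helper. This is an illustrative sketch rather than the project's code: `output_paths` is a hypothetical name, and the `OutputFormat` enum is redeclared here only so the snippet runs on its own.

```python
from enum import Enum
from pathlib import Path


class OutputFormat(str, Enum):
    """Redeclared for illustration; mirrors the two documented formats."""

    hugo = "hugo"
    obsidian = "obsidian"


def output_paths(out: Path, slug: str, fmt: OutputFormat) -> tuple[Path, Path]:
    """Return (markdown_path, images_dir) for one post. Hypothetical helper."""
    if fmt == OutputFormat.hugo:
        # Page bundle: <out>/<slug>/index.md with images next to it
        return out / slug / "index.md", out / slug / "images"
    # Flat note: <out>/<slug>.md with images under a shared assets folder
    return out / f"{slug}.md", out / "assets" / slug


hugo_md, hugo_images = output_paths(Path("content/posts"), "my-post", OutputFormat.hugo)
obs_md, obs_images = output_paths(Path("my-vault/posts"), "my-post", OutputFormat.obsidian)
```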

### Front Matter Examples

**Hugo format** (`--format hugo`, default):

```yaml
---
@@ -131,13 +145,24 @@ medium:
---
```

Additional keys (e.g. `date`, `lastmod`, `tags`) are planned.
**Obsidian format** (`--format obsidian`):

```yaml
---
title: "My Post Title"
source: "https://medium.com/@you/post-slug"
---
```

Additional keys (e.g. `date`, `lastmod`, `tags`) are planned for both formats.
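
As a rough illustration of the Obsidian front matter, the sketch below builds the same block using only the standard library. It is an assumption-laden stand-in: the project itself serializes with `yaml.dump`, whereas this snippet uses `json.dumps` to produce double-quoted strings, and `obsidian_front_matter` is a hypothetical name.

```python
import json
from typing import Optional


def obsidian_front_matter(title: str, canonical: Optional[str] = None) -> str:
    """Build a YAML front matter block with `title` and optional `source`.

    Hypothetical helper; json.dumps is used only to get double-quoted strings.
    """
    lines = ["---", f"title: {json.dumps(title, ensure_ascii=False)}"]
    if canonical:
        # `source` is only emitted when a canonical URL was found
        lines.append(f"source: {json.dumps(canonical, ensure_ascii=False)}")
    lines.append("---")
    return "\n".join(lines) + "\n"


with_source = obsidian_front_matter("My Post Title", "https://medium.com/@you/post-slug")
without_source = obsidian_front_matter("My Post Title")
```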

---

## Output Structure

Each Medium post becomes a Hugo page bundle. Image links in the Markdown point into the bundle’s `images/` folder (remote images are downloaded; local images from the export are copied):
### Hugo format (default)

Each Medium post becomes a Hugo page bundle. Image links in the Markdown point into the bundle's `images/` folder (remote images are downloaded; local images from the export are copied):

```
content/posts/
@@ -149,6 +174,22 @@ content/posts/
└── …
```

### Obsidian format (`--format obsidian`)

Each Medium post becomes a flat Markdown note. Images are placed in a shared `assets/` folder:

```
my-vault/posts/
├── my-post-slug.md
├── another-post.md
└── assets/
    ├── my-post-slug/
    │   ├── 1.png
    │   └── 2.jpg
    └── another-post/
        └── 1.png
```
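
The image references written into each Markdown body follow from these layouts. The sketch below shows the idea under assumptions: `image_ref` is a hypothetical helper, and it only mirrors the documented `images/` and `assets/<slug>/` prefixes with numbered file names.

```python
def image_ref(slug: str, index: int, ext: str, fmt: str = "obsidian") -> str:
    """Return the relative image path used in the Markdown body.

    Hypothetical helper: Hugo bundles reference images/<n><ext>,
    Obsidian notes reference assets/<slug>/<n><ext>.
    """
    prefix = "images/" if fmt == "hugo" else f"assets/{slug}/"
    return f"{prefix}{index}{ext}"


obsidian_ref = image_ref("my-post-slug", 1, ".png")
hugo_ref = image_ref("my-post-slug", 2, ".jpg", fmt="hugo")
```

Because both references are relative to the Markdown file's own directory, the Obsidian prefix has to include the slug, while the Hugo prefix does not.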

---

## Project Structure
@@ -184,13 +225,13 @@ ZIP → extract → find posts → parse HTML → localize images (copy/download
| Milestone | Focus | Status |
|---|---|---|
| 1 — Core conversion | ZIP ingestion, post discovery, HTML→Markdown conversion, Hugo bundle writing, local/remote image localization, slug collision handling | ✅ Implemented |
| 2 — Content fidelity + verification | Better metadata extraction (`date`, tags), machine-readable conversion report, `verify` command, clearer failure reporting, Obsidian formatting compatibility review | 📋 Planned |
| 3 — Incremental + extensibility | Incremental state tracking, embed conversion, output-profile mapping (Hugo/Obsidian), optional Pandoc backend, internal link rewriting | 📋 Planned |
| 2 — Content fidelity + verification | Better metadata extraction (`date`, tags), machine-readable conversion report, `verify` command, clearer failure reporting, Obsidian output format (`--format obsidian`) | ✅ Implemented (Obsidian format); 📋 Planned (date/tags, verification) |
| 3 — Incremental + extensibility | Incremental state tracking, embed conversion, optional Pandoc backend, internal link rewriting | 📋 Planned |

### Roadmap status snapshot (code-verified)

- The repository has implemented the core `convert` flow end-to-end.
- Milestone 2 is the highest-impact next step for knowledge-base quality (`date`/tags extraction, verification/reporting, Obsidian compatibility conventions).
- The repository has implemented the core `convert` flow end-to-end for both Hugo and Obsidian output formats.
- Milestone 2 next steps: `date`/tags extraction and a `verify` command are the highest-impact remaining items.
- Milestone 3 remains optional/polish after fidelity and verification are stable.

---
38 changes: 32 additions & 6 deletions medium2md/cli.py
@@ -4,20 +4,29 @@
import typer

from medium2md.pipeline import (
OutputFormat,
find_post_html_files,
get_title_canonical,
convert_html_file,
slug_from_post,
write_bundle,
write_note,
)

app = typer.Typer(help="Convert a Medium export ZIP into Hugo page bundles.")
app = typer.Typer(help="Convert a Medium export ZIP into Hugo page bundles or Obsidian notes.")


@app.command()
def convert(
export_zip: Path = typer.Argument(..., exists=True, file_okay=True, dir_okay=False),
out: Path = typer.Option(Path("content/posts"), "--out", "-o"),
fmt: OutputFormat = typer.Option(
OutputFormat.hugo,
"--format",
"-f",
help="Output format: 'hugo' (default) produces Hugo page bundles; 'obsidian' produces flat Markdown notes.",
show_default=True,
),
):
out = out.resolve()
if not out.exists():
@@ -34,6 +43,7 @@ def convert(

typer.echo(f"Export: {export_zip}")
typer.echo(f"Out: {out}")
typer.echo(f"Format: {fmt.value}")

with tempfile.TemporaryDirectory(prefix="medium2md_") as td:
tmp_dir = Path(td)
@@ -72,13 +82,29 @@ def convert(
suffix = 2 if slug == base_slug else int(slug.split("-")[-1]) + 1
slug = f"{base_slug}-{suffix}"
used_slugs.add(slug)
bundle_dir = out / slug
bundle_dir.mkdir(parents=True, exist_ok=True)
title, canonical, body_md, num_images = convert_html_file(html_path, tmp_dir, bundle_dir)
write_bundle(out, slug, title, canonical, body_md)

if fmt == OutputFormat.hugo:
bundle_dir = out / slug
bundle_dir.mkdir(parents=True, exist_ok=True)
title, canonical, body_md, num_images = convert_html_file(html_path, tmp_dir, bundle_dir)
write_bundle(out, slug, title, canonical, body_md)
output_path = out / slug / "index.md"
else:
# Obsidian: flat note + shared assets folder
assets_dir = out / "assets" / slug
title, canonical, body_md, num_images = convert_html_file(
html_path,
tmp_dir,
out,
images_dir=assets_dir,
src_prefix=f"assets/{slug}/",
)
write_note(out, slug, title, canonical, body_md)
output_path = out / f"{slug}.md"

written += 1
img_info = f" ({num_images} image(s))" if num_images else ""
typer.echo(f" [{i}/{len(post_files)}] {slug} → {out / slug / 'index.md'}{img_info}")
typer.echo(f" [{i}/{len(post_files)}] {slug} → {output_path}{img_info}")
except Exception as e:
errors += 1
typer.echo(
56 changes: 47 additions & 9 deletions medium2md/pipeline.py
@@ -2,6 +2,7 @@

import shutil
import time
from enum import Enum
from pathlib import Path
from urllib.parse import urlparse

@@ -10,6 +11,13 @@
from markdownify import markdownify as md
from slugify import slugify


class OutputFormat(str, Enum):
    """Supported output formats for converted Markdown files."""

    hugo = "hugo"
    obsidian = "obsidian"

try:
import httpx
except ImportError:
@@ -107,12 +115,14 @@ def _localize_images(
article_soup: BeautifulSoup,
html_path: Path,
tmp_dir: Path,
bundle_dir: Path,
images_dir: Path,
src_prefix: str,
) -> int:
"""In-place: resolve each img src to a local file or download, copy into bundle/images/, set src to images/<name>. Returns count of images localized."""
images_dir = bundle_dir / "images"
images_dir.mkdir(parents=True, exist_ok=True)
"""In-place: resolve each img src to a local file or download, copy into images_dir, set src to src_prefix<name>. Returns count of images localized."""
imgs = article_soup.find_all("img", src=True)
if not imgs:
return 0
images_dir.mkdir(parents=True, exist_ok=True)
localized = 0
for i, img in enumerate(imgs, 1):
src = img["src"].strip()
@@ -131,7 +141,7 @@ def _localize_images(
dest_name = f"{i}{ext}"
dest = images_dir / dest_name
shutil.copy2(resolved, dest)
img["src"] = f"images/{dest_name}"
img["src"] = f"{src_prefix}{dest_name}"
localized += 1
continue
# Remote URL: download (with User-Agent and retry so CDNs don't block)
@@ -154,7 +164,7 @@ def _localize_images(
dest_name = f"{i}{ext}"
dest = images_dir / dest_name
dest.write_bytes(r.content)
img["src"] = f"images/{dest_name}"
img["src"] = f"{src_prefix}{dest_name}"
localized += 1
last_error = None
break
@@ -172,8 +182,14 @@ def convert_html_file(
html_path: Path,
tmp_dir: Path,
bundle_dir: Path,
*,
images_dir: Path | None = None,
src_prefix: str = "images/",
) -> tuple[str, str | None, str, int]:
"""Parse one post HTML file, localize images into bundle_dir/images/, return (title, canonical_url, markdown_body, num_images_localized)."""
"""Parse one post HTML file, localize images, return (title, canonical_url, markdown_body, num_images_localized).

Images are saved into *images_dir* (defaults to bundle_dir/images) and referenced with *src_prefix* in the Markdown.
"""
raw = html_path.read_text(encoding="utf-8", errors="replace")
soup = BeautifulSoup(raw, "lxml")
title = _extract_title(soup)
@@ -183,8 +199,9 @@ def convert_html_file(
body_md = ""
localized = 0
else:
resolved_images_dir = images_dir if images_dir is not None else bundle_dir / "images"
article_soup = BeautifulSoup(article_html, "lxml")
localized = _localize_images(article_soup, html_path, tmp_dir, bundle_dir)
localized = _localize_images(article_soup, html_path, tmp_dir, resolved_images_dir, src_prefix)
body_md = md(
str(article_soup),
heading_style="ATX",
@@ -211,7 +228,7 @@ def write_bundle(out_root: Path, slug: str, title: str, canonical: str | None, b
"""Write one Hugo page bundle: out_root/<slug>/index.md. Returns path to index.md."""
bundle_dir = out_root / slug
bundle_dir.mkdir(parents=True, exist_ok=True)
front = {
front: dict = {
"title": title,
"draft": True,
"slug": slug,
@@ -227,3 +244,24 @@ def write_bundle(out_root: Path, slug: str, title: str, canonical: str | None, b
if body_md and not body_md.endswith("\n"):
f.write("\n")
return index_md


def write_note(out_root: Path, slug: str, title: str, canonical: str | None, body_md: str) -> Path:
    """Write one Obsidian note: out_root/<slug>.md. Returns path to the note.

    Front matter uses Obsidian conventions: ``title`` and ``source`` (canonical URL).
    Images are expected to reside in ``out_root/assets/<slug>/`` and are referenced
    as ``assets/<slug>/<name>`` in the Markdown body.
    """
    front: dict = {"title": title}
    if canonical:
        front["source"] = canonical
    note_path = out_root / f"{slug}.md"
    with note_path.open("w", encoding="utf-8") as f:
        f.write("---\n")
        f.write(yaml.dump(front, default_flow_style=False, allow_unicode=True, sort_keys=False))
        f.write("---\n\n")
        f.write(body_md)
        if body_md and not body_md.endswith("\n"):
            f.write("\n")
    return note_path
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -43,7 +43,7 @@ Repository = "https://github.com/edgarbc/medium2md"
Documentation = "https://github.com/edgarbc/medium2md#readme"

[project.optional-dependencies]
dev = ["twine>=6.0"]
dev = ["pytest>=8.0", "twine>=6.0"]

[tool.hatch.build.targets.wheel]
packages = ["medium2md"]
Empty file added tests/__init__.py