Skip to content

Add markdown-to-blocks parser for proper Logseq block hierarchy#12

Open
RobertoGongora wants to merge 7 commits intoergut:mainfrom
RobertoGongora:feature/block-rendering
Open

Add markdown-to-blocks parser for proper Logseq block hierarchy#12
RobertoGongora wants to merge 7 commits intoergut:mainfrom
RobertoGongora:feature/block-rendering

Conversation

@RobertoGongora
Copy link

@RobertoGongora RobertoGongora commented Jan 7, 2026

Add markdown-to-blocks parser for proper Logseq block hierarchy

Fixes #7

This PR introduces intelligent markdown parsing that converts markdown content into Logseq's native block tree structure, fixing the issue where create_page and update_page would dump all content as a single block.

Key improvements:

  • ✅ Markdown → proper Logseq block hierarchy (headings, lists, nesting)
  • ✅ YAML frontmatter → page properties (no more MethodNotExist error)
  • ✅ Code blocks preserved as single blocks (not split line-by-line)
  • ✅ TODO/DONE checkboxes without duplicate markers
  • ✅ Multi-line blockquotes join correctly
  • ✅ All nested blocks returned by get_page_content

📖 Quick Examples

Example 1: Simple Page Creation - Proper Hierarchy

Before (v1.0.1) ❌

# All content dumped as a single flat block
create_page("Project Notes", """
# Project Notes
- Task 1
  - Subtask A
- Task 2
""")

# Result in Logseq:
# Single block containing:
# "# Project Notes\n- Task 1\n  - Subtask A\n- Task 2"

After (v1.1.0) ✅

# Properly parsed into hierarchical blocks
create_page("Project Notes", """
# Project Notes
- Task 1
  - Subtask A
- Task 2
""")

# Result in Logseq:
# ├─ # Project Notes
# │  ├─ Task 1
# │  │  └─ Subtask A
# │  └─ Task 2

Try it yourself:

from mcp_logseq.logseq import LogSeq
from mcp_logseq.parser import parse_content

content = """# Project Notes
- Task 1
  - Subtask A
- Task 2"""

parsed = parse_content(content)
print(f"Blocks created: {len(parsed.blocks)}")
print(f"First block: {parsed.blocks[0].content}")
print(f"First block children: {len(parsed.blocks[0].children)}")
# Output:
# Blocks created: 1
# First block: # Project Notes
# First block children: 2
Example 2: Properties + TODO Lists - No More Errors

Before (v1.0.1) ❌

create_page("Tasks", """
---
priority: high
due-date: 2026-01-15
---

# Tasks
- [ ] TODO: Design feature
- [x] DONE: Initial setup
""")

# Result in Logseq:
# - Properties: error: MethodNotExist: get_page_properties
# - TODO markers duplicated: "TODO TODO: Design feature"
# - Date serialization error: "Object of type date is not JSON serializable"

After (v1.1.0) ✅

create_page("Tasks", """
---
priority: high
due-date: 2026-01-15
---

# Tasks
- [ ] TODO: Design feature
- [x] DONE: Initial setup
""")

# Result in Logseq:
# - Properties: {priority: "high", due-date: "2026-01-15"} ✅
# - Clean TODO markers: "TODO Design feature" ✅
# - Date serialized correctly to ISO string ✅

Try it yourself:

from mcp_logseq.parser import parse_content

content = """---
priority: high
due-date: 2026-01-15
---

# Tasks
- [ ] TODO: Design feature
- [x] DONE: Initial setup"""

parsed = parse_content(content)
print(f"Properties: {parsed.properties}")
print(f"First task: {parsed.blocks[0].children[0].content}")
print(f"Second task: {parsed.blocks[0].children[1].content}")
# Output:
# Properties: {'priority': 'high', 'due-date': '2026-01-15'}
# First task: TODO Design feature
# Second task: DONE Initial setup

Summary

Changes: 11 files (+2543, -682)
Tests: 103 passing with comprehensive coverage
Type checking: 0 errors

See commit message for full details on new features, API extensions, and behavioral changes.

Recommended version: v1.1.0 (feature release)

This PR introduces intelligent markdown parsing that converts markdown content
into Logseq's native block tree structure, fixing the issue where create_page
and update_page would dump all content as a single block.

## New Features

### Markdown Parser (src/mcp_logseq/parser.py)
- Parse YAML frontmatter into page properties
- Convert headings (H1-H6) into hierarchical block sections
- Handle nested bullet lists (-, *, +) with proper indentation
- Support numbered lists with nesting
- Convert checkboxes to Logseq TODO/DONE markers
- Preserve fenced code blocks as single blocks
- Join contiguous blockquote lines into single blocks
- Serialize date/datetime values to ISO strings for JSON compatibility

### Enhanced Tool Handlers
- CreatePageToolHandler: Now parses markdown and creates proper block hierarchy
- UpdatePageToolHandler: Added 'mode' parameter (append/replace)
- Both support YAML frontmatter for page properties

### API Extensions (src/mcp_logseq/logseq.py)
- insert_batch_block(): Use Logseq's insertBatchBlock for efficient bulk inserts
- create_page_with_blocks(): Create pages with proper block hierarchy
- update_page_with_blocks(): Update with append/replace modes
- clear_page_content(): Remove all blocks from a page
- Helper methods for block manipulation

## Tests
- 91 tests passing with comprehensive coverage
- 36 new parser tests for all markdown features
- Edge cases: deep nesting, special characters, empty content

## Behavioral Changes
- update_page now defaults to 'append' mode (adds content after existing blocks)
- Use mode='replace' to clear and replace all content

## Dependencies
- Added pyyaml>=6.0 for YAML frontmatter parsing

## Recommended Version
Consider bumping to v1.1.0 for this feature release.
Adds support for Logseq's flexible marker system where any capitalized
word (3+ characters) at the start of a line can be used as a task marker
with nested children.

This enables users to use custom workflow markers like TODO, DONE, DOING,
WAITING, LATER, NOW, IN-PROGRESS, etc., matching Logseq's native system.

## Pattern
- Matches: [A-Z][A-Z0-9_-]{2,} followed by space and text
- Minimum 3 total characters (e.g., NOW, TODO, DONE)
- Supports hyphens and underscores (IN-PROGRESS, PRIORITY_HIGH)
- Supports numbers (STEP1, PRIORITY2)
- Excludes: 2-char codes (CA, NY, US), lowercase, mixed-case

## Examples
DONE Task completed
  - Detail 1
  - Detail 2

NOW Urgent work
  - Immediate action needed

IN-PROGRESS Current project
  - Phase 1 complete
  - Phase 2 in progress

## Changes
- Add CAPITALIZED_MARKER_PATTERN regex to parser
- Update _parse_list_item_content() to handle capitalized markers
- Update main parsing loop to detect capitalized markers
- Update list parsing logic to recognize markers at same level

## Tests
- 12 new tests covering markers, nesting, edge cases
- All 103 tests passing (+12 from previous 91)
- 0 type errors
- Verified with real journal entry format

This allows journal entries like 'DONE Task' with nested children to work
correctly, matching common Logseq workflow patterns.
@RobertoGongora
Copy link
Author

RobertoGongora commented Jan 7, 2026

Fixed an issue with capitalized markers not being properly parsed. I'm using this and improving as I find things but it's already in a pretty usable state.

Edit: Additional improvements made on the get_page_contents tool so its friendlier with LLMs and cross-tool e.g. for full page replacements.

Fixes issue where get_page_content with format='text' only displayed
top-level blocks and ignored nested children, causing incomplete page
content display.

## Changes

### Core Fix
- Add _format_block_tree() recursive helper function to traverse and
  format block hierarchies with proper indentation
- Use 2-space indentation per nesting level (Logseq standard)
- Preserve TODO/DONE markers in content
- Display block-level properties inline (tags as #tag, others as key::value)

### New Feature
- Add max_depth parameter to control nesting display depth
  - Default: -1 (unlimited, show all levels)
  - max_depth=0: only top-level blocks
  - max_depth=1: parent + immediate children
  - Enables performance optimization for deeply nested pages

### API Changes
- GetPageContentToolHandler.get_tool_description(): Add max_depth input parameter
- GetPageContentToolHandler.run_tool(): Extract and use max_depth parameter
- New static method: _format_block_tree(block, indent_level, max_depth)

## Tests
- Add 5 comprehensive test cases for nested block formatting
- Test 2-level nesting with proper indentation
- Test 3+ level deep nesting
- Test max_depth limiting behavior
- Test markers and properties display
- Test multiple sibling blocks at same level
- All 108 tests passing (+5 from previous 103)

## Example Output

Before (broken):
```
# 2026-01-07

Content:
- DONE Parent task
```

After (fixed):
```
# 2026-01-07

Content:
- DONE Parent task #opensource #contribution priority::high
  - DONE Child task
    - Grandchild detail
```

This ensures users can see complete hierarchical content when reading
pages via the MCP tool, matching their Logseq graph structure.
Remove confusing decorative headers from get_page_content tool output that
were causing LLMs to include them as part of content during update operations.

Changes:
- Replace title header and property labels with YAML frontmatter format
- Remove "# Title" header (page name already passed to tool)
- Remove "Content:" label before blocks
- Convert page properties to YAML frontmatter (---...---)
- Change empty page message from text to "-" (valid Logseq syntax)
- Update tests to check for YAML frontmatter instead of old headers

Benefits:
- LLM-friendly output without ambiguous decorative headers
- Standard YAML frontmatter format (widely recognized)
- Direct usage for update_page operations without confusion
- Clean, parseable output that matches Logseq expectations

Breaking change: Text format output structure changed (acceptable for minor bump)
Properties are now properly stored on the first block of a page using
upsertBlockProperty, which is how Logseq internally handles page properties.

Key changes:
- Set properties AFTER block insertion in create_page_with_blocks
- Set properties AFTER block operations in update_page_with_blocks
- Add property merge logic: append mode merges, replace mode replaces
- Add property value normalization for tags/aliases dicts
- Fix '[object Object]' issue when tags passed as dict

The normalization handles special cases where tags/aliases are passed
as dicts (e.g., {"hello": true, "test": true}) and converts them
to arrays (["hello", "test"]) which Logseq expects.

Fixes property persistence in UI and JSON retrieval.

Test coverage:
- 22 property persistence tests (12 original + 10 normalization)
- All 130 tests passing
- 0 type errors
Logseq API returns properties in two places:
1. In block.content field (as 'property:: value' lines)
2. In block.properties dict

Our code was adding properties a third time from the properties dict,
causing triple duplication in the output.

Changes:
- Remove inline property rendering from _format_block_tree
- Properties now shown once, as they appear in block content
- Page-level properties still shown in YAML frontmatter
- Update test to reflect real Logseq API behavior

Example output now:
---
heading: 1
qa: true
---
- # hello world
  heading:: 1    (shown once, from content)
  qa:: true      (shown once, from content)

Instead of:
- # hello world
  heading:: 1    (from content)
  qa:: true      (from content)
  heading::1 qa::True  (duplicate from our code)

All 130 tests passing.
Since page properties are already included in the first block's content
(as Logseq stores them), we don't need to duplicate them in YAML
frontmatter at the top of the output.

Changes:
- Remove YAML frontmatter rendering from get_page_content text format
- Remove unused yaml import
- Update tests to reflect that properties appear once in content
- Simplify output format

Before:
---
heading: 1
qa: true
---
- # hello world
  heading:: 1
  qa:: true

After:
- # hello world
  heading:: 1
  qa:: true

Properties shown once, as they naturally appear in Logseq's content.
All 130 tests passing.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

get page content only returns top level blocks

1 participant