Decode Guide Reference

This page describes the decode guide format.

The decode guide is a JSON file with one entry per host. Each entry defines how the scraper extracts title, chapter content, TOC links, and optional pagination links.

Where It Is Used

Default file: web_novel_scraper/decode_guide/decode_guide.json
Custom file in CLI: --decode-guide-file /path/to/decode_guide.json

Root Structure

The root JSON value must be a list.

[
        {
                "host": "example.com",
                "title": { ... },
                "content": { ... },
                "index": { ... }
        }
]

Each host is matched by exact string value (case-sensitive).
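
The lookup can be sketched as follows. This is a minimal illustration, not the scraper's own code: the function name find_host_entry is hypothetical, and returning the first matching entry is an assumption about how duplicates are handled.

```python
import json


def find_host_entry(guide, host):
    """Return the first entry whose "host" equals host exactly, else None."""
    if not isinstance(guide, list):
        raise ValueError("decode guide root must be a JSON list")
    for entry in guide:
        # Exact, case-sensitive comparison: "Example.com" does not match "example.com".
        if entry.get("host") == host:
            return entry
    return None


guide = json.loads('[{"host": "example.com", "title": {}, "content": {}, "index": {}}]')
entry = find_host_entry(guide, "example.com")
missing = find_host_entry(guide, "Example.com")  # differs only in case
```

Here entry is the example.com object and missing is None, because the comparison does not normalize case.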

Top-Level Keys (Per Host Entry)

Required keys

host (string): Exact hostname used to pick this entry (for example novelbin.com).
title (object): Rules to extract chapter title from chapter HTML.
content (object): Rules to extract chapter content from chapter HTML.
index (object): Rules to extract chapter URLs from TOC HTML.
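
A quick way to check an entry against the required keys above, as a sketch; missing_required_keys is a hypothetical helper, and the scraper's own validation may differ.

```python
REQUIRED_KEYS = ("host", "title", "content", "index")


def missing_required_keys(entry):
    """Return the required keys that are absent from a host entry."""
    return [key for key in REQUIRED_KEYS if key not in entry]


complete = {"host": "example.com", "title": {}, "content": {}, "index": {}}
partial = {"host": "example.com", "title": {}}
complete_missing = missing_required_keys(complete)
partial_missing = missing_required_keys(partial)
```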

Optional keys

next_page (object)

Rules to extract the next TOC page URL when the TOC has pagination.

title_in_content (YES | NO | SEARCH, default SEARCH)

Controls whether the chapter title is prepended to exported content.

YES: always prepend title.
NO: never prepend title.
SEARCH: prepend only if the title is not already in content.
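
The three modes can be sketched as below. This is an illustration under assumptions: apply_title_in_content is a hypothetical name, the title is joined to the content with a newline, and "already in content" is treated as a plain substring check. The real implementation may use a different separator or a looser match.

```python
def apply_title_in_content(title, content, mode="SEARCH"):
    """Return content with the title prepended according to mode."""
    if mode == "YES":
        return f"{title}\n{content}"
    if mode == "NO":
        return content
    if mode == "SEARCH":
        # Assumption: a plain substring test decides whether the title is present.
        return content if title in content else f"{title}\n{content}"
    raise ValueError(f"unknown title_in_content value: {mode!r}")


with_title = apply_title_in_content("Chapter 1", "Some text", "SEARCH")
unchanged = apply_title_in_content("Chapter 1", "Chapter 1\nSome text", "SEARCH")
```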

has_pagination (boolean, default false)

Indicates if TOC pages are paginated for this host.

chapters_in_descending_order (boolean, default false)

Set to true when chapter URLs are listed in descending order on the TOC page.

pagination_in_descending_order (boolean, default false)

Set to true when TOC pages are listed in descending order.

add_host_to_chapter (boolean, default false)

If true, each URL extracted from index is prefixed with https://<host>.
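
As a sketch of how chapters_in_descending_order and add_host_to_chapter might be applied to the URLs extracted from index: normalize_chapter_urls is a hypothetical helper, and it assumes that descending lists are simply reversed and that the host prefix is joined by plain string concatenation.

```python
def normalize_chapter_urls(urls, entry):
    """Apply the ordering and host-prefix flags of a host entry to chapter URLs."""
    result = list(urls)
    if entry.get("chapters_in_descending_order", False):
        # Assumption: reversing a descending list yields reading order.
        result.reverse()
    if entry.get("add_host_to_chapter", False):
        prefix = f"https://{entry['host']}"
        result = [prefix + url for url in result]
    return result


entry = {
    "host": "example.com",
    "chapters_in_descending_order": True,
    "add_host_to_chapter": True,
}
chapter_urls = normalize_chapter_urls(["/novel/chapter-2", "/novel/chapter-1"], entry)
```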

toc_main_url_processor (boolean, default false)

Enables a custom processing hook that is applied to the TOC main URL before it is used.

Section Decoder Keys

These keys are used inside title, content, index, and next_page.

Required shape

Unless use_custom_processor is set, each section must define one of:

selector
one or more selector parts: element, id, class, attributes

Key reference

selector (string): Full CSS selector used by BeautifulSoup select().
element (string): Tag name part used to build selector when selector is not provided.
id (string): ID selector part used to build selector (becomes #id).
class (string): Class selector part used to build selector (becomes .class).
attributes (object): Attribute filters used to build selector. Example: {"data-id": "123", "hidden": null}.
array (boolean): If true, returns all matched values as a list. If false or omitted, returns only the first match.
extract (object): Defines what to extract from each matched element (text or attribute).
use_custom_processor (boolean): Declares that this section relies on a custom processor. When set, it should be the only key in the section.
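
The way selector parts combine into a CSS selector can be sketched as follows. This is an assumption-laden illustration: build_selector is a hypothetical name, the part order (element, id, class, attributes) is a guess, and a null attribute value is read as "attribute is present".

```python
def build_selector(section):
    """Build a CSS selector string from a section's selector parts."""
    if "selector" in section:
        # A full selector takes precedence over individual parts.
        return section["selector"]
    parts = [section.get("element", "")]
    if "id" in section:
        parts.append(f"#{section['id']}")
    if "class" in section:
        parts.append(f".{section['class']}")
    for name, value in section.get("attributes", {}).items():
        # Assumption: null (None) means the attribute only has to exist.
        parts.append(f"[{name}]" if value is None else f'[{name}="{value}"]')
    selector = "".join(parts)
    if not selector:
        raise ValueError("section defines neither selector nor selector parts")
    return selector


built = build_selector({
    "element": "div",
    "id": "main",
    "class": "chapter",
    "attributes": {"data-id": "123", "hidden": None},
})
```

With these assumptions, built is div#main.chapter[data-id="123"][hidden].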

Extract Keys

extract.type

Extraction mode:

text: use text content.
attr: use an HTML attribute.

extract.key (string, required when type=attr)

Attribute name to extract (for example href, src).
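
The rules for the extract object can be checked with a small sketch like the one below. validate_extract is a hypothetical helper written for this page, not a function provided by the scraper.

```python
def validate_extract(extract):
    """Return a list of problems found in an extract object (empty if valid)."""
    problems = []
    kind = extract.get("type")
    if kind not in ("text", "attr"):
        problems.append(f'"type" must be "text" or "attr", got {kind!r}')
    if kind == "attr" and not isinstance(extract.get("key"), str):
        # extract.key is required only when type is attr.
        problems.append('"key" is required when "type" is "attr"')
    return problems


text_problems = validate_extract({"type": "text"})
attr_problems = validate_extract({"type": "attr"})  # key is missing
```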

Selector Fallback With XOR

If selector contains XOR, selectors are tried left-to-right until one returns elements.

{
        "selector": "div.primary p XOR div.fallback p",
        "array": true
}
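
The left-to-right fallback can be sketched as below. select_with_fallback is a hypothetical name, and splitting on XOR surrounded by whitespace is an assumption about how the separator is parsed. The select argument is any callable that maps a CSS selector to a list of matches; a dictionary-backed stand-in is used here so the sketch runs without an HTML parser.

```python
import re


def select_with_fallback(selector, select):
    """Try each XOR-separated selector in order; return the first non-empty result."""
    for candidate in re.split(r"\s+XOR\s+", selector.strip()):
        elements = select(candidate)
        if elements:
            return elements
    return []


# Stand-in for a real page: only the fallback selector matches anything.
matches = {"div.fallback p": ["<p>first</p>", "<p>second</p>"]}
tried = []


def fake_select(css):
    tried.append(css)
    return matches.get(css, [])


result = select_with_fallback("div.primary p XOR div.fallback p", fake_select)
```

With a parsed BeautifulSoup document, the select method of the soup object could be passed as the select argument.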

Defaults Summary

title_in_content -> SEARCH
has_pagination -> false
add_host_to_chapter -> false
chapters_in_descending_order -> false
pagination_in_descending_order -> false
toc_main_url_processor -> false

Minimal Valid Example

[
        {
                "host": "example.com",
                "title_in_content": "SEARCH",
                "has_pagination": false,
                "title": {
                        "selector": "h1.chapter-title",
                        "extract": {
                                "type": "text"
                        }
                },
                "content": {
                        "selector": "div.chapter-content p",
                        "array": true
                },
                "index": {
                        "selector": "ul.chapter-list a",
                        "array": true,
                        "extract": {
                                "type": "attr",
                                "key": "href"
                        }
                }
        }
]

Example With Pagination

[
        {
                "host": "example.com",
                "has_pagination": true,
                "title": {
                        "selector": "h1.chapter-title",
                        "extract": {
                                "type": "text"
                        }
                },
                "content": {
                        "selector": "div.chapter-content p",
                        "array": true
                },
                "index": {
                        "selector": "ul.chapter-list a",
                        "array": true,
                        "extract": {
                                "type": "attr",
                                "key": "href"
                        }
                },
                "next_page": {
                        "selector": "a.next",
                        "array": false,
                        "extract": {
                                "type": "attr",
                                "key": "href"
                        }
                }
        }
]
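
How a paginated TOC might be walked with next_page can be sketched as follows. This is not the scraper's actual loop: collect_toc_pages and its get_next_url callable are hypothetical, and the cycle guard and page limit are defensive assumptions. In practice get_next_url would download the page and apply the next_page rules; here a dictionary stands in for the site.

```python
def collect_toc_pages(first_url, get_next_url, max_pages=1000):
    """Follow next-page links from first_url and return page URLs in visit order."""
    pages = []
    seen = set()
    url = first_url
    # Stop on a missing link, a repeated URL (cycle), or the page limit.
    while url and url not in seen and len(pages) < max_pages:
        pages.append(url)
        seen.add(url)
        url = get_next_url(url)
    return pages


# Stand-in for a site with three TOC pages; the last page has no next link.
links = {"/toc?page=1": "/toc?page=2", "/toc?page=2": "/toc?page=3", "/toc?page=3": None}
toc_pages = collect_toc_pages("/toc?page=1", links.get)
```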