Basic Concepts

Novel

A novel has one or more Tables of Contents and a set of chapters. It can also store metadata such as author, language, tags, and creation or end date.

Table of Contents (TOC)

The source of truth for all the chapters in the novel. It can come from a main URL, which is requested and saved (if the TOC spans more than one page, the additional pages are requested and saved too), or from HTML files added directly. All chapters are generated automatically from this TOC.

Chapters

A chapter comes from a URL; it is requested and saved as a file on your local machine. Once the file is saved, it does not need to be requested again. From a chapter you can get its title and its content.

Decoder

A decoder is a set of rules used to extract information, such as links, content, and titles, from TOC and chapter pages. The host key indicates which set of rules to use. Hosts can be added manually or generated from a TOC (Table of Contents) URL.

Below is a general example of a decoder configuration:

{
    "host": "novelbin.me",
    "has_pagination": false,
    "title": {
        "selector": "h2 a.chr-title",
        "array": false,
        "extract": {
            "type": "attr",
            "key": "title"
        }
    },
    "content": {
        "element": "div",
        "id": "chr-content",
        "array": true
    },
    "index": {
        "selector": "ul.list-chapter li a",
        "array": true
    }
}
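
Since the host key decides which rules apply, choosing a decoder for a URL might look roughly like the sketch below. The find_decoder function, the DECODERS list, the example URL path, and the handling of a leading "www." are all assumptions made for illustration; the project's own lookup may work differently.

```python
from urllib.parse import urlparse

# Hypothetical in-memory list of decoder configurations, one per host.
DECODERS = [
    {"host": "novelbin.me", "has_pagination": False},
    {"host": "example.com", "has_pagination": True},
]


def find_decoder(url, decoders):
    """Return the first decoder whose "host" matches the URL's host, else None."""
    host = urlparse(url).netloc.lower()
    # Assumption: "www.novelbin.me" and "novelbin.me" should share one decoder.
    if host.startswith("www."):
        host = host[len("www."):]
    for decoder in decoders:
        if decoder["host"] == host:
            return decoder
    return None


# The path in this URL is made up; only the host matters for the lookup.
decoder = find_decoder("https://novelbin.me/novel-book/some-novel", DECODERS)
print(decoder["host"])
```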

Explanation

Decoders rely on BeautifulSoup to parse HTML. The most flexible way to locate elements is the selector key, which accepts a CSS selector. Alternatively, you can specify:

  • element (e.g., “div”, “a”),

  • id (e.g., “main-content”),

  • class (e.g., “chapter-text”),

  • attributes (e.g., {“data-type”: “title”}).

If has_pagination is set to true, the decoder may look for a next_page key to navigate through multiple TOC pages. The index key is used to find the href of each chapter in the TOC. The title and content keys are used to extract the chapter’s title and main text, respectively.
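
To illustrate how has_pagination, index, and next_page could work together, here is a minimal sketch that walks a paginated TOC. It is not the project's implementation: the collect_chapter_urls function, the in-memory PAGES dictionary that stands in for HTTP requests, the example.com URLs, and the a.next selector are all invented for this example.

```python
from bs4 import BeautifulSoup

# Two fake TOC pages standing in for HTTP responses (made-up URLs and markup).
PAGES = {
    "https://example.com/toc?page=1": """
        <ul class="list-chapter">
          <li><a href="https://example.com/ch-1">Chapter 1</a></li>
          <li><a href="https://example.com/ch-2">Chapter 2</a></li>
        </ul>
        <a class="next" href="https://example.com/toc?page=2">Next</a>
    """,
    "https://example.com/toc?page=2": """
        <ul class="list-chapter">
          <li><a href="https://example.com/ch-3">Chapter 3</a></li>
        </ul>
    """,
}

# Hypothetical decoder; the next_page selector is invented for this markup.
DECODER = {
    "has_pagination": True,
    "index": {"selector": "ul.list-chapter li a"},
    "next_page": {"selector": "a.next"},
}


def collect_chapter_urls(toc_url, decoder, pages):
    """Walk the TOC pages and return every chapter href in page order."""
    urls = []
    seen = set()
    while toc_url and toc_url not in seen:
        seen.add(toc_url)  # guard against a "next" link that loops back
        soup = BeautifulSoup(pages[toc_url], "html.parser")
        for link in soup.select(decoder["index"]["selector"]):
            urls.append(link.get("href"))
        if not decoder["has_pagination"]:
            break
        next_link = soup.select_one(decoder["next_page"]["selector"])
        toc_url = next_link.get("href") if next_link else None
    return urls


chapter_urls = collect_chapter_urls("https://example.com/toc?page=1", DECODER, PAGES)
print(chapter_urls)
```

The loop stops when a page has no next link, so the last TOC page needs no special marker.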

Keys and Possible Values

A decoder may contain the following main keys:

  • has_pagination: Indicates whether the TOC has multiple pages.

  • index: Collects all chapter links from the TOC (generally, the href attribute from an <a> tag).

  • next_page: Identifies the URL for the next TOC page if has_pagination is true.

  • title: The chapter title.

  • content: The chapter content.

Each of these keys can be configured using the structure shown below:

"key": {
    "element": null,
    "id": null,
    "class": null,
    "attributes": null,
    "selector": null,
    "array": false,
    "extract": {
        "type": "text",
        "key": null
    }
}

Where:

  • element: Name of the HTML tag (e.g., “div”, “span”, “p”).

  • id: The id attribute of that HTML tag.

  • class: The class attribute of that HTML tag.

  • attributes: Any additional attributes (e.g., {“data-name”: “xyz”}).

  • selector: A CSS selector string (e.g., “div.chapter-content h2”). See BeautifulSoup Documentation for details.

  • array: If true, returns a list of matched tags; if false, returns a single tag.

  • extract: Defines how data is extracted from the matched tag(s):

      - “type”: “text” extracts the inner text.

      - “type”: “attr” extracts the attribute named in “key” (e.g., “href”, “title”).
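
One way to picture how these fields combine is the sketch below, which maps a rule onto standard BeautifulSoup calls. The helper names (find_tags, extract_value, apply_rule) are hypothetical, and details such as stripping whitespace, preferring selector over the other fields, or returning None when nothing matches are assumptions, not documented behavior.

```python
from bs4 import BeautifulSoup


def find_tags(soup, rule):
    """Locate the tags for one rule; always returns a list."""
    if rule.get("selector"):
        return soup.select(rule["selector"])
    # Otherwise combine element/id/class/attributes into one find_all call.
    attrs = dict(rule.get("attributes") or {})
    if rule.get("id"):
        attrs["id"] = rule["id"]
    if rule.get("class"):
        attrs["class"] = rule["class"]
    return soup.find_all(rule.get("element"), attrs=attrs)


def extract_value(tag, extract):
    """Apply the "extract" settings to a single tag."""
    if extract.get("type") == "attr":
        return tag.get(extract["key"])
    return tag.get_text(strip=True)  # assumption: surrounding whitespace is dropped


def apply_rule(soup, rule):
    """Return one value, or a list of values when "array" is true."""
    extract = rule.get("extract") or {"type": "text", "key": None}
    values = [extract_value(tag, extract) for tag in find_tags(soup, rule)]
    if rule.get("array"):
        return values
    return values[0] if values else None


html = '<h2><a class="chr-title" href="/ch-1" title="Chapter 1"><span>Chapter 1</span></a></h2>'
soup = BeautifulSoup(html, "html.parser")

title_rule = {
    "selector": "h2 a.chr-title",
    "array": False,
    "extract": {"type": "attr", "key": "title"},
}
print(apply_rule(soup, title_rule))
```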

Examples

Title Extraction

<h2>
  <a class="chr-title"
     href="https://url-of-the-chapter"
     title="Chapter 1">
     <span class="chr-text">Chapter 1</span>
  </a>
</h2>

"title": {
    "selector": "h2 a.chr-title",
    "array": false,
    "extract": {
        "type": "attr",
        "key": "title"
    }
}

  • selector: “h2 a.chr-title” finds the <a> tag with class chr-title inside an <h2> element.

  • array: false because we expect a single result.

  • extract.type: “attr” and extract.key: “title” to retrieve the title attribute.
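
As a rough equivalent, the title rule above corresponds to the following direct BeautifulSoup calls. This is only a sketch of what the rule expresses, not the tool's internal code.

```python
from bs4 import BeautifulSoup

html = """
<h2>
  <a class="chr-title"
     href="https://url-of-the-chapter"
     title="Chapter 1">
     <span class="chr-text">Chapter 1</span>
  </a>
</h2>
"""

soup = BeautifulSoup(html, "html.parser")

# "array": false -> select_one returns the first match (or None).
link = soup.select_one("h2 a.chr-title")

# "extract": {"type": "attr", "key": "title"} -> read the title attribute.
title = link["title"]
print(title)
```

With "type": "text" instead, the value would come from the tag's inner text, for example link.get_text(strip=True).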

Content Extraction

<div id="chr-content" class="chr-c" style="...">
  Chapter content goes here...
</div>

"content": {
    "element": "div",
    "id": "chr-content",
    "array": true
}

  • element: “div” and id: “chr-content” find the <div> that holds the main content.

  • array: true to return multiple pieces of content if needed.
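
The content rule can be approximated with find_all, as in this sketch. The <p> paragraphs inside the div and the newline-joined text are assumptions made for the example; how the tool stores the extracted content is not specified here.

```python
from bs4 import BeautifulSoup

# Sample markup; the <p> paragraphs are invented to give the div some structure.
html = """
<div id="chr-content" class="chr-c">
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# "element": "div" + "id": "chr-content", with "array": true -> a list of matches.
blocks = soup.find_all("div", id="chr-content")

# Default extraction is text; join the text nodes of each block with newlines.
content = [block.get_text("\n", strip=True) for block in blocks]
print(content)
```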

TOC Link Extraction

<ul class="list-chapter">
  <li>
    <a href="https://url-of-chapter-1" title="Chapter 1">
      <span class="nchr-text chapter-title">Chapter 1</span>
    </a>
  </li>
</ul>

"index": {
    "selector": "ul.list-chapter li a",
    "array": true
}

  • selector: “ul.list-chapter li a” gathers all <a> tags that point to chapter URLs.

  • array: true because many chapters could be listed.
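
A minimal sketch of the same lookup with BeautifulSoup follows. The second list item is added for illustration, and reading href explicitly is an assumption based on the description of the index key, since the rule itself has no extract block.

```python
from bs4 import BeautifulSoup

# Sample TOC markup; the second <li> is added to show multiple matches.
html = """
<ul class="list-chapter">
  <li><a href="https://url-of-chapter-1" title="Chapter 1"><span>Chapter 1</span></a></li>
  <li><a href="https://url-of-chapter-2" title="Chapter 2"><span>Chapter 2</span></a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# "array": true -> select returns every matching <a> tag, in document order.
anchors = soup.select("ul.list-chapter li a")

# Assumption: the chapter URL is taken from each anchor's href attribute.
links = [a["href"] for a in anchors]
print(links)
```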