Basic Concepts
==============

Novel
-----

A novel has at least one Table of Contents and a set of chapters. It also carries metadata that can be saved, such as author, language, tags, and creation or end date.

Table of Contents (TOC)
-----------------------

The source of truth for all the chapters the novel will have. It can come from a main URL (which is requested and saved; if the TOC spans more than one page, the additional pages are requested and saved as well), or the HTML files can be added directly from a file. All chapters are autogenerated from this TOC.

Chapters
--------

A chapter comes from a URL; it is requested and saved as a file on your local machine. Once the file is saved, it does not need to be requested again. From a chapter you can get the title and the chapter content.

Decoder
--------

A decoder is a set of rules used to extract information from a chapter, such as links, content, and titles. The **host** key indicates which set of rules to use. Hosts can be added manually or generated from a TOC (Table of Contents) URL.

Below is a general example of a decoder configuration:

.. code-block:: json

    {
        "host": "novelbin.me",
        "has_pagination": false,
        "title": {
            "selector": "h2 a.chr-title",
            "array": false,
            "extract": {
                "type": "attr",
                "key": "title"
            }
        },
        "content": {
            "element": "div",
            "id": "chr-content",
            "array": true
        },
        "index": {
            "selector": "ul.list-chapter li a",
            "array": true
        }
    }

Explanation
^^^^^^^^^^^^

Decoders rely on **BeautifulSoup** to parse HTML. The most flexible way to locate elements is by using **selector**, which supports full CSS selectors. However, you can also specify:

- **element** (e.g., `"div"`, `"a"`),
- **id** (e.g., `"main-content"`),
- **class** (e.g., `"chapter-text"`),
- **attributes** (e.g., `{"data-type": "title"}`).

If **has_pagination** is set to `true`, the decoder may look for a **next_page** key to navigate through multiple TOC pages.
The **index** key is used to find the `href` of each chapter in the TOC.
The **title** and **content** keys are used to extract the chapter’s title and main text, respectively.

Keys and Possible Values
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A decoder may contain the following main keys:

- **has_pagination**: Indicates whether the TOC has multiple pages.
- **index**: Collects all chapter links from the TOC (generally, the `href` attribute from an `<a>` tag).
- **next_page**: Identifies the URL for the next TOC page if **has_pagination** is `true`.
- **title**: The chapter title.
- **content**: The chapter content.

Each of these keys can be configured using the structure shown below:

.. code-block:: json

    "key": {
        "element": null,
        "id": null,
        "class": null,
        "attributes": null,
        "selector": null,
        "array": false,
        "extract": {
            "type": "text",
            "key": null
        }
    }

Where:

- **element**: Name of the HTML tag (e.g., `"div"`, `"span"`, `"p"`).
- **id**: The `id` attribute of that HTML tag.
- **class**: The `class` attribute of that HTML tag.
- **attributes**: Any additional attributes (e.g., `{"data-name": "xyz"}`).
- **selector**: A CSS selector string (e.g., `"div.chapter-content h2"`). See the BeautifulSoup documentation for details.
- **array**: If `true`, returns a list of matched tags; if `false`, returns a single tag.
- **extract**: Defines how data is extracted from the matched tag(s):

  - `"type": "text"` extracts inner text.
  - `"type": "attr"` extracts a given attribute specified in `"key"` (e.g., `"href"`, `"title"`).

Examples
^^^^^^^^

**Title Extraction**

.. code-block:: html

    <h2>
        <a class="chr-title" title="Chapter 1">Chapter 1</a>
    </h2>

.. code-block:: json

    "title": {
        "selector": "h2 a.chr-title",
        "array": false,
        "extract": {
            "type": "attr",
            "key": "title"
        }
    }

- **selector**: `"h2 a.chr-title"` finds the `<a>` tag with class `chr-title` inside an `<h2>` element.
- **array**: `false` because we expect a single result.
- **extract.type**: `"attr"` and **extract.key**: `"title"` to retrieve the `title` attribute.

**Content Extraction**

.. code-block:: html

    <div id="chr-content">
        Chapter content goes here...
    </div>

.. code-block:: json

    "content": {
        "element": "div",
        "id": "chr-content",
        "array": true
    }

- **element**: `"div"` and **id**: `"chr-content"` together locate the main content.
- **array**: `true` to return multiple pieces of content if needed.

**TOC Link Extraction**

.. code-block:: html

    <ul class="list-chapter">
        <li><a href="/chapter-1">Chapter 1</a></li>
        <li><a href="/chapter-2">Chapter 2</a></li>
    </ul>

.. code-block:: json

    "index": {
        "selector": "ul.list-chapter li a",
        "array": true
    }

- **selector**: `"ul.list-chapter li a"` gathers all `<a>` tags pointing to chapter URLs.
- **array**: `true` because many chapters could be listed.

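
**Relating element/id/class/attributes to a selector**

The examples above locate tags either with **selector** or with the **element**, **id**, **class**, and **attributes** keys. The sketch below is one way to see how the two styles relate: it fills in the documented defaults for a rule and builds a roughly equivalent CSS selector string. It is an illustration only, not the project's implementation; the names `DEFAULT_RULE`, `normalize_rule`, and `rule_to_selector` are hypothetical, and values containing special CSS characters would need escaping that is not handled here.

```python
from copy import deepcopy

# Default shape of a rule, mirroring the "key" structure documented above.
DEFAULT_RULE = {
    "element": None,
    "id": None,
    "class": None,
    "attributes": None,
    "selector": None,
    "array": False,
    "extract": {"type": "text", "key": None},
}


def normalize_rule(rule):
    """Return a copy of `rule` with every documented key present (hypothetical helper)."""
    full = deepcopy(DEFAULT_RULE)
    for key, value in rule.items():
        if key == "extract":
            # Merge so a partial "extract" keeps the remaining defaults.
            full["extract"].update(value or {})
        else:
            full[key] = value
    return full


def rule_to_selector(rule):
    """Build a CSS selector from element/id/class/attributes (hypothetical helper).

    A rule that already has a "selector" is returned unchanged.
    """
    rule = normalize_rule(rule)
    if rule["selector"]:
        return rule["selector"]
    parts = [rule["element"] or ""]
    if rule["id"]:
        parts.append("#" + rule["id"])
    if rule["class"]:
        # A class value may hold several space-separated class names.
        parts.extend("." + name for name in rule["class"].split())
    for name, value in (rule["attributes"] or {}).items():
        parts.append('[{}="{}"]'.format(name, value))
    selector = "".join(parts)
    if not selector:
        raise ValueError("rule has no element, id, class, attributes, or selector")
    return selector


# Rules taken from the decoder example in this section.
content_rule = {"element": "div", "id": "chr-content", "array": True}
title_rule = {
    "selector": "h2 a.chr-title",
    "array": False,
    "extract": {"type": "attr", "key": "title"},
}

print(rule_to_selector(content_rule))  # div#chr-content
print(rule_to_selector(title_rule))    # h2 a.chr-title
```

With a string like `div#chr-content` in hand, the same content rule could be written using only the **selector** key, which may be convenient when a site needs a more specific match than tag name and `id` alone.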