Run OCR / Text Recognition On Selected Area On Screen

Version 5.177 adds a new predefined action "OCR / Recognize / Extract Text from Clipboard Contents / Image". It combines nicely with the "Capture Screenshot To Clipboard" action, allowing you to run OCR on a specific area of your screen.

This can be helpful in many scenarios, for example when text is not selectable and cannot easily be copied.

Here is a simple example preset that captures a part of your screen, automatically runs OCR on it, and copies the result to your clipboard. It is bound to the ctrl+opt+cmd+O keyboard shortcut (but of course you can use any of BTT's triggers to run the same action sequence).
ocr-keyboard-shortcutv2.bttpreset (6.3 KB)

Hey Andreas, this is great.

I've also seen that there's a "Join Found Text With String:" config. Is there a way, instead of joining with a string, to extract the text exactly as it appears on the screen/screenshot? E.g. if the text is on two lines, the result should also be on two lines.

I've tried to work around this using "Transform Clipboard Contents with Java Script":

async (clipboardContentString) => {
  // Replace every literal \n sequence with a real newline.
  return clipboardContentString.replace(/\\n/g, '\n');
}

but in the end this won't work every time: if I use the "Join Found Text With String" config and join the text with, let's say, "\n", then any literal "\n" that already occurs in the recognized text will also be replaced with a newline.
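To make the collision concrete, here is a tiny sketch (the snippet strings are made up for illustration): if a screenshot of code contains a literal \n, joining the OCR results with \n and then un-escaping turns it into an extra line break:

```javascript
// Two OCR snippets; the second genuinely contains the characters \n
// (e.g. it came from a screenshot of code):
const snippets = ["const sep =", "'\\n';"];

// Joined with the literal two-character sequence \n as the delimiter:
const joined = snippets.join("\\n");

// Un-escaping every literal \n also rewrites the \n that belongs
// to the code itself, producing three lines instead of two:
const result = joined.replace(/\\n/g, "\n");
```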

Consider the following as an example string on which I need to do OCR:

Thank you in advance.

the problem is that the system APIs don't give me information based on lines, just the coordinates of the found snippets on screen. These often coincide with lines, but not always.

I could provide these coordinates in a variable; then we could create a script that tries to join them somewhat intelligently.

In 5.181 I have added a new option "Try to merge into lines based on screen coordinates".

When you combine this with \n as the join character, it might work as you expect! It tries to find lines based on the coordinates.

Additionally, when saving to a variable, BTT will now make a secondary variable with a "-coordinates" suffix available that contains JSON describing the detailed layout of the recognized text.
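The exact JSON schema of that "-coordinates" variable isn't shown in this thread, so the following is only a sketch under assumptions: it assumes each recognized snippet arrives as an object shaped like {text, x, y}, and merges snippets whose vertical positions fall within a tolerance into one line. The field names and the mergeIntoLines helper are hypothetical.

```javascript
// Hypothetical sketch: assumes the "-coordinates" JSON is an array of
// snippets shaped like {text, x, y}; BTT's real schema may differ.
function mergeIntoLines(snippets, yTolerance = 5) {
  // Sort top-to-bottom, then left-to-right.
  const sorted = [...snippets].sort((a, b) => (a.y - b.y) || (a.x - b.x));
  const lines = [];
  for (const s of sorted) {
    const last = lines[lines.length - 1];
    // A snippet whose y is close to the current line's y joins that line.
    if (last && Math.abs(last.y - s.y) <= yTolerance) {
      last.parts.push(s.text);
    } else {
      lines.push({ y: s.y, parts: [s.text] });
    }
  }
  return lines.map((l) => l.parts.join(" ")).join("\n");
}
```

For example, mergeIntoLines([{text: "Hello", x: 0, y: 10}, {text: "world", x: 40, y: 12}, {text: "Bye", x: 0, y: 30}]) would put "Hello world" on one line and "Bye" on the next.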

Here is an updated preset:
ocr-keyboard-shortcutv2.bttpreset (6.3 KB)

//edit: 5.181 joined in the reverse order; 5.182 should fix that.


99% of the time it works, but sometimes there's a dot after the end of the first line (using the same example as above). Not a big deal.

Is there a way to disable the sound? I'm aware that I can stop it by disabling "Play user interface sound effects" under System Settings → Sound, but if BTT is using /usr/sbin/screencapture, could you add a config option to the "Capture Screenshot to Clipboard" action that uses the -x flag?

Unfortunately, artifacts like dots or slashes sometimes happen; BTT just returns what it gets from the macOS OCR APIs. There is no way to further constrain this as far as I know ;-(

Yes, you can use the "Capture Screenshot (Configurable)" action like this:


I'm trying to use Polish language, so "pl" in Language field, but it doesn't recognize Polish diacritic characters (ąćęłńóśźżĄĆĘŁŃÓŚŹŻ).

Unfortunately it looks like only the following languages are supported by the macOS OCR on Sequoia (via VNRecognizeTextRequest):

["en-US", "fr-FR", "it-IT", "de-DE", "es-ES", "pt-BR", "zh-Hans", "zh-Hant", "yue-Hans", "yue-Hant", "ko-KR", "ja-JP", "ru-RU", "uk-UA", "th-TH", "vi-VT", "ar-SA", "ars-SA"]

So maybe it is worth thinking about using the Tesseract OCR engine (GitHub - tesseract-ocr/tesseract: Tesseract Open Source OCR Engine (main repository))? It is open source and recognizes more than 100 languages.

Sorry, that won't come to BTT. I have spent a lot of time with Tesseract in the past and it is absolute hell (but of course also super powerful: we used it to read text from photos taken of industrial machine labels) :slight_smile:

(Maybe it got better/easier to use in recent versions; in that case you can use the tesseract command-line utility and call it from BTT if required.)

Something like this (after installing it with brew install tesseract):

screencapture -i ~/Downloads/screenshot.png && tesseract ~/Downloads/screenshot.png /tmp/ocr_output && cat /tmp/ocr_output.txt

Thanks Andreas for adding this new OCR / Text Recognition on Selected Area On Screen Action!

@Andreas_Hegenberg Would you consider adding an option or a new Action to perform OCR using OpenAI's vision API? Users could provide their own API key to use it.

In case it's helpful to anyone, a while back I created a simple CLI in Python to easily extract text from a screenshot. What makes it really helpful to me is that I can pass a --method argument to specify if I want to use tesseract or openai for OCR using the vision API.

The benefit of the openai option is that it's excellent at extracting complicated text, preserving line breaks and indentation. For example, it works great when copying code from a tutorial on a YouTube video.

The benefit of the tesseract method is that it's free and runs entirely locally. For example, it works great for copying URLs from YouTube videos.

The drawbacks of the openai option are that it costs a fraction of a cent per use and that you have to trust OpenAI with your data. Their privacy policy states that they will delete the image, but I assume everything I send them will be used for model training.

I can trigger this CLI with keyboard shortcuts in BTT.

To use the tesseract method:

python ~/projects/screenshot/ocr_tesseract_or_openai.py --method tesseract

To use the openai method:

python ~/projects/screenshot/ocr_tesseract_or_openai.py --method openai

Here's the ocr_tesseract_or_openai.py script:

import os
from pathlib import Path
from datetime import datetime
import base64
import argparse
from PIL import Image
import pytesseract
import pyperclip
from openai import OpenAI, OpenAIError
import instructor
from pydantic import BaseModel, Field
from rich import pretty
from rich.console import Console
from rich.traceback import install
import shutil
import logging
from rich.logging import RichHandler

# Install rich traceback handler
install(show_locals=True)
pretty.install()

# Initialize console
console = Console()

# Configure logging
HOME_PATH = Path.home()

# Ensure the screenshots directory exists
SCREENSHOT_DIR = HOME_PATH / "projects/screenshots"
SCREENSHOT_DIR.mkdir(parents=True, exist_ok=True)

LOG_FILE_PATH = SCREENSHOT_DIR / "ocr_script.log"


# Create a custom logger
logger = logging.getLogger("ocr_script")
logger.setLevel(logging.DEBUG)  # Capture all levels of logs (handlers filter further)

# Create handlers
console_handler = RichHandler(rich_tracebacks=True)
console_handler.setLevel(logging.INFO)  # Set to INFO to avoid debug logs on console

file_handler = logging.FileHandler(LOG_FILE_PATH)
file_handler.setLevel(logging.DEBUG)  # Capture all logs in file

# Create formatters and add them to handlers
console_formatter = logging.Formatter("%(message)s", datefmt="[%X]")
file_formatter = logging.Formatter(
    "%(asctime)s - %(name)s - %(levelname)s - %(message)s",
    datefmt="[%Y-%m-%d %H:%M:%S]",
)
console_handler.setFormatter(console_formatter)
file_handler.setFormatter(file_formatter)

# Add handlers to the logger
logger.addHandler(console_handler)
logger.addHandler(file_handler)


# Function to encode the image
def encode_image(image_path):
    logger.debug("Function encode_image(image_path=%s) called.", image_path)
    with open(image_path, "rb") as image_file:
        encoded = base64.b64encode(image_file.read()).decode("utf-8")
    logger.debug("Image encoded successfully.")
    return encoded


def take_screenshot():
    logger.debug("Function take_screenshot() called.")
    if shutil.which("screencapture") is None:
        logger.error("The 'screencapture' command is not available on this system.")
        return None

    timestamp = datetime.now().strftime("%Y%m%d_%H_%M_%S")
    screenshot_path = SCREENSHOT_DIR / f"{timestamp}_screenshot.png"
    try:
        # macOS command for an interactive screenshot
        os.system(f'screencapture -i "{screenshot_path}"')  # quote the path in case it contains spaces
        # Check if the screenshot was taken (i.e., the file exists)
        if not os.path.exists(screenshot_path):
            logger.warning("Screenshot was canceled.")
            return None
        logger.info(f"Screenshot saved to {screenshot_path}")
        return screenshot_path
    except Exception:
        logger.exception("Failed to take screenshot:")
        return None


# Pydantic model to represent the text extracted from an image
class ExtractedText(BaseModel):
    text: str = Field(..., description="The text extracted from the image")


def extract_text_with_openai(image_path):
    logger.debug("Function extract_text_with_openai(image_path=%s) called.", image_path)
    base64_image = encode_image(image_path)

    # Patch the OpenAI client
    client = instructor.from_openai(OpenAI())

    try:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "text",
                            "text": "Extract the text from this image. If the text contains code, preserve the formatting.",
                        },
                        {
                            "type": "image_url",
                            "image_url": {
                                "url": f"data:image/png;base64,{base64_image}"
                            },
                        },
                    ],
                }
            ],
            max_tokens=2000,
            response_model=ExtractedText,
        )
        logger.info("Text extracted using OpenAI.")
        logger.debug("Extracted text: %s", response.text)
        return response.text
    except OpenAIError:
        logger.exception("Failed to extract text using OpenAI:")
        return None


def extract_text_with_pytesseract(image_path):
    logger.debug(
        "Function extract_text_with_pytesseract(image_path=%s) called.", image_path
    )
    try:
        image = Image.open(image_path)
        # If Tesseract is not in PATH, specify the full path:
        pytesseract.pytesseract.tesseract_cmd = r"/opt/homebrew/bin/tesseract"
        text = pytesseract.image_to_string(image, config="--oem 3 --psm 6")
        logger.info("Text extracted using pytesseract.")
        logger.debug("Extracted text: %s", text)
        return text
    except Exception:
        logger.exception("Failed to extract text using pytesseract:")
        return None  # Return None instead of the exception object


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="OCR Script")
    parser.add_argument(
        "--method",
        choices=["openai", "tesseract"],
        default="tesseract",
        help="Specify the OCR method to use: 'openai' or 'tesseract'. Default is 'tesseract'.",
    )
    args = parser.parse_args()

    # Log script start with arguments
    logger.info("*" * 80)
    logger.info("Script called with arguments: %s", args)
    logger.info("*" * 80)

    image_path = take_screenshot()
    if image_path:
        if args.method == "openai":
            extracted_text = extract_text_with_openai(image_path)
        else:
            extracted_text = extract_text_with_pytesseract(image_path)

        if extracted_text is not None:
            print(extracted_text, end="\n\n")
            pyperclip.copy(extracted_text)
            logger.info("Text has been copied to the clipboard.")
        else:
            logger.warning("No text extracted. Exiting.")
    else:
        logger.warning("No screenshot taken. Exiting.")

Great idea!
I added an "Attach Image File" option to BTT's ChatGPT actions in 5.196.


Very nice!

[
  {
    "BTTActionCategory" : 0,
    "BTTLastUpdatedAt" : 1739441494.4997101,
    "BTTTriggerType" : 0,
    "BTTTriggerClass" : "BTTTriggerTypeKeyboardShortcut",
    "BTTUUID" : "FBB9FCE3-79B7-4B0B-AFBB-A2495CB3265F",
    "BTTPredefinedActionType" : 366,
    "BTTPredefinedActionName" : "Empty Placeholder",
    "BTTAdditionalConfiguration" : "8388608",
    "BTTKeyboardShortcutKeyboardType" : 2302,
    "BTTTriggerOnDown" : 1,
    "BTTLayoutIndependentChar" : "F11",
    "BTTEnabled" : 1,
    "BTTEnabled2" : 1,
    "BTTShortcutKeyCode" : 103,
    "BTTShortcutModifierKeys" : 8388608,
    "BTTOrder" : 10,
    "BTTAutoAdaptToKeyboardLayout" : 0,
    "BTTAdditionalActions" : [
      {
        "BTTActionCategory" : 0,
        "BTTLastUpdatedAt" : 1739441492.1690221,
        "BTTTriggerParentUUID" : "FBB9FCE3-79B7-4B0B-AFBB-A2495CB3265F",
        "BTTIsPureAction" : true,
        "BTTTriggerClass" : "BTTTriggerTypeKeyboardShortcut",
        "BTTUUID" : "AA8723EC-2856-4AAD-8A43-E462ACAF11EA",
        "BTTPredefinedActionType" : 500,
        "BTTPredefinedActionName" : "Capture Screenshot to Clipboard",
        "BTTKeyboardShortcutKeyboardType" : 0,
        "BTTEnabled" : 1,
        "BTTEnabled2" : 1,
        "BTTShortcutKeyCode" : -1,
        "BTTOrder" : 1,
        "BTTAutoAdaptToKeyboardLayout" : 0
      },
      {
        "BTTActionCategory" : 0,
        "BTTLastUpdatedAt" : 1739441492.171047,
        "BTTTriggerParentUUID" : "FBB9FCE3-79B7-4B0B-AFBB-A2495CB3265F",
        "BTTIsPureAction" : true,
        "BTTTriggerClass" : "BTTTriggerTypeKeyboardShortcut",
        "BTTUUID" : "D017E841-F8E7-464B-88DD-7D31B1F369D1",
        "BTTPredefinedActionType" : 129,
        "BTTPredefinedActionName" : "Pause Execution  or  Delay Next Action (blocking)",
        "BTTDelayNextActionBy" : "0.5",
        "BTTKeyboardShortcutKeyboardType" : 0,
        "BTTEnabled" : 1,
        "BTTEnabled2" : 1,
        "BTTShortcutKeyCode" : -1,
        "BTTOrder" : 2,
        "BTTAutoAdaptToKeyboardLayout" : 0
      },
      {
        "BTTActionCategory" : 0,
        "BTTLastUpdatedAt" : 1739441742.9538479,
        "BTTTriggerParentUUID" : "FBB9FCE3-79B7-4B0B-AFBB-A2495CB3265F",
        "BTTIsPureAction" : true,
        "BTTTriggerClass" : "BTTTriggerTypeKeyboardShortcut",
        "BTTUUID" : "F96CC5A8-A918-460C-BD75-3636D87FD81B",
        "BTTPredefinedActionType" : 471,
        "BTTPredefinedActionName" : "Use ChatGPT (Optionally on Selected Text). Copy Result to Clipboard  or  Variable.",
        "BTTAdditionalActionData" : {
          "BTTChatGPTTransformerShowMiniHUD" : true,
          "BTTChatGPTTransformerExampleInput" : "Some Test Text",
          "BTTChatGPTUseStreaming" : true,
          "BTTChatGPTKeepSelectedText" : true,
          "BTTChatGPTNumberOfHistoryItems" : 0,
          "BTTChatGPTTransformerUserPrompt" : "Extract all of the text from the screenshot.",
          "BTTChatGPTModel" : "gpt-4o",
          "BTTChatGPTAppendImage" : 1,
          "BTTChatGPTTransformerSystemPrompt" : "You are an expert OCR AI assistant. Your purpose is to extract all text from provided images and screenshots.",
          "BTTChatGPTCopyResponseToVariableName" : "BTTChatGPTResponse",
          "BTTChatGPTAppendSelectedText" : true,
          "BTTChatGPTImageType" : 0,
          "BTTChatGPTCopyResponseToClipboard" : true,
          "BTTChatGPTCopyResponseToVariable" : true,
          "BTTChatGPTModelSelection" : "custom"
        },
        "BTTKeyboardShortcutKeyboardType" : 0,
        "BTTEnabled" : 1,
        "BTTEnabled2" : 1,
        "BTTShortcutKeyCode" : -1,
        "BTTOrder" : 3,
        "BTTAutoAdaptToKeyboardLayout" : 0
      }
    ]
  }
]

This is awesome. I have been using Textinator but this seems orders of magnitude faster so far :fist_right:t2:

Oh, your implementation is way more elegant than my script below. You really are a great developer! Can you teach me how you did it?

Here’s my script:

#!/usr/bin/env bash

# Save the screenshot to the specified path.
screencapture -i /tmp/screenshot.png

# Use qlmanage to open the screenshot and bring it to the front.
qlmanage -p /tmp/screenshot.png >/dev/null 2>&1 &

# Automatically copy the screenshot to the clipboard.
osascript <<EOF
set imageFile to POSIX file "/tmp/screenshot.png"
set theClipboard to (read imageFile as JPEG picture)
set the clipboard to theClipboard
EOF

# Optimize: quickly check for qlmanage and keep it on top.
osascript <<EOF
tell application "System Events"
    set timeoutDate to (current date) + 2 -- shorten the timeout period
    repeat while ((current date) < timeoutDate)
        delay 0.1 -- adjust the detection interval to avoid high CPU usage
        if ((name of processes) contains "qlmanage") then
            tell application process "qlmanage"
                set frontmost to true
            end tell
            exit repeat
        end if
    end repeat
end tell
EOF

Hey Andreas, is there a way to not keep the captured screenshot in the clipboard? I'm noticing something strange: if I do OCR in MS Teams, for example, no screenshot ends up in the clipboard, but if I do OCR in iTerm2 or MS Edge (and I guess other apps), both the screenshot and the recognized text are saved.

Do you maybe have BTT's clipboard manager disabled for Teams? I just tried, but it is saving the screenshot correctly here.

Possibly you could get rid of the screenshot by adding these two actions:

I think I have not disabled BTT's clipboard manager for MS Teams, since I can copy/paste from/to MS Teams. If it were disabled, how would the OCR work?

I've tried these two actions, but either I have not configured them correctly or it doesn't work:

The OCR action uses the content of your system clipboard, so it doesn't matter whether it is in BTT's clipboard manager or not. That's why, in theory, disabling the observation should be fine.

I tried by modifying the preset shared above:

Which seems to work ok here.

Could you export your action sequence? Then I can try here

Is there maybe a general confusion about BTT's clipboard manager? Even if not using BTT's clipboard manager you should always be able to copy and paste. BTT's clipboard manager is only for entries that have been copied in the past. For the most current copy always the standard macOS copy/paste is used.

To see whether BTT's clipboard manager is disabled for Teams, you can check Teams in the sidebar (if you don't have Teams in the sidebar, it cannot be disabled).

I had to move the "Pause Until Clipboard Changes / Wait For Change Of Clipboard Contents" action between the Disable and Enable actions, and now no screenshot is saved no matter the app (I'm not sure why this had that effect; if you have time you can share a quick tl;dr). Thank you!

This is clear:

Even if not using BTT's clipboard manager you should always be able to copy and paste

I was not aware of this:

For the most current copy always the standard macOS copy/paste is used.

The Clipboard Manager is not disabled for MS Teams (I do have it in the sidebar). If you think this is worth investigating, and if there's a way, I can assist.
