diff --git a/README.md b/README.md index c6af9d5..1f45c3a 100644 --- a/README.md +++ b/README.md @@ -135,6 +135,12 @@ After analysis, generate a single Markdown report: - `python3 -m sergio_instagram_messaging.generate_dm_report --analysis-dir /root/tmp/socialmediatorr-agent-analysis` +### Plain-English deep report (Mermaid diagrams) + +Generate the deeper “no raw quotes” report directly from an Instagram export folder: + +- `python3 -m sergio_instagram_messaging.generate_dm_report_detailed --export-input /path/to/export-root --out /root/tmp/dm_history_report_en_detailed.md` + ## Webhooks (new messages → auto-reply) Meta webhooks are two steps: diff --git a/reports/socialmediatorr/dm_history_report_en_detailed.md b/reports/socialmediatorr/dm_history_report_en_detailed.md new file mode 100644 index 0000000..e8902c1 --- /dev/null +++ b/reports/socialmediatorr/dm_history_report_en_detailed.md @@ -0,0 +1,387 @@ +# Socialmediatorr Instagram DM History : Plain-English Deep Report + +## DM History Deep Report + +**Subject:** Instagram direct messages for `@socialmediatorr` +**Version:** v1.0 (STYLE BIBLE EN 3.0GM) +**Date:** 2025-12-24 +**Status:** REVIEW REQUIRED +**Citation:** `if://report/socialmediatorr/instagram/dm-history/` +**Author:** Danny Stocker | InfraFabric Research + +### How This Report Was Made + +> This is an automated count of patterns. It is not a therapy note and it is not a sales ledger. + +This document was generated by reading an Instagram data export and counting repeat patterns over time. It avoids quoting private client messages and it avoids storing personal identities. + +--- + +**Context:** This inbox contains a high-volume message-and-reply system over 429 days. + +> Your messaging system is working as a volume engine. The weak point is consistency at the moments where people ask to buy or book. + +The purpose of this report is practical: define what to keep, what to remove, and what to automate safely—without damaging trust. 
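The privacy stance above ("no raw quotes, no identities") comes down to redacting before counting. A minimal sketch of that step, with regexes simplified from the `_redact` helper this diff adds:

```python
# Minimal sketch of the redaction step: strip emails and URLs before
# any counting, so only the pattern survives. The regexes are
# simplified versions of the _redact helper shipped in this diff.
import re
from collections import Counter

RE_EMAIL = re.compile(r"(?i)\b[\w.+-]+@[\w.-]+\.[a-z]{2,}\b")
RE_URL = re.compile(r"(?i)\b(?:https?://|www\.)\S+")


def redact(text: str) -> str:
    # Emails first, then URLs; counting happens on the redacted text only.
    return RE_URL.sub("", RE_EMAIL.sub("", text)).strip()


themes = Counter(redact(m).lower() for m in ["Book", "book", "Book a@b.com"])
```

The shipped helper also strips @handles and phone-like numbers; the idea is the same.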
+ +### What Happened + +> This export shows a dense campaign period, not a quiet inbox. + +Across the observed window, you sent a very large number of messages and you received a smaller number of replies back. That is normal when you are messaging lots of people: many contacts, fewer responders. + +| Metric | Value | Source | +|---|---:|---| +| Observation window (earliest → latest) | 2024-10-20 → 2025-12-22 | Instagram export | +| Total messages | 54,069 | Instagram export | +| Messages you sent | 43,607 | Instagram export | +| Messages people sent you | 10,462 | Instagram export | +| Messages that look like a question or a request | 2,713 | Instagram export | +| System messages about new followers (auto text in the inbox) | 8,081 | Instagram export | + +### What You Need to Know (In Plain English) + +> If you only read one section, read this one. + +This inbox has a few dominant patterns. They tell you what an auto-reply system must be good at, and where it must hand off to a human. + +| Thing to know | Why it matters | +|---|---| +| Most activity happened in **2025-12** | 47,315 messages in one month (87.5% of all messages in this export) | +| The #1 question/topic is **Just one word: book** | 1,857 times (68.4% of all questions/requests) | +| Questions/requests cluster on **Thursday, Friday** | Those two days contain most of the asking in this export | +| Most messages arrive in **18:00-23:59, 12:00-17:59 (UTC)** | If you are present in those blocks, reply rates usually improve | +| Repeat messages make up **67.6%** of your text messages | Fast replies are often repeats; custom replies are where delays happen | +| Language used by people (approx) | Spanish 25.7%, English 18.1%, Unknown 56.2% | +| Language used in your messages (approx) | Spanish 63.8%, English 29.2%, Unknown 7.0% | +| Custom-reply slow end | 90% of custom replies are faster than **16h 36m** | + +### Key Patterns Over Time + +> The month-by-month shape is uneven. There are clear bursts. 
+ +To avoid guesswork, we start with 3-month blocks (a simple way to smooth noise), then we go month-by-month. + +| 3-month block | Messages from people | Messages you sent | Questions/requests | +|---|---:|---:|---:| +| 2024 Oct-Dec | 14 | 0 | 0 | +| 2025 Jan-Mar | 21 | 0 | 0 | +| 2025 Apr-Jun | 93 | 100 | 16 | +| 2025 Jul-Sep | 622 | 879 | 88 | +| 2025 Oct-Dec | 9,712 | 42,628 | 2,609 | + +Same data as charts: + +```mermaid +pie title Messages From People by 3-Month Block + "2024 Oct-Dec" : 14 + "2025 Jan-Mar" : 21 + "2025 Apr-Jun" : 93 + "2025 Jul-Sep" : 622 + "2025 Oct-Dec" : 9712 +``` + +This shows when people replied most. A spike here usually means you posted something or you asked people to DM you. + +```mermaid +pie title Messages You Sent by 3-Month Block + "2025 Apr-Jun" : 100 + "2025 Jul-Sep" : 879 + "2025 Oct-Dec" : 42628 +``` + +This shows when you sent the most messages. A spike here is effort; the question is how many people replied back. + +### Month by Month (The Real Shape) + +> One big month dominates. Treat earlier months as less reliable. + +This month-by-month table is the clearest view of how the inbox changed over time in this export. + +| Month | Messages from people | Messages you sent | Questions/requests | Questions answered within 48 hours | +|---|---:|---:|---:|---:| +| 2024-10 | 3 | 0 | 0 | n/a | +| 2024-11 | 4 | 0 | 0 | n/a | +| 2024-12 | 7 | 0 | 0 | n/a | +| 2025-01 | 14 | 0 | 0 | n/a | +| 2025-02 | 2 | 0 | 0 | n/a | +| 2025-03 | 5 | 0 | 0 | n/a | +| 2025-04 | 8 | 5 | 0 | n/a | +| 2025-05 | 48 | 28 | 8 | 12.5% | +| 2025-06 | 37 | 67 | 8 | 87.5% | +| 2025-07 | 145 | 319 | 36 | 63.9% | +| 2025-08 | 193 | 230 | 28 | 50.0% | +| 2025-09 | 284 | 330 | 24 | 20.8% | +| 2025-10 | 787 | 1,190 | 64 | 17.2% | +| 2025-11 | 854 | 2,194 | 149 | 46.3% | +| 2025-12 | 8,071 | 39,244 | 2,396 | 89.6% | + +The busiest month was **2025-12** with **47,315** messages total (87.5% of everything in this export). 
That single month dominates the shape of the data. + +### Days People Reply + +> The best day to follow up is the day people already reply. + +Use this to time follow-ups and first messages. Do not spread effort evenly across the week. + +| Day of week | Messages from people | Messages you sent | Questions/requests | +|---|---:|---:|---:| +| Monday | 1,627 | 8,547 | 140 | +| Tuesday | 1,952 | 9,622 | 189 | +| Wednesday | 1,242 | 5,396 | 155 | +| Thursday | 2,349 | 7,126 | 1,340 | +| Friday | 1,610 | 5,494 | 728 | +| Saturday | 840 | 3,579 | 88 | +| Sunday | 842 | 3,843 | 73 | + +Same data as a chart: + +```mermaid +pie title Messages From People by Day of Week + "Monday" : 1627 + "Tuesday" : 1952 + "Wednesday" : 1242 + "Thursday" : 2349 + "Friday" : 1610 + "Saturday" : 840 + "Sunday" : 842 +``` + +### Time of Day People Reply + +> Most replies happen in a few time blocks. + +Time zone here is UTC (a standard clock). If you work in another time zone, shift the blocks before you schedule. + +| Time of day (UTC) | Messages from people | Messages you sent | +|---|---:|---:| +| 00:00-05:59 | 1,885 | 8,304 | +| 06:00-11:59 | 1,374 | 6,889 | +| 12:00-17:59 | 3,092 | 12,937 | +| 18:00-23:59 | 4,111 | 15,477 | + +Same data as a chart: + +```mermaid +pie title Messages From People by Time of Day (UTC) + "00:00-05:59" : 1885 + "06:00-11:59" : 1374 + "12:00-17:59" : 3092 + "18:00-23:59" : 4111 +``` + +### Reply Speed (Why It Matters) + +> Speed changes the feeling of safety. + +When someone asks a question, the clock starts. A short, direct acknowledgment often beats a perfect answer that arrives too late. + +One caution: “fast replies” are often repeat messages. This section shows overall speed, then splits it into repeat messages vs custom messages. 
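The "typical" and "slow end (90% are faster)" figures in the tables that follow read like a median and a 90th-percentile cut. A minimal sketch of that computation (the generator's exact percentile method is an assumption):

```python
# Sketch: "typical" = median reply delay, "slow end" = 90th percentile
# (90% of replies are faster). Uses a simple index-based percentile;
# the generator's exact method may differ.
import statistics


def typical_and_slow_end(delays_s: list[int]) -> tuple[int, int]:
    ordered = sorted(delays_s)
    median = int(statistics.median(ordered))
    p90 = ordered[min(len(ordered) - 1, int(0.9 * len(ordered)))]
    return median, p90
```

On this inbox the repeat/custom split is what matters: repeat messages sit near 4 seconds, while custom replies stretch to a 16h 36m slow end.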
+ +| Metric | Value | Source | +|---|---:|---| +| Typical time to reply | 4 seconds | Instagram export | +| Slow end (90% are faster) | 34 seconds | Instagram export | +| Typical time to reply to questions/requests | 2 seconds | Instagram export | +| Slow end for questions/requests (90% are faster) | 4 seconds | Instagram export | +| Messages from people answered within 48 hours | 7,467 (71.4%) | Instagram export | +| Questions/requests answered within 48 hours | 2,278 (84.0%) | Instagram export | + +Breakdown by message type (repeat messages vs custom messages): + +| Type of message you sent (text only) | Count | Typical reply speed | Slow end (90% are faster) | +|---|---:|---:|---:| +| Repeat messages | 18,860 | 4 seconds | 32 seconds | +| Custom messages | 9,040 | 12 seconds | 16h 36m | +| No text (media/reactions) | 15,707 | n/a | n/a | + +| Type of message you sent (questions only) | Typical reply speed | +|---|---:| +| Repeat messages | 2 seconds | +| Custom messages | 9 minutes | + +### Language Mix (What Language People Use) + +> Matching the other person’s language increases trust and reduces back-and-forth. + +This is an approximate language guess based on the text itself. Short one-word messages are harder to classify and may show up as “Unknown”. + +| Language | Messages from people | Messages you sent (text only) | +|---|---:|---:| +| Spanish | 2,662 (25.7%) | 17,804 (63.8%) | +| English | 1,870 (18.1%) | 8,136 (29.2%) | +| Unknown | 5,814 (56.2%) | 1,961 (7.0%) | + +Language split chart (messages from people): + +```mermaid +pie title Language Split (Messages From People) + "Spanish" : 2662 + "English" : 1870 + "Unknown" : 5814 +``` + +Language split chart (messages you sent): + +```mermaid +pie title Language Split (Messages You Sent) + "Spanish" : 17804 + "English" : 8136 + "Unknown" : 1961 +``` + +### Top 20 Things People Ask or Type (Ranked) + +> People repeat the same questions. This is the easiest thing to standardize. 
+ +This list is grouped by meaning (not by exact wording). It includes very short requests (sometimes a single word). + +| Rank | Topic (plain English) | Count | Share of all questions/requests | +|---:|---|---:|---:| +| 1 | Just one word: book | 1,857 | 68.4% | +| 2 | What is this? | 206 | 7.6% | +| 3 | Can you send the video? | 191 | 7.0% | +| 4 | Other question | 120 | 4.4% | +| 5 | Can you help me? | 74 | 2.7% | +| 6 | Can you send the link? | 61 | 2.2% | +| 7 | What does it cost? | 53 | 2.0% | +| 8 | Is this therapy? | 44 | 1.6% | +| 9 | Where do I get the book? | 36 | 1.3% | +| 10 | I can’t find it / it didn’t arrive | 26 | 1.0% | +| 11 | How do I book a call? | 11 | 0.4% | +| 12 | How do I start? | 10 | 0.4% | +| 13 | Can we talk on WhatsApp? | 7 | 0.3% | +| 14 | How does it work? | 5 | 0.2% | +| 15 | What are the steps? | 4 | 0.1% | +| 16 | Is this real? | 4 | 0.1% | +| 17 | Is it free? | 2 | 0.1% | +| 18 | Can I get a refund? | 1 | 0.0% | +| 19 | How long does it take? | 1 | 0.0% | + +In plain terms: **1,893** of **2,713** questions/requests are about the book (69.8%). + +```mermaid +pie title Questions/Requests: Book vs Everything Else + "Book" : 1893 + "Everything else" : 820 +``` + +### Content Patterns (What You Mention When You Sell) + +> Content is not random. It leaves fingerprints in the inbox. + +We track certain words over time that usually show up when you are giving someone a next step (book, video, WhatsApp, call, payment, etc). This lets you see what dominated each period, without reading private conversations. 
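The word tracking behind the table that follows is a case-insensitive substring scan, mirroring the `_offer_terms` helper in this diff (term list shortened here):

```python
# Sketch of the word tracking: map substrings to canonical labels and
# count at most one hit per label per message. Shortened version of
# the _offer_terms mapping added in this diff.
from collections import Counter

TERMS = {"book": "Book", "ebook": "Ebook", "libro": "Book",
         "video": "Video", "whatsapp": "WhatsApp", "call": "Call"}


def offer_terms(text: str) -> set[str]:
    s = text.lower()
    return {label for term, label in TERMS.items() if term in s}


mentions: Counter[str] = Counter()
for msg in ["Here is the ebook", "el libro y el video"]:
    mentions.update(offer_terms(msg))
```

Note that `"book" in "ebook"` is true, so every ebook mention also counts as a book mention; that is why the Book and Ebook rows below track each other so closely.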
+ +| Word found in your messages | Mentions | Source | +|---|---:|---| +| Book | 1,915 | Instagram export (message text) | +| Ebook | 1,912 | Instagram export (message text) | +| Video | 1,124 | Instagram export (message text) | +| Workshop | 383 | Instagram export (message text) | +| Call | 32 | Instagram export (message text) | +| Course | 31 | Instagram export (message text) | +| YouTube | 18 | Instagram export (message text) | +| Training | 13 | Instagram export (message text) | +| Calendly | 7 | Instagram export (message text) | +| WhatsApp | 6 | Instagram export (message text) | + +```mermaid +flowchart LR + Q_2024_Q4["2024 Oct-Dec: No signals"] + Q_2025_Q1["2025 Jan-Mar: No signals"] + Q_2025_Q2["2025 Apr-Jun: Video (4), Training (3), Course (2)"] + Q_2025_Q3["2025 Jul-Sep: Video (28), Training (6), Platform (1)"] + Q_2025_Q4["2025 Oct-Dec: Book (1915), Ebook (1912), Video (1092)"] + Q_2024_Q4 --> Q_2025_Q1 + Q_2025_Q1 --> Q_2025_Q2 + Q_2025_Q2 --> Q_2025_Q3 + Q_2025_Q3 --> Q_2025_Q4 +``` + +This diagram is a high-level view of what you talked about most in each period (based on those words). + +### Follow-Ups (When People Do Not Reply) + +> Silence is where most conversations die. + +This section measures the time between two messages you sent in a row **when the other person did not reply in between**. Very short gaps are usually multi-part scripts. Longer gaps are true follow-ups. 
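The gap buckets in the table that follows can be computed with a simple threshold ladder (bucket edges taken from the table itself):

```python
# Sketch: bucket the gap (in seconds) between two consecutive messages
# you sent with no reply in between. Edges mirror the follow-up table.
def gap_bucket(gap_s: int) -> str:
    if gap_s < 60:
        return "Under 1 minute"
    if gap_s < 10 * 60:
        return "1-10 minutes"
    if gap_s < 60 * 60:
        return "10-60 minutes"
    if gap_s < 6 * 3600:
        return "1-6 hours"
    if gap_s < 24 * 3600:
        return "6-24 hours"
    if gap_s < 3 * 86400:
        return "1-3 days"
    return "Over 3 days"
```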
+ +| Time gap between two messages you sent | Count | Share | +|---|---:|---:| +| Under 1 minute | 23,579 | 85.6% | +| 1-10 minutes | 2,656 | 9.6% | +| 10-60 minutes | 266 | 1.0% | +| 1-6 hours | 18 | 0.1% | +| 6-24 hours | 899 | 3.3% | +| 1-3 days | 15 | 0.1% | +| Over 3 days | 128 | 0.5% | + +Same data as a chart: + +```mermaid +pie title Time Between Two Messages You Sent (No Reply In Between) + "Under 1 minute" : 23579 + "1-10 minutes" : 2656 + "10-60 minutes" : 266 + "1-6 hours" : 18 + "6-24 hours" : 899 + "1-3 days" : 15 + "Over 3 days" : 128 +``` + +### Recommended Actions (Concrete, Ranked) + +> Most improvements are not “more messages.” They are better timing and cleaner answers. + +These actions are intentionally practical. Each one can be implemented without changing your tone. + +| Priority | Action | Why it matters | How to check it worked | +|---:|---|---|---| +| 1 | Write 20 ready-made answers for the Top 20 list | Stops delays and confusion on the most repeated questions | Fewer follow-up messages like “I can’t find it” and faster replies on question days | +| 2 | Fix the book steps from start to finish (book → link → 1 question) | The book dominates the inbox; this must have no extra steps | People stop asking twice; fewer broken-link complaints | +| 3 | Add one short “I saw this” reply for busy hours | Keeps the person engaged even if you are not ready to write a full reply | More conversations continue instead of going silent | +| 4 | Be present on the top question days | Most questions cluster on a small set of days | Higher question reply rate on those days | +| 5 | Mirror language by default (English ↔ Spanish) | Reduces misunderstandings and builds trust | Fewer “What?” / “Explain” messages; more smooth back-and-forth | +| 6 | Add a follow-up rule when people go silent | Many sales die in silence; a simple follow-up rescues them | More replies after 24–48 hours; fewer dead threads | +| 7 | Add a “can’t find it / didn’t arrive” reply | This 
moment often creates distrust | Complaints resolve quickly; fewer repeated requests | +| 8 | Keep deep clinical discussion out of DMs | DMs are a bad place for nuance and risk | Shorter DM threads; more booked calls when needed | +| 9 | Add hard safety rules (crisis → human) | Automation must never handle high-risk situations | Fewer risky back-and-forth messages; clear handoff | +| 10 | Connect real outcomes (payments/bookings) to a tracker | Without it, you can’t tell what actually worked | You can answer: “Which messages lead to paid outcomes?” | + +### What You Do Not Need to Know + +> Privacy keeps you safe. You do not need to memorize people to help them. + +To run an auto-reply system safely, you only need patterns, not personal identities. Storing extra detail increases risk without improving results. + +| Do not store | Why it is unnecessary | +|---|---| +| Names, handles, phone numbers, emails | Not needed for pattern-based replies; increases privacy risk | +| Full message transcripts for every thread | You only need grouped question themes + best replies | +| Photos/videos/audio attachments | High risk and high storage cost; not required for reply style | +| One-off edge cases | Build rules for repeated situations; escalate rare cases to a human | + +### Decision Map for Safe Auto-Replies + +> A safe system is not smart. It is consistent. + +This is a simple decision map that an auto-reply system can follow. The goal is to answer quickly, stay consistent, and avoid risky situations. 
+ +```mermaid +flowchart LR + A1["New message"] --> A2{"What is it?"} + A2 -->|"Price"| B1["Answer the price; ask if they want the link or a call"] + A2 -->|"Link"| B2["Send the link; ask one yes/no question"] + A2 -->|"Book"| B3["Explain what the book is; ask what they want to change"] + A2 -->|"Book a call"| B4["Send booking steps; ask their time zone"] + A2 -->|"Trust"| B5["Give one simple proof; suggest a call if needed"] + A2 -->|"Other"| B6["Short answer; ask one clarifying question"] + B1 --> C1["If no reply: follow up on a high-reply day"] + B2 --> C1 + B3 --> C1 + B4 --> C1 + B5 --> C1 + B6 --> C1 +``` + +A system like this can cover most conversations without pretending to be a therapist in DM. + diff --git a/sergio_instagram_messaging/generate_dm_report_detailed.py b/sergio_instagram_messaging/generate_dm_report_detailed.py new file mode 100644 index 0000000..9e6f307 --- /dev/null +++ b/sergio_instagram_messaging/generate_dm_report_detailed.py @@ -0,0 +1,1244 @@ +from __future__ import annotations + +import argparse +import json +import os +import re +import statistics +from collections import Counter, defaultdict +from dataclasses import dataclass +from datetime import datetime, timedelta, timezone +from pathlib import Path +from typing import Any, Iterable, Literal + +from .analyze_instagram_export import canonicalize_text + + +def _safe_chmod_600(path: Path) -> None: + try: + os.chmod(path, 0o600) + except Exception: + return + + +def _dt_from_ts_ms(ts_ms: int) -> datetime: + return datetime.fromtimestamp(ts_ms / 1000.0, tz=timezone.utc) + + +def _iso(ts_ms: int | None) -> str | None: + if not ts_ms: + return None + return _dt_from_ts_ms(ts_ms).replace(microsecond=0).isoformat() + + +def _date_key(ts_ms: int) -> str: + return _dt_from_ts_ms(ts_ms).date().isoformat() + + +def _month_key(ts_ms: int) -> str: + dt = _dt_from_ts_ms(ts_ms) + return f"{dt.year:04d}-{dt.month:02d}" + + +def _quarter_key(ts_ms: int) -> str: + dt = _dt_from_ts_ms(ts_ms) + q = 
(dt.month - 1) // 3 + 1
+    return f"{dt.year:04d}-Q{q}"
+
+
+def _quarter_label(qkey: str) -> str:
+    try:
+        year_s, q_s = qkey.split("-Q", 1)
+        year = int(year_s)
+        q = int(q_s)
+    except Exception:
+        return qkey
+
+    ranges = {1: "Jan-Mar", 2: "Apr-Jun", 3: "Jul-Sep", 4: "Oct-Dec"}
+    return f"{year} {ranges.get(q, f'Q{q}')}"
+
+
+def _weekday_name(ts_ms: int) -> str:
+    return _dt_from_ts_ms(ts_ms).strftime("%A")
+
+
+def _hour(ts_ms: int) -> int:
+    return int(_dt_from_ts_ms(ts_ms).hour)
+
+
+def _hour_bucket(h: int) -> str:
+    if 0 <= h <= 5:
+        return "00:00-05:59"
+    if 6 <= h <= 11:
+        return "06:00-11:59"
+    if 12 <= h <= 17:
+        return "12:00-17:59"
+    return "18:00-23:59"
+
+
+_RE_EMAIL = re.compile(r"(?i)\b[\w.+-]+@[\w.-]+\.[a-z]{2,}\b")
+_RE_URL = re.compile(r"(?i)\b(?:https?://|www\.)\S+")
+_RE_HANDLE = re.compile(r"(?<!\w)@[\w.]+")  # @handles, not mid-word
+_RE_NUM = re.compile(r"\+?\d[\d\s().-]{6,}\d")  # phone-like digit runs
+
+
+def _maybe_fix_mojibake(text: str) -> str:
+    # Cheap repair for common UTF-8-as-latin1 artifacts.
+    s = text or ""
+    if not s:
+        return s
+    if not any(ch in s for ch in ("Ã", "Â", "ð", "�")):
+        return s
+    try:
+        repaired = s.encode("latin1", errors="ignore").decode("utf-8", errors="ignore")
+    except Exception:
+        return s
+    # Only accept if it looks cleaner.
+ bad_before = sum(s.count(ch) for ch in ("Ã", "Â", "ð", "�")) + bad_after = sum(repaired.count(ch) for ch in ("Ã", "Â", "ð", "�")) + return repaired if repaired and bad_after < bad_before else s + + +def _redact(text: str) -> str: + s = (text or "").strip() + if not s: + return "" + s = _RE_EMAIL.sub("", s) + s = _RE_URL.sub("", s) + s = _RE_HANDLE.sub("", s) + s = _RE_NUM.sub("", s) + return s + + +def _tokens(text: str) -> list[str]: + return re.findall(r"\b[\w']+\b", (text or "").lower()) + + +def _guess_lang(text: str) -> Literal["English", "Spanish", "Unknown"]: + s = (text or "").lower() + if not s.strip(): + return "Unknown" + if any(ch in s for ch in "áéíóúñü¿¡"): + return "Spanish" + toks = set(_tokens(s)) + es_hits = len(toks & {"que", "para", "pero", "porque", "gracias", "hola", "vale", "listo", "precio", "libro", "enlace", "llamada", "cita", "taller", "curso"}) + en_hits = len(toks & {"the", "and", "but", "because", "thanks", "hello", "price", "ready", "book", "ebook", "link", "call"}) + if es_hits > en_hits: + return "Spanish" + if en_hits > es_hits: + return "English" + return "Unknown" + + +def _extract_text(msg: dict[str, Any]) -> str | None: + for key in ("content", "text", "message"): + v = msg.get(key) + if isinstance(v, str) and v.strip(): + return _maybe_fix_mojibake(v.strip()) + share = msg.get("share") + if isinstance(share, dict): + parts: list[str] = [] + for k in ("link", "share_text", "original_content_owner"): + v = share.get(k) + if isinstance(v, str) and v.strip(): + parts.append(v.strip()) + if parts: + return _maybe_fix_mojibake(" | ".join(parts)) + return None + + +def _resolve_export_root(input_path: Path) -> Path: + p = input_path + if (p / "messages" / "inbox").is_dir(): + return p + # allow passing messages/inbox directly + if p.is_dir() and p.name.lower() in {"inbox", "message_requests"}: + return p.parent.parent + raise FileNotFoundError(f"Could not find messages/inbox under: {p}") + + +@dataclass(frozen=True) +class Event: + 
ts_ms: int + sender: str + is_owner: bool + text: str | None + owner_canon: str | None = None + + +def _iter_conversations(export_root: Path) -> Iterable[tuple[str, list[Path]]]: + inbox = export_root / "messages" / "inbox" + for conv_dir in sorted(inbox.iterdir(), key=lambda p: p.name): + if not conv_dir.is_dir(): + continue + parts = sorted(conv_dir.glob("message*.json"), key=lambda p: p.name) + if parts: + yield conv_dir.name, parts + + +def _load_events(parts: list[Path], *, owner_name: str | None) -> list[Event]: + raw_msgs: list[dict[str, Any]] = [] + for p in parts: + try: + data = json.loads(p.read_text(encoding="utf-8", errors="replace")) + except Exception: + continue + msgs = data.get("messages") + if isinstance(msgs, list): + for m in msgs: + if isinstance(m, dict): + raw_msgs.append(m) + + events: list[Event] = [] + for m in raw_msgs: + ts = m.get("timestamp_ms") + sender = m.get("sender_name") + if not isinstance(ts, int) or not isinstance(sender, str) or not sender.strip(): + continue + text = _extract_text(m) + is_owner = bool(owner_name and sender.strip() == owner_name) + canon = canonicalize_text(text) if (is_owner and isinstance(text, str) and text.strip()) else None + events.append(Event(ts_ms=ts, sender=sender.strip(), is_owner=is_owner, text=text, owner_canon=canon)) + events.sort(key=lambda e: e.ts_ms) + return events + + +def _infer_owner_name(export_root: Path, *, max_conversations: int = 2000) -> str | None: + conv_counts: Counter[str] = Counter() + msg_counts: Counter[str] = Counter() + for idx, (_cid, parts) in enumerate(_iter_conversations(export_root), 1): + if idx > max_conversations: + break + events = _load_events(parts, owner_name=None) + seen = set() + for e in events: + msg_counts[e.sender] += 1 + seen.add(e.sender) + for s in seen: + conv_counts[s] += 1 + if not conv_counts: + return None + return max(conv_counts.items(), key=lambda kv: (kv[1], msg_counts.get(kv[0], 0)))[0] + + +def _question_theme(text: str) -> str | None: + s 
= (text or "").strip().lower() + if not s: + return None + + # One-word / very-short requests (common in keyword-based flows). + s_compact = re.sub(r"[^a-z0-9áéíóúñü]+", " ", s).strip() + toks = s_compact.split() + if len(toks) == 1: + w = toks[0] + if w in {"book", "ebook", "libro", "pdf"}: + return "Just one word: book" + if w in {"link", "enlace"}: + return "Just one word: link" + if w in {"price", "precio", "cost"}: + return "Just one word: price" + + # "I tried, but it didn't arrive / it doesn't work" + if any( + k in s + for k in ( + "didn't receive", + "did not receive", + "haven't received", + "have not received", + "not received", + "didnt receive", + "no me llego", + "no me llegó", + "no me ha llegado", + "no llega", + "no funciona", + "doesn't work", + "doesnt work", + "not working", + "broken", + "error", + "can't find", + "cant find", + "cannot find", + "where is my", + ) + ): + return "I can’t find it / it didn’t arrive" + + # "Is it free?" + if any(k in s for k in ("free", "gratis", "gratuito", "gratuita")): + return "Is it free?" + + # "Can we talk on WhatsApp?" + if any(k in s for k in ("whatsapp", "wa.me", "wapp", "watsapp")): + return "Can we talk on WhatsApp?" + + # "Are you a therapist / is this therapy?" + if any( + k in s + for k in ( + "therapist", + "therapy", + "psychologist", + "psychology", + "psicologo", + "psicólogo", + "psicologia", + "psicología", + "terapia", + ) + ): + return "Is this therapy?" + + # "Where are you based?" + if any( + k in s + for k in ( + "where are you from", + "where are you based", + "where do you live", + "location", + "city", + "country", + "de donde eres", + "de dónde eres", + "donde estas", + "dónde estás", + "ubicacion", + "ubicación", + ) + ): + return "Where are you based?" + + # Price / cost + if any(k in s for k in ("price", "cost", "how much", "$", "€", "usd", "eur", "precio", "cuanto", "cuánto", "caro")): + return "What does it cost?" 
+ + # Link / payment link + if any(k in s for k in ("link", "send the link", "send me the link", "where is the link", "enlace", "stripe", "paypal", "checkout", "invoice")): + return "Can you send the link?" + + # Book / ebook / pdf + if any(k in s for k in ("book", "ebook", "e-book", "pdf", "libro")): + return "Where do I get the book?" + + # Call / schedule + if any( + k in s + for k in ( + "call", + "appointment", + "meeting", + "zoom", + "schedule", + "booking", + "book a call", + "calendly", + "llamada", + "cita", + "reunion", + "reunión", + "agendar", + "reservar", + ) + ): + return "How do I book a call?" + + # Video + if any(k in s for k in ("video", "vídeo", "youtube")): + return "Can you send the video?" + + # Steps / what next + if any(k in s for k in ("what next", "next step", "next steps", "steps", "pasos", "siguiente", "ahora que", "ahora qué")): + return "What are the steps?" + + # How it works / details + if any(k in s for k in ("how does", "how it works", "how does it work", "how does this work", "como funciona", "cómo funciona", "more info", "details", "explain")): + return "How does it work?" + + # What you do / what is this + if any(k in s for k in ("what is this", "what do you do", "what is it", "what do you offer", "service", "services", "que es", "qué es", "que haces", "qué haces", "de que va", "de qué va")): + return "What is this?" + + # Trust / legitimacy + if any(k in s for k in ("real?", "is this real", "legit", "scam", "trust", "estafa")): + return "Is this real?" + + # Timing + if any(k in s for k in ("when", "how long", "duration", "timeframe", "how many days", "cuándo", "cuanto tiempo", "cuánto tiempo")): + return "How long does it take?" + + # Getting started + if any(k in s for k in ("start", "ready", "begin", "how to start", "how do i start", "listo", "empezar", "comenzar")): + return "How do I start?" 
+ + # General help request + if any(k in s for k in ("can you help", "help me", "need help", "ayuda", "necesito ayuda", "me puedes ayudar", "me puede ayudar")): + return "Can you help me?" + + # Refund + if any(k in s for k in ("refund", "money back", "devolucion", "devolución", "reembolso")): + return "Can I get a refund?" + + return None + + +def _is_question(text: str) -> bool: + s = (text or "").strip() + if not s: + return False + if "?" in s or "¿" in s: + return True + low = s.lower().lstrip() + if any(low.startswith(w) for w in ("how ", "what ", "when ", "where ", "why ", "can ", "could ", "do ", "does ", "did ", "is ", "are ", "will ", "would ", "should ")): + return True + return _question_theme(s) is not None + + +def _offer_terms(text: str) -> set[str]: + s = (text or "").lower() + if not s.strip(): + return set() + terms = set() + for term, key in ( + ("whatsapp", "WhatsApp"), + ("youtube", "YouTube"), + ("video", "Video"), + ("vídeo", "Video"), + ("book", "Book"), + ("ebook", "Ebook"), + ("e-book", "Ebook"), + ("libro", "Book"), + ("pdf", "PDF"), + ("call", "Call"), + ("llamada", "Call"), + ("cita", "Call"), + ("calendly", "Calendly"), + ("zoom", "Zoom"), + ("taller", "Workshop"), + ("workshop", "Workshop"), + ("curso", "Course"), + ("course", "Course"), + ("programa", "Program"), + ("program", "Program"), + ("formacion", "Training"), + ("formación", "Training"), + ("formation", "Training"), + ("training", "Training"), + ("plataforma", "Platform"), + ("platform", "Platform"), + ("comunidad", "Community"), + ("community", "Community"), + ("coaching", "Coaching"), + ("mentorship", "Mentorship"), + ("mentoring", "Mentorship"), + ("audit", "Audit"), + ("stripe", "Stripe"), + ("paypal", "PayPal"), + ): + if term in s: + terms.add(key) + return terms + + +@dataclass +class TimeBucketStats: + inbound: int = 0 + outbound: int = 0 + inbound_questions: int = 0 + inbound_responded: int = 0 + inbound_questions_responded: int = 0 + outbound_text: int = 0 + 
+    outbound_repeat_text: int = 0
+    outbound_custom_text: int = 0
+    outbound_non_text: int = 0
+
+
+def _pct(n: int, d: int) -> str:
+    return "n/a" if d <= 0 else f"{(n/d)*100:.1f}%"
+
+
+def _fmt_duration(seconds: int | None) -> str:
+    if seconds is None:
+        return "n/a"
+    if seconds < 60:
+        return f"{seconds} seconds"
+    minutes = seconds // 60
+    if minutes < 60:
+        return f"{minutes} minutes"
+    hours = minutes // 60
+    minutes_rem = minutes % 60
+    if hours < 48:
+        if minutes_rem == 0:
+            return f"{hours} hours"
+        return f"{hours}h {minutes_rem}m"
+    days = hours // 24
+    hours_rem = hours % 24
+    if hours_rem == 0:
+        return f"{days} days"
+    return f"{days}d {hours_rem}h"
+
+
+def generate_report(
+    *,
+    export_input: Path,
+    analysis_dir: Path | None,
+    out_path: Path,
+    owner_name: str | None,
+    reply_window_hours: float,
+    scripted_min_count: int,
+) -> Path:
+    export_root = _resolve_export_root(export_input)
+
+    summary: dict[str, Any] | None = None
+    if analysis_dir:
+        p = analysis_dir / "summary.json"
+        if p.exists():
+            summary = json.loads(p.read_text(encoding="utf-8", errors="replace"))
+
+    owner_name = owner_name or (summary.get("owner_name") if isinstance(summary, dict) else None)
+    if not owner_name or not isinstance(owner_name, str) or not owner_name.strip():
+        owner_name = _infer_owner_name(export_root)
+    if not owner_name:
+        raise RuntimeError("Could not infer owner name. Pass --owner-name.")
+
+    conversations: list[list[Event]] = []
+    outgoing_template_counts: Counter[str] = Counter()
+    for _cid, parts in _iter_conversations(export_root):
+        events = _load_events(list(parts), owner_name=owner_name)
+        if not events:
+            continue
+        conversations.append(events)
+        for e in events:
+            if e.is_owner and isinstance(e.owner_canon, str) and e.owner_canon.strip():
+                outgoing_template_counts[e.owner_canon] += 1
+
+    scripted_templates = {canon for canon, cnt in outgoing_template_counts.items() if cnt >= scripted_min_count}
+
+    # Aggregate metrics
+    min_ts: int | None = None
+    max_ts: int | None = None
+    total_msgs = 0
+    inbound_msgs = 0
+    outbound_msgs = 0
+    inbound_questions = 0
+    system_new_follower_msgs = 0
+
+    lang_out = Counter()
+    lang_in = Counter()
+    lang_out_by_month: dict[str, Counter[str]] = defaultdict(Counter)
+    lang_in_by_month: dict[str, Counter[str]] = defaultdict(Counter)
+    lang_out_by_quarter: dict[str, Counter[str]] = defaultdict(Counter)
+    lang_in_by_quarter: dict[str, Counter[str]] = defaultdict(Counter)
+
+    by_month: dict[str, TimeBucketStats] = defaultdict(TimeBucketStats)
+    by_quarter: dict[str, TimeBucketStats] = defaultdict(TimeBucketStats)
+    by_weekday: dict[str, TimeBucketStats] = defaultdict(TimeBucketStats)
+    by_hour_bucket: dict[str, TimeBucketStats] = defaultdict(TimeBucketStats)
+    by_date: dict[str, TimeBucketStats] = defaultdict(TimeBucketStats)
+
+    question_themes: Counter[str] = Counter()
+    question_theme_by_month: dict[str, Counter[str]] = defaultdict(Counter)
+    question_theme_by_quarter: dict[str, Counter[str]] = defaultdict(Counter)
+    question_theme_by_weekday: dict[str, Counter[str]] = defaultdict(Counter)
+    question_theme_by_hour_bucket: dict[str, Counter[str]] = defaultdict(Counter)
+    offer_terms_by_month: dict[str, Counter[str]] = defaultdict(Counter)
+    offer_terms_by_quarter: dict[str, Counter[str]] = defaultdict(Counter)
+
+    reply_delays_s: list[int] = []
+    reply_delays_questions_s: list[int] = []
+    reply_delays_scripted_s: list[int] = []
+    reply_delays_custom_s: list[int] = []
+    reply_delays_questions_scripted_s: list[int] = []
+    reply_delays_questions_custom_s: list[int] = []
+
+    owner_owner_gaps_s: list[int] = []
+
+    reply_window_s = int(reply_window_hours * 3600)
+
+    for events in conversations:
+        pending_inbound_ts: int | None = None
+        pending_inbound_is_question = False
+        prev: Event | None = None
+
+        for e in events:
+            if prev is not None and prev.is_owner and e.is_owner:
+                gap_s = int((e.ts_ms - prev.ts_ms) / 1000)
+                if gap_s >= 0:
+                    owner_owner_gaps_s.append(gap_s)
+
+            total_msgs += 1
+            min_ts = e.ts_ms if min_ts is None or e.ts_ms < min_ts else min_ts
+            max_ts = e.ts_ms if max_ts is None or e.ts_ms > max_ts else max_ts
+
+            month = _month_key(e.ts_ms)
+            quarter = _quarter_key(e.ts_ms)
+            weekday = _weekday_name(e.ts_ms)
+            hb = _hour_bucket(_hour(e.ts_ms))
+            day = _date_key(e.ts_ms)
+
+            if e.is_owner:
+                outbound_msgs += 1
+                by_month[month].outbound += 1
+                by_quarter[quarter].outbound += 1
+                by_weekday[weekday].outbound += 1
+                by_hour_bucket[hb].outbound += 1
+                by_date[day].outbound += 1
+
+                if e.text:
+                    lang = _guess_lang(e.text)
+                    lang_out[lang] += 1
+                    lang_out_by_month[month][lang] += 1
+                    lang_out_by_quarter[quarter][lang] += 1
+
+                    is_repeat = bool(e.owner_canon and e.owner_canon in scripted_templates)
+                    by_month[month].outbound_text += 1
+                    by_quarter[quarter].outbound_text += 1
+                    by_weekday[weekday].outbound_text += 1
+                    by_hour_bucket[hb].outbound_text += 1
+                    by_date[day].outbound_text += 1
+                    if is_repeat:
+                        by_month[month].outbound_repeat_text += 1
+                        by_quarter[quarter].outbound_repeat_text += 1
+                        by_weekday[weekday].outbound_repeat_text += 1
+                        by_hour_bucket[hb].outbound_repeat_text += 1
+                        by_date[day].outbound_repeat_text += 1
+                    else:
+                        by_month[month].outbound_custom_text += 1
+                        by_quarter[quarter].outbound_custom_text += 1
+                        by_weekday[weekday].outbound_custom_text += 1
+                        by_hour_bucket[hb].outbound_custom_text += 1
+                        by_date[day].outbound_custom_text += 1
+
+                    offers = _offer_terms(e.text)
+                    for t in offers:
+                        offer_terms_by_month[month][t] += 1
+                        offer_terms_by_quarter[quarter][t] += 1
+                    if e.text.lower().startswith("you messaged") and "followed your account" in e.text.lower():
+                        system_new_follower_msgs += 1
+                else:
+                    by_month[month].outbound_non_text += 1
+                    by_quarter[quarter].outbound_non_text += 1
+                    by_weekday[weekday].outbound_non_text += 1
+                    by_hour_bucket[hb].outbound_non_text += 1
+                    by_date[day].outbound_non_text += 1
+
+                # Mark response to the most recent inbound message.
+                if pending_inbound_ts is not None:
+                    delay_s = int((e.ts_ms - pending_inbound_ts) / 1000)
+                    if 0 <= delay_s <= reply_window_s:
+                        reply_delays_s.append(delay_s)
+                        is_scripted = bool(e.owner_canon and e.owner_canon in scripted_templates)
+                        if is_scripted:
+                            reply_delays_scripted_s.append(delay_s)
+                        else:
+                            reply_delays_custom_s.append(delay_s)
+                        by_month[month].inbound_responded += 1
+                        by_quarter[quarter].inbound_responded += 1
+                        by_weekday[weekday].inbound_responded += 1
+                        by_hour_bucket[hb].inbound_responded += 1
+                        by_date[day].inbound_responded += 1
+                        if pending_inbound_is_question:
+                            reply_delays_questions_s.append(delay_s)
+                            if is_scripted:
+                                reply_delays_questions_scripted_s.append(delay_s)
+                            else:
+                                reply_delays_questions_custom_s.append(delay_s)
+                            by_month[month].inbound_questions_responded += 1
+                            by_quarter[quarter].inbound_questions_responded += 1
+                            by_weekday[weekday].inbound_questions_responded += 1
+                            by_hour_bucket[hb].inbound_questions_responded += 1
+                            by_date[day].inbound_questions_responded += 1
+                    pending_inbound_ts = None
+                    pending_inbound_is_question = False
+
+            else:
+                inbound_msgs += 1
+                by_month[month].inbound += 1
+                by_quarter[quarter].inbound += 1
+                by_weekday[weekday].inbound += 1
+                by_hour_bucket[hb].inbound += 1
+                by_date[day].inbound += 1
+
+                if e.text:
+                    lang = _guess_lang(e.text)
+                    lang_in[lang] += 1
+                    lang_in_by_month[month][lang] += 1
+                    lang_in_by_quarter[quarter][lang] += 1
+
+                    is_q = _is_question(e.text)
+                    if is_q:
+                        inbound_questions += 1
+                        by_month[month].inbound_questions += 1
+                        by_quarter[quarter].inbound_questions += 1
+                        by_weekday[weekday].inbound_questions += 1
+                        by_hour_bucket[hb].inbound_questions += 1
+                        by_date[day].inbound_questions += 1
+
+                        theme = _question_theme(e.text) or "Other question"
+                        question_themes[theme] += 1
+                        question_theme_by_month[month][theme] += 1
+                        question_theme_by_quarter[quarter][theme] += 1
+                        question_theme_by_weekday[weekday][theme] += 1
+                        question_theme_by_hour_bucket[hb][theme] += 1
+
+                pending_inbound_ts = e.ts_ms
+                pending_inbound_is_question = bool(e.text and _is_question(e.text))
+
+            prev = e
+
+    if min_ts is None or max_ts is None:
+        raise RuntimeError("No messages found in export.")
+
+    window_start = _dt_from_ts_ms(min_ts)
+    window_end = _dt_from_ts_ms(max_ts)
+    window_days = max(1, int((window_end - window_start).total_seconds() / 86400) + 1)
+
+    def median_seconds(values: list[int]) -> int | None:
+        return int(statistics.median(values)) if values else None
+
+    def p90_seconds(values: list[int]) -> int | None:
+        if len(values) < 20:
+            return None
+        return int(statistics.quantiles(values, n=10)[-1])
+
+    med_reply_s = median_seconds(reply_delays_s)
+    p90_reply_s = p90_seconds(reply_delays_s)
+    med_q_reply_s = median_seconds(reply_delays_questions_s)
+    p90_q_reply_s = p90_seconds(reply_delays_questions_s)
+
+    # Build ranked tables (no raw message quotes).
+    top_question_themes = question_themes.most_common(20)
+    top_offer_terms = Counter()
+    for _m, c in offer_terms_by_month.items():
+        top_offer_terms.update(c)
+    top_offer_terms_ranked = top_offer_terms.most_common(10)
+
+    # Seasonality proxies (month/quarter volume).
+    months_sorted = sorted(by_month.keys())
+    quarters_sorted = sorted(by_quarter.keys())
+    busiest_months = sorted(by_month.items(), key=lambda kv: kv[1].inbound + kv[1].outbound, reverse=True)[:5]
+
+    # Daily pattern within window.
+    dates_sorted = sorted(by_date.keys())
+
+    outbound_with_text = sum(outgoing_template_counts.values())
+    scripted_outbound_with_text = sum(outgoing_template_counts[c] for c in scripted_templates if c in outgoing_template_counts)
+    custom_outbound_with_text = outbound_with_text - scripted_outbound_with_text
+    outbound_no_text = outbound_msgs - outbound_with_text
+
+    total_inbound_responded = sum(st.inbound_responded for st in by_month.values())
+    total_inbound_questions_responded = sum(st.inbound_questions_responded for st in by_month.values())
+
+    med_scripted_reply_s = median_seconds(reply_delays_scripted_s)
+    med_custom_reply_s = median_seconds(reply_delays_custom_s)
+    slow_scripted_reply_s = p90_seconds(reply_delays_scripted_s)
+    slow_custom_reply_s = p90_seconds(reply_delays_custom_s)
+
+    med_scripted_q_reply_s = median_seconds(reply_delays_questions_scripted_s)
+    med_custom_q_reply_s = median_seconds(reply_delays_questions_custom_s)
+
+    # Report body (English-only, no persona names, no technical labels).
+    now = datetime.now(timezone.utc).date().isoformat()
+    report: list[str] = []
+
+    report.append("# Socialmediatorr Instagram DM History : Plain-English Deep Report")
+    report.append("")
+    report.append("## DM History Deep Report")
+    report.append("")
+    report.append("**Subject:** Instagram direct messages for `@socialmediatorr`")
+    report.append("**Version:** v1.0 (STYLE BIBLE EN 3.0GM)")
+    report.append(f"**Date:** {now}")
+    report.append("**Status:** REVIEW REQUIRED")
+    report.append("**Citation:** `if://report/socialmediatorr/instagram/dm-history/`")
+    report.append("**Author:** Danny Stocker | InfraFabric Research")
+    report.append("")
+    report.append("### How This Report Was Made")
+    report.append("")
+    report.append("> This is an automated count of patterns. It is not a therapy note and it is not a sales ledger.")
+    report.append("")
+    report.append(
+        "This document was generated by reading an Instagram data export and counting repeat patterns over time. "
+        "It avoids quoting private client messages and it avoids storing personal identities."
+    )
+    report.append("")
+    report.append("---")
+    report.append("")
+    report.append(f"**Context:** This inbox contains a high-volume message-and-reply system over {window_days} days.")
+    report.append("")
+    report.append("> Your messaging system is working as a volume engine. The weak point is consistency at the moments where people ask to buy or book.")
+    report.append("")
+    report.append(
+        "The purpose of this report is practical: define what to keep, what to remove, and what to automate safely—without damaging trust."
+    )
+    report.append("")
+
+    report.append("### What Happened")
+    report.append("")
+    report.append("> This export shows a dense campaign period, not a quiet inbox.")
+    report.append("")
+    report.append(
+        "Across the observed window, you sent a very large number of messages and you received a smaller number of replies back. "
+        "That is normal when you are messaging lots of people: many contacts, fewer responders."
+    )
+    report.append("")
+    report.append("| Metric | Value | Source |")
+    report.append("|---|---:|---|")
+    report.append(f"| Observation window (earliest → latest) | {window_start.date().isoformat()} → {window_end.date().isoformat()} | Instagram export |")
+    report.append(f"| Total messages | {total_msgs:,} | Instagram export |")
+    report.append(f"| Messages you sent | {outbound_msgs:,} | Instagram export |")
+    report.append(f"| Messages people sent you | {inbound_msgs:,} | Instagram export |")
+    report.append(f"| Messages that look like a question or a request | {inbound_questions:,} | Instagram export |")
+    report.append(f"| System messages about new followers (auto text in the inbox) | {system_new_follower_msgs:,} | Instagram export |")
+    report.append("")
+
+    top_theme, top_theme_cnt = top_question_themes[0] if top_question_themes else ("n/a", 0)
+    weekday_order_quick = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
+    top_question_days = sorted(
+        weekday_order_quick,
+        key=lambda wd: by_weekday.get(wd, TimeBucketStats()).inbound_questions,
+        reverse=True,
+    )[:2]
+    top_question_days_s = ", ".join(top_question_days) if top_question_days else "n/a"
+    hb_order_quick = ["00:00-05:59", "06:00-11:59", "12:00-17:59", "18:00-23:59"]
+    top_time_blocks = sorted(
+        hb_order_quick,
+        key=lambda hb: by_hour_bucket.get(hb, TimeBucketStats()).inbound,
+        reverse=True,
+    )[:2]
+    top_time_blocks_s = ", ".join(top_time_blocks) if top_time_blocks else "n/a"
+    in_lang_total = sum(lang_in.values()) or 1
+    out_lang_total = sum(lang_out.values()) or 1
+    in_lang_summary = (
+        ", ".join([f"{lang} {_pct(int(lang_in.get(lang, 0)), in_lang_total)}" for lang in ("Spanish", "English", "Unknown") if int(lang_in.get(lang, 0))])
+        or "n/a"
+    )
+    out_lang_summary = (
+        ", ".join([f"{lang} {_pct(int(lang_out.get(lang, 0)), out_lang_total)}" for lang in ("Spanish", "English", "Unknown") if int(lang_out.get(lang, 0))])
+        or "n/a"
+    )
+
+    report.append("### What You Need to Know (In Plain English)")
+    report.append("")
+    report.append("> If you only read one section, read this one.")
+    report.append("")
+    report.append(
+        "This inbox has a few dominant patterns. They tell you what an auto-reply system must be good at, and where it must hand off to a human."
+    )
+    report.append("")
+    report.append("| Thing to know | Why it matters |")
+    report.append("|---|---|")
+    if busiest_months:
+        m, st = busiest_months[0]
+        m_total = st.inbound + st.outbound
+        report.append(f"| Most activity happened in **{m}** | {m_total:,} messages in one month ({_pct(m_total, total_msgs)} of all messages in this export) |")
+    report.append(f"| The #1 question/topic is **{top_theme}** | {top_theme_cnt:,} times ({_pct(top_theme_cnt, inbound_questions)} of all questions/requests) |")
+    report.append(f"| Questions/requests cluster on **{top_question_days_s}** | Those two days contain most of the asking in this export |")
+    report.append(f"| Most messages arrive in **{top_time_blocks_s} (UTC)** | If you are present in those blocks, reply rates usually improve |")
+    report.append(f"| Repeat messages make up **{_pct(scripted_outbound_with_text, outbound_with_text)}** of your text messages | Fast replies are often repeats; custom replies are where delays happen |")
+    report.append(f"| Language used by people (approx) | {in_lang_summary} |")
+    report.append(f"| Language used in your messages (approx) | {out_lang_summary} |")
+    report.append(f"| Custom-reply slow end | 90% of custom replies are faster than **{_fmt_duration(slow_custom_reply_s)}** |")
+    report.append("")
+
+    report.append("### Key Patterns Over Time")
+    report.append("")
+    report.append("> The month-by-month shape is uneven. There are clear bursts.")
+    report.append("")
+    report.append(
+        "To avoid guesswork, we start with 3-month blocks (a simple way to smooth noise), then we go month-by-month."
+    )
+    report.append("")
+
+    report.append("| 3-month block | Messages from people | Messages you sent | Questions/requests |")
+    report.append("|---|---:|---:|---:|")
+    for q in quarters_sorted:
+        st = by_quarter.get(q, TimeBucketStats())
+        report.append(f"| {_quarter_label(q)} | {st.inbound:,} | {st.outbound:,} | {st.inbound_questions:,} |")
+    report.append("")
+    report.append("Same data as charts:")
+    report.append("")
+
+    # Quarter distribution diagrams (pie)
+    quarter_pie_in = [(_quarter_label(q), by_quarter[q].inbound) for q in quarters_sorted if by_quarter[q].inbound]
+    quarter_pie_out = [(_quarter_label(q), by_quarter[q].outbound) for q in quarters_sorted if by_quarter[q].outbound]
+
+    report.append("```mermaid")
+    report.append('pie title Messages From People by 3-Month Block')
+    for q, v in quarter_pie_in:
+        report.append(f' \"{q}\" : {v}')
+    report.append("```")
+    report.append("")
+    report.append("This shows when people replied most. A spike here usually means you posted something or you asked people to DM you.")
+    report.append("")
+
+    report.append("```mermaid")
+    report.append('pie title Messages You Sent by 3-Month Block')
+    for q, v in quarter_pie_out:
+        report.append(f' \"{q}\" : {v}')
+    report.append("```")
+    report.append("")
+    report.append("This shows when you sent the most messages. A spike here is effort; the question is how many people replied back.")
+    report.append("")
+
+    report.append("### Month by Month (The Real Shape)")
+    report.append("")
+    report.append("> One big month dominates. Treat earlier months as less reliable.")
+    report.append("")
+    report.append("This month-by-month table is the clearest view of how the inbox changed over time in this export.")
+    report.append("")
+    # Header reflects the configurable reply window instead of a hardcoded "48 hours".
+    report.append(f"| Month | Messages from people | Messages you sent | Questions/requests | Questions answered within {int(reply_window_hours)} hours |")
+    report.append("|---|---:|---:|---:|---:|")
+    for m in months_sorted:
+        st = by_month.get(m, TimeBucketStats())
+        report.append(f"| {m} | {st.inbound:,} | {st.outbound:,} | {st.inbound_questions:,} | {_pct(st.inbound_questions_responded, st.inbound_questions)} |")
+    report.append("")
+
+    busiest_m, busiest_st = busiest_months[0] if busiest_months else (None, None)
+    if busiest_m and busiest_st:
+        busiest_total = busiest_st.inbound + busiest_st.outbound
+        report.append(
+            f"The busiest month was **{busiest_m}** with **{busiest_total:,}** messages total ({_pct(busiest_total, total_msgs)} of everything in this export). That single month dominates the shape of the data."
+        )
+        report.append("")
+
+    report.append("### Days People Reply")
+    report.append("")
+    report.append("> The best day to follow up is the day people already reply.")
+    report.append("")
+    report.append(
+        "Use this to time follow-ups and first messages. Do not spread effort evenly across the week."
+    )
+    report.append("")
+
+    weekday_order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
+    report.append("| Day of week | Messages from people | Messages you sent | Questions/requests |")
+    report.append("|---|---:|---:|---:|")
+    for wd in weekday_order:
+        st = by_weekday.get(wd, TimeBucketStats())
+        report.append(f"| {wd} | {st.inbound:,} | {st.outbound:,} | {st.inbound_questions:,} |")
+    report.append("")
+    report.append("Same data as a chart:")
+    report.append("")
+    report.append("```mermaid")
+    report.append('pie title Messages From People by Day of Week')
+    for wd in weekday_order:
+        v = by_weekday.get(wd, TimeBucketStats()).inbound
+        if v:
+            report.append(f' \"{wd}\" : {v}')
+    report.append("```")
+    report.append("")
+
+    report.append("### Time of Day People Reply")
+    report.append("")
+    report.append("> Most replies happen in a few time blocks.")
+    report.append("")
+    report.append(
+        "Time zone here is UTC (a standard clock). If you work in another time zone, shift the blocks before you schedule."
+    )
+    report.append("")
+    hb_order = ["00:00-05:59", "06:00-11:59", "12:00-17:59", "18:00-23:59"]
+    report.append("| Time of day (UTC) | Messages from people | Messages you sent |")
+    report.append("|---|---:|---:|")
+    for hb in hb_order:
+        st = by_hour_bucket.get(hb, TimeBucketStats())
+        report.append(f"| {hb} | {st.inbound:,} | {st.outbound:,} |")
+    report.append("")
+    report.append("Same data as a chart:")
+    report.append("")
+    report.append("```mermaid")
+    report.append('pie title Messages From People by Time of Day (UTC)')
+    for hb in hb_order:
+        v = by_hour_bucket.get(hb, TimeBucketStats()).inbound
+        if v:
+            report.append(f' \"{hb}\" : {v}')
+    report.append("```")
+    report.append("")
+
+    report.append("### Reply Speed (Why It Matters)")
+    report.append("")
+    report.append("> Speed changes the feeling of safety.")
+    report.append("")
+    report.append(
+        "When someone asks a question, the clock starts. A short, direct acknowledgment often beats a perfect answer that arrives too late."
+    )
+    report.append("")
+    report.append(
+        "One caution: “fast replies” are often repeat messages. This section shows overall speed, then splits it into repeat messages vs custom messages."
+    )
+    report.append("")
+    report.append("| Metric | Value | Source |")
+    report.append("|---|---:|---|")
+    report.append(f"| Typical time to reply | {_fmt_duration(med_reply_s)} | Instagram export |")
+    report.append(f"| Slow end (90% are faster) | {_fmt_duration(p90_reply_s)} | Instagram export |")
+    report.append(f"| Typical time to reply to questions/requests | {_fmt_duration(med_q_reply_s)} | Instagram export |")
+    report.append(f"| Slow end for questions/requests (90% are faster) | {_fmt_duration(p90_q_reply_s)} | Instagram export |")
+    report.append(f"| Messages from people answered within {int(reply_window_hours)} hours | {total_inbound_responded:,} ({_pct(total_inbound_responded, inbound_msgs)}) | Instagram export |")
+    report.append(f"| Questions/requests answered within {int(reply_window_hours)} hours | {total_inbound_questions_responded:,} ({_pct(total_inbound_questions_responded, inbound_questions)}) | Instagram export |")
+    report.append("")
+    report.append("Breakdown by message type (repeat messages vs custom messages):")
+    report.append("")
+    report.append("| Type of message you sent (text only) | Count | Typical reply speed | Slow end (90% are faster) |")
+    report.append("|---|---:|---:|---:|")
+    report.append(f"| Repeat messages | {scripted_outbound_with_text:,} | {_fmt_duration(med_scripted_reply_s)} | {_fmt_duration(slow_scripted_reply_s)} |")
+    report.append(f"| Custom messages | {custom_outbound_with_text:,} | {_fmt_duration(med_custom_reply_s)} | {_fmt_duration(slow_custom_reply_s)} |")
+    report.append(f"| No text (media/reactions) | {outbound_no_text:,} | n/a | n/a |")
+    report.append("")
+    report.append("| Type of message you sent (questions only) | Typical reply speed |")
+    report.append("|---|---:|")
+    report.append(f"| Repeat messages | {_fmt_duration(med_scripted_q_reply_s)} |")
+    report.append(f"| Custom messages | {_fmt_duration(med_custom_q_reply_s)} |")
+    report.append("")
+
+    report.append("### Language Mix (What Language People Use)")
+    report.append("")
+    report.append("> Matching the other person’s language increases trust and reduces back-and-forth.")
+    report.append("")
+    report.append(
+        "This is an approximate language guess based on the text itself. Short one-word messages are harder to classify and may show up as “Unknown”."
+    )
+    report.append("")
+    report.append("| Language | Messages from people | Messages you sent (text only) |")
+    report.append("|---|---:|---:|")
+    for lang in ("Spanish", "English", "Unknown"):
+        report.append(
+            f"| {lang} | {lang_in.get(lang, 0):,} ({_pct(lang_in.get(lang, 0), in_lang_total)}) | {lang_out.get(lang, 0):,} ({_pct(lang_out.get(lang, 0), out_lang_total)}) |"
+        )
+    report.append("")
+    report.append("Language split chart (messages from people):")
+    report.append("")
+    report.append("```mermaid")
+    report.append("pie title Language Split (Messages From People)")
+    for lang in ("Spanish", "English", "Unknown"):
+        v = int(lang_in.get(lang, 0))
+        if v:
+            report.append(f' \"{lang}\" : {v}')
+    report.append("```")
+    report.append("")
+    report.append("Language split chart (messages you sent):")
+    report.append("")
+    report.append("```mermaid")
+    report.append("pie title Language Split (Messages You Sent)")
+    for lang in ("Spanish", "English", "Unknown"):
+        v = int(lang_out.get(lang, 0))
+        if v:
+            report.append(f' \"{lang}\" : {v}')
+    report.append("```")
+    report.append("")
+
+    report.append("### Top 20 Things People Ask or Type (Ranked)")
+    report.append("")
+    report.append("> People repeat the same questions. This is the easiest thing to standardize.")
+    report.append("")
+    report.append(
+        "This list is grouped by meaning (not by exact wording). It includes very short requests (sometimes a single word)."
+    )
+    report.append("")
+    report.append("| Rank | Topic (plain English) | Count | Share of all questions/requests |")
+    report.append("|---:|---|---:|---:|")
+    for i, (theme, cnt) in enumerate(top_question_themes, 1):
+        report.append(f"| {i} | {theme} | {cnt:,} | {_pct(cnt, inbound_questions)} |")
+    report.append("")
+    book_related = int(question_themes.get("Just one word: book", 0)) + int(question_themes.get("Where do I get the book?", 0))
+    other_related = max(0, int(inbound_questions) - book_related)
+    report.append(
+        f"In plain terms: **{book_related:,}** of **{inbound_questions:,}** questions/requests are about the book ({_pct(book_related, inbound_questions)})."
+    )
+    report.append("")
+    report.append("```mermaid")
+    report.append("pie title Questions/Requests: Book vs Everything Else")
+    if book_related:
+        report.append(f' \"Book\" : {book_related}')
+    if other_related:
+        report.append(f' \"Everything else\" : {other_related}')
+    report.append("```")
+    report.append("")
+
+    report.append("### Content Patterns (What You Mention When You Sell)")
+    report.append("")
+    report.append("> Content is not random. It leaves fingerprints in the inbox.")
+    report.append("")
+    report.append(
+        "We track certain words over time that usually show up when you are giving someone a next step (book, video, WhatsApp, call, payment, etc). "
+        "This lets you see what dominated each period, without reading private conversations."
+    )
+    report.append("")
+    report.append("| Word found in your messages | Mentions | Source |")
+    report.append("|---|---:|---|")
+    for term, cnt in top_offer_terms_ranked:
+        report.append(f"| {term} | {cnt:,} | Instagram export (message text) |")
+    report.append("")
+
+    report.append("```mermaid")
+    report.append("flowchart LR")
+    q_nodes: list[str] = []
+    for q in quarters_sorted:
+        node_id = "Q_" + q.replace("-", "_")
+        q_nodes.append(node_id)
+        top_terms = offer_terms_by_quarter.get(q, Counter()).most_common(3)
+        term_str = ", ".join([f"{t} ({c})" for t, c in top_terms]) if top_terms else "No signals"
+        report.append(f' {node_id}["{_quarter_label(q)}: {term_str}"]')
+    for i in range(len(q_nodes) - 1):
+        report.append(f" {q_nodes[i]} --> {q_nodes[i + 1]}")
+    report.append("```")
+    report.append("")
+    report.append("This diagram is a high-level view of what you talked about most in each period (based on those words).")
+    report.append("")
+
+    report.append("### Follow-Ups (When People Do Not Reply)")
+    report.append("")
+    report.append("> Silence is where most conversations die.")
+    report.append("")
+    report.append(
+        "This section measures the time between two messages you sent in a row **when the other person did not reply in between**. "
+        "Very short gaps are usually multi-part scripts. Longer gaps are true follow-ups."
+    )
+    report.append("")
+    gap_buckets = Counter()
+    for gap_s in owner_owner_gaps_s:
+        if gap_s <= 60:
+            gap_buckets["Under 1 minute"] += 1
+        elif gap_s <= 600:
+            gap_buckets["1-10 minutes"] += 1
+        elif gap_s <= 3600:
+            gap_buckets["10-60 minutes"] += 1
+        elif gap_s <= 6 * 3600:
+            gap_buckets["1-6 hours"] += 1
+        elif gap_s <= 24 * 3600:
+            gap_buckets["6-24 hours"] += 1
+        elif gap_s <= 3 * 24 * 3600:
+            gap_buckets["1-3 days"] += 1
+        else:
+            gap_buckets["Over 3 days"] += 1
+    gap_total = sum(gap_buckets.values()) or 1
+    bucket_order = ["Under 1 minute", "1-10 minutes", "10-60 minutes", "1-6 hours", "6-24 hours", "1-3 days", "Over 3 days"]
+    report.append("| Time gap between two messages you sent | Count | Share |")
+    report.append("|---|---:|---:|")
+    for b in bucket_order:
+        report.append(f"| {b} | {gap_buckets.get(b, 0):,} | {_pct(gap_buckets.get(b, 0), gap_total)} |")
+    report.append("")
+    report.append("Same data as a chart:")
+    report.append("")
+    report.append("```mermaid")
+    report.append("pie title Time Between Two Messages You Sent (No Reply In Between)")
+    for b in bucket_order:
+        v = int(gap_buckets.get(b, 0))
+        if v:
+            report.append(f' \"{b}\" : {v}')
+    report.append("```")
+    report.append("")
+
+    report.append("### Recommended Actions (Concrete, Ranked)")
+    report.append("")
+    report.append("> Most improvements are not “more messages.” They are better timing and cleaner answers.")
+    report.append("")
+    report.append(
+        "These actions are intentionally practical. Each one can be implemented without changing your tone."
+    )
+    report.append("")
+    report.append("| Priority | Action | Why it matters | How to check it worked |")
+    report.append("|---:|---|---|---|")
+    report.append("| 1 | Write 20 ready-made answers for the Top 20 list | Stops delays and confusion on the most repeated questions | Fewer follow-up messages like “I can’t find it” and faster replies on question days |")
+    report.append("| 2 | Fix the book steps from start to finish (book → link → 1 question) | The book dominates the inbox; this must have no extra steps | People stop asking twice; fewer broken-link complaints |")
+    report.append("| 3 | Add one short “I saw this” reply for busy hours | Keeps the person engaged even if you are not ready to write a full reply | More conversations continue instead of going silent |")
+    report.append("| 4 | Be present on the top question days | Most questions cluster on a small set of days | Higher question reply rate on those days |")
+    report.append("| 5 | Mirror language by default (English ↔ Spanish) | Reduces misunderstandings and builds trust | Fewer “What?” / “Explain” messages; more smooth back-and-forth |")
+    report.append("| 6 | Add a follow-up rule when people go silent | Many sales die in silence; a simple follow-up rescues them | More replies after 24–48 hours; fewer dead threads |")
+    report.append("| 7 | Add a “can’t find it / didn’t arrive” reply | This moment often creates distrust | Complaints resolve quickly; fewer repeated requests |")
+    report.append("| 8 | Keep deep clinical discussion out of DMs | DMs are a bad place for nuance and risk | Shorter DM threads; more booked calls when needed |")
+    report.append("| 9 | Add hard safety rules (crisis → human) | Automation must never handle high-risk situations | Fewer risky back-and-forth messages; clear handoff |")
+    report.append("| 10 | Connect real outcomes (payments/bookings) to a tracker | Without it, you can’t tell what actually worked | You can answer: “Which messages lead to paid outcomes?” |")
+    report.append("")
+
+    report.append("### What You Do Not Need to Know")
+    report.append("")
+    report.append("> Privacy keeps you safe. You do not need to memorize people to help them.")
+    report.append("")
+    report.append(
+        "To run an auto-reply system safely, you only need patterns, not personal identities. "
+        "Storing extra detail increases risk without improving results."
+    )
+    report.append("")
+    report.append("| Do not store | Why it is unnecessary |")
+    report.append("|---|---|")
+    report.append("| Names, handles, phone numbers, emails | Not needed for pattern-based replies; increases privacy risk |")
+    report.append("| Full message transcripts for every thread | You only need grouped question themes + best replies |")
+    report.append("| Photos/videos/audio attachments | High risk and high storage cost; not required for reply style |")
+    report.append("| One-off edge cases | Build rules for repeated situations; escalate rare cases to a human |")
+    report.append("")
+
+    report.append("### Decision Map for Safe Auto-Replies")
+    report.append("")
+    report.append("> A safe system is not smart. It is consistent.")
+    report.append("")
+    report.append(
+        "This is a simple decision map that an auto-reply system can follow. The goal is to answer quickly, stay consistent, and avoid risky situations."
+    )
+    report.append("")
+    report.append("```mermaid")
+    report.append("flowchart LR")
+    report.append(' A1[\"New message\"] --> A2{\"What is it?\"}')
+    report.append(' A2 -->|\"Price\"| B1[\"Answer the price; ask if they want the link or a call\"]')
+    report.append(' A2 -->|\"Link\"| B2[\"Send the link; ask one yes/no question\"]')
+    report.append(' A2 -->|\"Book\"| B3[\"Explain what the book is; ask what they want to change\"]')
+    report.append(' A2 -->|\"Book a call\"| B4[\"Send booking steps; ask their time zone\"]')
+    report.append(' A2 -->|\"Trust\"| B5[\"Give one simple proof; suggest a call if needed\"]')
+    report.append(' A2 -->|\"Other\"| B6[\"Short answer; ask one clarifying question\"]')
+    report.append(' B1 --> C1[\"If no reply: follow up on a high-reply day\"]')
+    report.append(' B2 --> C1')
+    report.append(' B3 --> C1')
+    report.append(' B4 --> C1')
+    report.append(' B5 --> C1')
+    report.append(' B6 --> C1')
+    report.append("```")
+    report.append("")
+    report.append("A system like this can cover most conversations without pretending to be a therapist in DM.")
+    report.append("")
+
+    out_path.parent.mkdir(parents=True, exist_ok=True)
+    out_path.write_text("\n".join(report) + "\n", encoding="utf-8")
+    _safe_chmod_600(out_path)
+    return out_path
+
+
+def main(argv: list[str] | None = None) -> int:
+    ap = argparse.ArgumentParser(description="Generate a deeper plain-English DM report from an Instagram export.")
+    ap.add_argument("--export-input", required=True, help="Instagram export root (contains messages/inbox)")
+    ap.add_argument("--analysis-dir", default=None, help="optional analyze_instagram_export output dir (for summary.json)")
+    ap.add_argument("--out", default=None, help="output markdown path")
+    ap.add_argument("--owner-name", default=None, help="owner sender_name (default: infer)")
+    ap.add_argument("--reply-window-hours", type=float, default=48.0, help="how long counts as a response")
+    ap.add_argument(
+        "--scripted-min-count",
+        type=int,
+        default=50,
+        help="outgoing template messages sent >= this count are treated as repeat templates",
+    )
+    args = ap.parse_args(argv)
+
+    export_input = Path(args.export_input)
+    analysis_dir = Path(args.analysis_dir) if args.analysis_dir else None
+    out_path = Path(args.out) if args.out else Path.cwd() / "dm_history_report_en_detailed.md"
+
+    try:
+        p = generate_report(
+            export_input=export_input,
+            analysis_dir=analysis_dir,
+            out_path=out_path,
+            owner_name=(args.owner_name.strip() if args.owner_name else None),
+            reply_window_hours=float(args.reply_window_hours),
+            scripted_min_count=int(args.scripted_min_count),
+        )
+        print(json.dumps({"ok": True, "out": str(p)}, ensure_ascii=False))
+        return 0
+    except FileNotFoundError as e:
+        print(f"Missing export input: {e}", file=os.sys.stderr)
+        return 2
+    except Exception as e:
+        print(f"Report generation failed: {e}", file=os.sys.stderr)
+        return 1
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())