Improve language detection (EN/FR/ES/CA)

2025-12-24 11:12:06 +00:00 · 817995ac6b
commit 817995ac6b
parent 9286a313f0
2 changed files with 237 additions and 42 deletions
--- a/reports/socialmediatorr/dm_history_report_en_detailed.md
+++ b/reports/socialmediatorr/dm_history_report_en_detailed.md
@@ -51,8 +51,8 @@ This inbox has a few dominant patterns. They tell you what an auto-reply system
 | Questions/requests cluster on **Thursday, Friday** | Those two days contain most of the asking in this export |
 | Most messages arrive in **18:00-23:59, 12:00-17:59 (CET)** | If you are present in those blocks, reply rates usually improve |
 | Repeat messages make up **67.6%** of your text messages | Fast replies are often repeats; custom replies are where delays happen |
-| Language used by people (approx) | Spanish 25.7%, English 18.1%, Unknown 56.2% |
-| Language used in your messages (approx) | Spanish 63.8%, English 29.2%, Unknown 7.0% |
+| Language used by people (approx, longer messages only) | French 3.6%, English 54.2%, Spanish 41.8%, Catalan 0.3% |
+| Language used in your messages (approx, longer messages only) | French 0.0%, English 32.6%, Spanish 67.3%, Catalan 0.1% |
 | Custom-reply slow end | 90% of custom replies are faster than **16h 36m** |

 ### Key Patterns Over Time
@@ -203,30 +203,34 @@ Breakdown by message type (repeat messages vs custom messages):

 > Matching the other person’s language increases trust and reduces back-and-forth.

-This is an approximate language guess based on the text itself. Short one-word messages are harder to classify and may show up as “Unknown”.
+This is an approximate language guess based on the text itself. Very short messages (one word, emoji, or numbers) are harder to classify and are marked as “Too short to tell”.

-| Language | Messages from people | Messages you sent (text only) |
-|---|---:|---:|
-| Spanish | 2,662 (25.7%) | 17,804 (63.8%) |
-| English | 1,870 (18.1%) | 8,136 (29.2%) |
-| Unknown | 5,814 (56.2%) | 1,961 (7.0%) |
+| Language | Messages from people | Share (enough text) | Share (all text) | Messages you sent (text only) | Share (enough text) | Share (all text) |
+|---|---:|---:|---:|---:|---:|---:|
+| French | 191 | 3.6% | 1.8% | 1 | 0.0% | 0.0% |
+| English | 2,861 | 54.2% | 27.7% | 8,840 | 32.6% | 31.7% |
+| Spanish | 2,208 | 41.8% | 21.3% | 18,224 | 67.3% | 65.3% |
+| Catalan | 16 | 0.3% | 0.2% | 17 | 0.1% | 0.1% |
+| Too short to tell | 5,070 | n/a | 49.0% | 819 | n/a | 2.9% |

-Language split chart (messages from people):
+Language split chart (messages from people, enough text only):

 ```mermaid
 pie title Language Split (Messages From People)
-  "Spanish" : 2662
-  "English" : 1870
-  "Unknown" : 5814
+  "French" : 191
+  "English" : 2861
+  "Spanish" : 2208
+  "Catalan" : 16
 ```

-Language split chart (messages you sent):
+Language split chart (messages you sent, enough text only):

 ```mermaid
 pie title Language Split (Messages You Sent)
-  "Spanish" : 17804
-  "English" : 8136
-  "Unknown" : 1961
+  "French" : 1
+  "English" : 8840
+  "Spanish" : 18224
+  "Catalan" : 17
 ```

 ### Top 20 Things People Ask or Type (Ranked)
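Note on the two share columns in the report table above: "Share (enough text)" divides by the four classified languages only, while "Share (all text)" also counts the "Too short to tell" bucket. The following is a minimal sketch of that arithmetic using the inbound counts from the table. The names `pct` and `language_shares` are hypothetical stand-ins (the real `_pct` helper is not shown in this diff, so its rounding and zero handling are assumptions).

```python
from collections import Counter

PRIMARY_LANGS = ("French", "English", "Spanish", "Catalan")
TOO_SHORT = "Too short to tell"


def pct(part: int, whole: int) -> str:
    """Hypothetical stand-in for the report's _pct helper (assumed behavior)."""
    if whole <= 0:
        # Assumption: an empty denominator is reported as "n/a" instead of raising.
        return "n/a"
    return f"{100 * part / whole:.1f}%"


def language_shares(counts: Counter) -> dict[str, tuple[str, str]]:
    """Return (share of classified messages, share of all text messages) per label."""
    all_total = sum(counts.values())
    primary_total = sum(counts.get(lang, 0) for lang in PRIMARY_LANGS)
    shares = {}
    for lang in (*PRIMARY_LANGS, TOO_SHORT):
        n = counts.get(lang, 0)
        # The "enough text" share is only defined for the classified languages.
        enough = pct(n, primary_total) if lang in PRIMARY_LANGS else "n/a"
        shares[lang] = (enough, pct(n, all_total))
    return shares


# Inbound counts copied from the "Messages from people" column of the table.
inbound = Counter(
    {"French": 191, "English": 2861, "Spanish": 2208, "Catalan": 16, TOO_SHORT: 5070}
)
shares = language_shares(inbound)
print(shares["English"])   # ('54.2%', '27.7%')
print(shares[TOO_SHORT])   # ('n/a', '49.0%')
```

With these counts the classified total is 5,276 and the all-text total is 10,346, which reproduces the English row (54.2% and 27.7%) and the 49.0% "Too short to tell" share.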
--- a/sergio_instagram_messaging/generate_dm_report_detailed.py
+++ b/sergio_instagram_messaging/generate_dm_report_detailed.py
@@ -127,20 +127,194 @@ def _tokens(text: str) -> list[str]:
    return re.findall(r"\b[\w']+\b", (text or "").lower())


-def _guess_lang(text: str) -> Literal["English", "Spanish", "Unknown"]:
+def _guess_lang(text: str) -> Literal["English", "Spanish", "French", "Catalan", "Too short to tell"]:
    s = (text or "").lower()
    if not s.strip():
-        return "Unknown"
+        return "Too short to tell"
+
+    # If there are no letters, we cannot infer language.
+    if not re.search(r"[a-záéíóúñüàâçèêëîïôœùûÿ·]", s):
+        return "Too short to tell"
+
+    # Accent-based hints (fast path).
+    if any(ch in s for ch in ("¿", "¡", "ñ")):
+        return "Spanish"
+    if "l·l" in s or "·" in s:
+        return "Catalan"
+    if any(ch in s for ch in ("œ", "æ")):
+        return "French"
+
+    toks_list = _tokens(s)
+    if not toks_list:
+        return "Too short to tell"
+
+    # One-word messages: only classify if the word is strongly tied to a language.
+    if len(toks_list) == 1:
+        w = toks_list[0]
+        if w in {"libro", "gracias", "precio", "enlace", "llamada", "cita", "hola", "vale"}:
+            return "Spanish"
+        if w in {"livre", "merci", "prix", "lien", "appel", "bonjour", "salut"}:
+            return "French"
+        if w in {"llibre", "gracies", "gràcies", "preu", "enllac", "enllaç", "trucada", "hola"}:
+            return "Catalan"
+        if w in {"book", "ebook", "price", "link", "call", "hello", "thanks"}:
+            return "English"
+        return "Too short to tell"
+
+    toks = set(toks_list)
+
+    # Stopword-style scoring (small, robust sets).
+    english = {
+        "i",
+        "you",
+        "we",
+        "the",
+        "a",
+        "an",
+        "and",
+        "or",
+        "but",
+        "because",
+        "thanks",
+        "thank",
+        "hello",
+        "hi",
+        "please",
+        "what",
+        "how",
+        "where",
+        "when",
+        "why",
+        "can",
+        "could",
+        "book",
+        "ebook",
+        "price",
+        "link",
+        "call",
+    }
+    spanish = {
+        "yo",
+        "tu",
+        "tú",
+        "usted",
+        "nosotros",
+        "el",
+        "la",
+        "los",
+        "las",
+        "un",
+        "una",
+        "y",
+        "o",
+        "pero",
+        "porque",
+        "por",
+        "para",
+        "con",
+        "que",
+        "cómo",
+        "como",
+        "cuando",
+        "cuándo",
+        "gracias",
+        "hola",
+        "precio",
+        "libro",
+        "enlace",
+        "llamada",
+        "cita",
+        "taller",
+        "curso",
+    }
+    french = {
+        "je",
+        "tu",
+        "vous",
+        "nous",
+        "le",
+        "la",
+        "les",
+        "un",
+        "une",
+        "et",
+        "ou",
+        "mais",
+        "parce",
+        "pour",
+        "avec",
+        "dans",
+        "sur",
+        "quoi",
+        "comment",
+        "quand",
+        "merci",
+        "bonjour",
+        "salut",
+        "prix",
+        "livre",
+        "lien",
+        "appel",
+        "rendezvous",
+        "rdv",
+    }
+    catalan = {
+        "jo",
+        "tu",
+        "vostè",
+        "vosaltres",
+        "el",
+        "la",
+        "els",
+        "les",
+        "un",
+        "una",
+        "i",
+        "o",
+        "però",
+        "pero",
+        "perque",
+        "perquè",
+        "per",
+        "amb",
+        "que",
+        "com",
+        "quan",
+        "gracies",
+        "gràcies",
+        "hola",
+        "preu",
+        "llibre",
+        "enllac",
+        "enllaç",
+        "trucada",
+    }
+
+    scores = {
+        "English": len(toks & english),
+        "Spanish": len(toks & spanish),
+        "French": len(toks & french),
+        "Catalan": len(toks & catalan),
+    }
+
+    # Accent boosts (helps with French/Catalan/Spanish split).
    if any(ch in s for ch in "áéíóúñü¿¡"):
-        return "Spanish"
-    toks = set(_tokens(s))
-    es_hits = len(toks & {"que", "para", "pero", "porque", "gracias", "hola", "vale", "listo", "precio", "libro", "enlace", "llamada", "cita", "taller", "curso"})
-    en_hits = len(toks & {"the", "and", "but", "because", "thanks", "hello", "price", "ready", "book", "ebook", "link", "call"})
-    if es_hits > en_hits:
-        return "Spanish"
-    if en_hits > es_hits:
-        return "English"
-    return "Unknown"
+        scores["Spanish"] += 2
+    if any(ch in s for ch in "àâçèéêëîïôœùûÿ"):
+        scores["French"] += 2
+    if any(ch in s for ch in "àèéíïòóúüç·"):
+        scores["Catalan"] += 1
+
+    best_lang, best_score = max(scores.items(), key=lambda kv: kv[1])
+    if best_score <= 0:
+        return "Too short to tell"
+
+    # If it is a near tie, do not overclaim.
+    sorted_scores = sorted(scores.values(), reverse=True)
+    if len(sorted_scores) >= 2 and sorted_scores[0] == sorted_scores[1]:
+        return "Too short to tell"
+
+    return best_lang  # type: ignore[return-value]


 def _extract_text(msg: dict[str, Any]) -> str | None:
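Note on the scoring step in `_guess_lang` above: each language gets a score equal to the number of distinct tokens that appear in its stopword set, and the function abstains when no set matches or when the top two scores are equal. The following is a minimal sketch of only that step. It uses tiny illustrative subsets of the word lists and a hypothetical function name `guess_lang_sketch`, and it deliberately omits the character fast paths, the one-word branch, and the accent boosts.

```python
import re

# Illustrative subsets only; the sets in the diff are larger.
STOPWORDS = {
    "English": {"the", "and", "what", "how", "price", "book"},
    "Spanish": {"el", "los", "para", "que", "precio", "libro"},
    "French": {"le", "les", "pour", "avec", "prix", "livre"},
    "Catalan": {"els", "amb", "per", "que", "preu", "llibre"},
}
TOO_SHORT = "Too short to tell"


def guess_lang_sketch(text: str) -> str:
    """Score each language by stopword overlap and abstain on ties or zero hits."""
    # Same tokenizer pattern as _tokens in the diff; a set counts each word once.
    tokens = set(re.findall(r"\b[\w']+\b", (text or "").lower()))
    scores = {lang: len(tokens & words) for lang, words in STOPWORDS.items()}
    ranked = sorted(scores.values(), reverse=True)
    best_score, runner_up = ranked[0], ranked[1]
    if best_score == 0 or best_score == runner_up:
        return TOO_SHORT
    # The best score is unique here, so max() cannot pick a tied language.
    return max(scores, key=scores.get)


print(guess_lang_sketch("what is the price"))          # English
print(guess_lang_sketch("el precio para los libros"))  # Spanish
print(guess_lang_sketch("que"))                        # Too short to tell
```

Words shared between sets (such as `que` here, or `la` and `un` in the full lists) raise several scores at once, so on their own they lead to a tie and an abstention. One behavior the sketch does not model: in the one-word branch of the diff, the Spanish set is checked before the Catalan set, so a lone `hola` is reported as Spanish even though the Catalan set also lists it.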
@@ -826,14 +1000,29 @@ def generate_report(
        reverse=True,
    )[:2]
    top_time_blocks_s = ", ".join(top_time_blocks) if top_time_blocks else "n/a"
-    in_lang_total = sum(lang_in.values()) or 1
-    out_lang_total = sum(lang_out.values()) or 1
+    primary_langs = ("French", "English", "Spanish", "Catalan")
+    in_lang_total = sum(lang_in.values())
+    out_lang_total = sum(lang_out.values())
+    in_lang_primary_total = sum(int(lang_in.get(lang, 0)) for lang in primary_langs)
+    out_lang_primary_total = sum(int(lang_out.get(lang, 0)) for lang in primary_langs)
    in_lang_summary = (
-        ", ".join([f"{lang} {_pct(int(lang_in.get(lang, 0)), in_lang_total)}" for lang in ("Spanish", "English", "Unknown") if int(lang_in.get(lang, 0))])
+        ", ".join(
+            [
+                f"{lang} {_pct(int(lang_in.get(lang, 0)), in_lang_primary_total)}"
+                for lang in primary_langs
+                if int(lang_in.get(lang, 0))
+            ]
+        )
        or "n/a"
    )
    out_lang_summary = (
-        ", ".join([f"{lang} {_pct(int(lang_out.get(lang, 0)), out_lang_total)}" for lang in ("Spanish", "English", "Unknown") if int(lang_out.get(lang, 0))])
+        ", ".join(
+            [
+                f"{lang} {_pct(int(lang_out.get(lang, 0)), out_lang_primary_total)}"
+                for lang in primary_langs
+                if int(lang_out.get(lang, 0))
+            ]
+        )
        or "n/a"
    )

@@ -855,8 +1044,8 @@ def generate_report(
    report.append(f"| Questions/requests cluster on **{top_question_days_s}** | Those two days contain most of the asking in this export |")
    report.append(f"| Most messages arrive in **{top_time_blocks_s} ({time_zone_label})** | If you are present in those blocks, reply rates usually improve |")
    report.append(f"| Repeat messages make up **{_pct(scripted_outbound_with_text, outbound_with_text)}** of your text messages | Fast replies are often repeats; custom replies are where delays happen |")
-    report.append(f"| Language used by people (approx) | {in_lang_summary} |")
-    report.append(f"| Language used in your messages (approx) | {out_lang_summary} |")
+    report.append(f"| Language used by people (approx, longer messages only) | {in_lang_summary} |")
+    report.append(f"| Language used in your messages (approx, longer messages only) | {out_lang_summary} |")
    report.append(f"| Custom-reply slow end | 90% of custom replies are faster than **{_fmt_duration(slow_custom_reply_s)}** |")
    report.append("")

@@ -1014,31 +1203,33 @@ def generate_report(
    report.append("> Matching the other person’s language increases trust and reduces back-and-forth.")
    report.append("")
    report.append(
-        "This is an approximate language guess based on the text itself. Short one-word messages are harder to classify and may show up as “Unknown”."
+        "This is an approximate language guess based on the text itself. Very short messages (one word, emoji, or numbers) are harder to classify and are marked as “Too short to tell”."
    )
    report.append("")
-    report.append("| Language | Messages from people | Messages you sent (text only) |")
-    report.append("|---|---:|---:|")
-    for lang in ("Spanish", "English", "Unknown"):
+    report.append("| Language | Messages from people | Share (enough text) | Share (all text) | Messages you sent (text only) | Share (enough text) | Share (all text) |")
+    report.append("|---|---:|---:|---:|---:|---:|---:|")
+    for lang in (*primary_langs, "Too short to tell"):
+        in_cnt = int(lang_in.get(lang, 0))
+        out_cnt = int(lang_out.get(lang, 0))
        report.append(
-            f"| {lang} | {lang_in.get(lang, 0):,} ({_pct(lang_in.get(lang, 0), in_lang_total)}) | {lang_out.get(lang, 0):,} ({_pct(lang_out.get(lang, 0), out_lang_total)}) |"
+            f"| {lang} | {in_cnt:,} | {_pct(in_cnt, in_lang_primary_total) if lang in primary_langs else 'n/a'} | {_pct(in_cnt, in_lang_total)} | {out_cnt:,} | {_pct(out_cnt, out_lang_primary_total) if lang in primary_langs else 'n/a'} | {_pct(out_cnt, out_lang_total)} |"
        )
    report.append("")
-    report.append("Language split chart (messages from people):")
+    report.append("Language split chart (messages from people, enough text only):")
    report.append("")
    report.append("```mermaid")
    report.append("pie title Language Split (Messages From People)")
-    for lang in ("Spanish", "English", "Unknown"):
+    for lang in primary_langs:
        v = int(lang_in.get(lang, 0))
        if v:
            report.append(f'  \"{lang}\" : {v}')
    report.append("```")
    report.append("")
-    report.append("Language split chart (messages you sent):")
+    report.append("Language split chart (messages you sent, enough text only):")
    report.append("")
    report.append("```mermaid")
    report.append("pie title Language Split (Messages You Sent)")
-    for lang in ("Spanish", "English", "Unknown"):
+    for lang in primary_langs:
        v = int(lang_out.get(lang, 0))
        if v:
            report.append(f'  \"{lang}\" : {v}')