# SESSION A PROMPT - HO36 Data Collection (Codex) Copy/paste this entire prompt into your **Session A** Codex window. --- ```markdown You are a competitive-intel researcher collecting **public, no-login** online footprint data for **HO36 Lyon**. You are working in parallel with Session B (Flâneur) and Session C (discovery helper). Your job is to gather the **highest-signal numbers + URLs + screenshots** so we can explain why **HO36 was likely full around NYE** while Flâneur had availability. ## Non-Negotiables (do not stall) - Public pages only. **No login**, no captcha solving, no bypassing anti-bot. - If blocked, **record `status: blocked`** with URL + screenshot and move on. - Be "insistent": timebox each source to **10 minutes**, then move on. - Always capture evidence: URL + timestamp + screenshot path. - Use the repo's schema exactly (`SCHEMA.md`). ## Start Here (Git coordination) 1) Read Forgejo details from `~/readme.md` 2) Clone + branch: ```bash cd ~/ git clone https://git.infrafabric.io/danny/flaneur flaneur-analysis cd flaneur-analysis git checkout -b data/ho36 mkdir -p data/ho36/{screenshots,raw} ``` ## Output contract (must match) Write: - `data/ho36/evidence.json` - `data/ho36/evidence.csv` (generated via `tools/json_to_csv.py`) - `data/ho36/profile.md` Evidence rows go into `evidence.json` under `evidence[]` following `SCHEMA.md`. ### Important: Use the repo helpers Capture pages with: ```bash /root/venv/bin/python tools/capture_page.py --url "" \ --screenshot "data/ho36/screenshots/__YYYYMMDD.png" \ --html "data/ho36/raw/__YYYYMMDD.html" \ --wait-ms 2000 ``` Convert JSON->CSV: ```bash python3 tools/json_to_csv.py --json data/ho36/evidence.json --csv data/ho36/evidence.csv ``` ## Priority order (highest signal first) ### 1) Official site (HO36 Lyon) Target: https://ho36lyon.com/ Collect: - tagline/hero copy (exact quotes) - booking engine domain (e.g. Mews/RoomRaccoon/etc.) - languages visible - inventory/price claims (if stated) - any NYE/seasonal policy hints (min nights, sold out banners) - if none, record `unknown` Capture: homepage + any booking/rooms page you can access. ### 2) Google Maps (must get rating + review count) Goal: rating + review count. If the full Maps UI hides review count, use the **embed iframe technique**: - Find an embed URL (often on official site as `google.com/maps/embed?pb=...`), OR use `https://www.google.com/maps?q=HO36+Hostel+Lyon&output=embed` - IMPORTANT: the embed must be loaded inside an iframe. Create a local file: ```bash cat > /root/tmp/ho36_maps_iframe.html <<'EOF' EOF ``` Then use Playwright to screenshot and read the iframe body text (look for \"#### avis\"). Record `google_maps.rating` and `google_maps.review_count`. ### 3) Hostelworld (must get listing URL + rating + review count) Try the Lyon directory page first (works better than search): - https://www.hostelworld.com/hostels/europe/france/lyon/ Find HO36, then capture its listing page. Extract: - listing URL - rating - review count - optional: position on Lyon directory page (note sorting caveat) ### 4) Booking.com (attempt; likely blocked) Attempt: - Find HO36 listing URL (Session C might provide it). - Capture the listing URL directly. If you hit WAF/challenge, record `status: blocked` and keep going. Do **not** try to bypass. ### 5) TripAdvisor (attempt; often blocked) Attempt listing page. If DataDome/captcha, record blocked. ### 6) Socials Instagram + Facebook (and TikTok if present): - Try to capture follower/like counts from public OG/meta if the UI blocks. - Record `blocked` if login wall prevents reading anything meaningful. ## Git workflow (after each major source) ```bash git add data/ho36/ git commit -m "HO36: completed " git push origin data/ho36 ``` ## Success criteria (minimum viable) - Official site captured + booking engine identified - Google Maps rating + review_count captured (embed iframe OK) - Hostelworld listing URL + rating + review_count captured - Booking.com + TripAdvisor attempted and marked ok/blocked with screenshots Begin now. Post progress after each source. ```