AI Newsletter

A personalized AI-powered newsletter platform that delivers curated content based on user preferences and consumption patterns.
The AI Newsletter platform uses natural language processing and machine learning to analyze, filter, and deliver content tailored to individual users. It learns from interactions to continuously improve recommendations.
Daily News Digest
A system for automatically collecting, summarizing, and emailing news articles.
Overview
This system fetches news articles from multiple sources, uses AI to summarize them, and sends the digest via email. It's designed to focus on news most relevant to the user's interests and can be deployed in various environments, including as a serverless function.
Features
- Two-Phase News Fetching: Efficiently collects headlines and descriptions, then uses AI to select articles for full scraping.
  - Rationale: Manages LLM context window limitations and reduces token costs by deferring full content parsing.
  - Limitation: Potential scaling issues with a large number of articles.
- AI-Powered Summarization: Leverages GPT-4o for personalized news digests.
  - Benefit: Provides a main headline for a quick overview.
  - Issue: AI summaries may occasionally contain inaccuracies (hallucinations).
- Multiple News Sources: Scrapes from various platforms, including TechCrunch and CNN.
- Category Filtering: Organizes news based on website categories (e.g., AI, technology).
  - Limitation: Direct mapping can be restrictive for overlapping topics (e.g., AI, technology, apps) categorized differently across sites.
- Customizable Prompts: Allows users to tailor the AI summarization prompt.
- Email Delivery: Sends formatted HTML emails using SendGrid.
How it Works
- Phase 1: Initial Scraping:
  - Fetches headlines, URLs, and short descriptions from news source category or search result pages.
  - Focuses on readily available information on listing pages.
  - Defers full article content download to save bandwidth and processing time.
- Phase 2: AI Selection:
  - Employs OpenAI to analyze headlines and descriptions (see the sketch after this list).
  - Selects articles based on user-defined interests (e.g., AI, tech innovations).
- Phase 3: Detailed Scraping:
  - Fetches full content only for the AI-selected articles.
  - Optimizes bandwidth and processing by scraping only necessary content.
- Phase 4: Summarization:
  - Sends selected articles to GPT-4o for summarization.
  - Creates a personalized digest highlighting key information.
- Phase 5: Email Delivery:
  - Formats the summary into an HTML email.
  - Sends the digest to specified recipients via SendGrid.
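The Phase 2 selection step boils down to a single chat-completion call over the Phase 1 metadata. Below is a minimal sketch; the prompt wording, the JSON reply format, and the example headlines are illustrative assumptions, not the project's actual (customizable) prompt:

```python
# Hedged sketch of Phase 2 (AI selection); assumes OPENAI_API_KEY is set.
import json
from openai import OpenAI

client = OpenAI()

headlines = [  # Phase 1 output: metadata only, no full article content
    {"title": "New GPT-4o features announced", "url": "https://example.com/a"},
    {"title": "Local bakery wins award", "url": "https://example.com/b"},
]
interests = "AI in education, GPT-4o updates"

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Pick the articles matching the user's interests. Reply with only a JSON array of URLs."},
        {"role": "user", "content": f"Interests: {interests}\nHeadlines: {json.dumps(headlines)}"},
    ],
)
# Only these URLs proceed to Phase 3 (full scraping). A production version
# would validate that the reply actually parses as JSON.
selected_urls = json.loads(response.choices[0].message.content)
```

Because only the chosen URLs are scraped in full, the bulk of each article's text never reaches the model unless the article was selected.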
Usage
Basic Run
```bash
python -m __main__ --test
```
Custom Run
```bash
python -m __main__ --test --categories ai,technology --emails someone@example.com --interests "AI in education, GPT-4o updates"
```
Configuration
Key settings are configured via environment variables:
- OPENAI_API_KEY: Your OpenAI API key for summarization.
- SENDGRID_API_KEY: Your SendGrid API key for email delivery.
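For illustration, a minimal sketch of how the two keys are consumed; the sender address, recipient, subject, and body are placeholders:

```python
# Reads the keys named above from the environment; the Mail fields are
# illustrative placeholders, not the project's actual digest content.
import os

from sendgrid import SendGridAPIClient
from sendgrid.helpers.mail import Mail

OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]      # consumed by the summarization calls
SENDGRID_API_KEY = os.environ["SENDGRID_API_KEY"]  # consumed by email delivery

message = Mail(
    from_email="digest@example.com",
    to_emails="someone@example.com",
    subject="Daily News Digest",
    html_content="<h1>Today's top stories</h1>",
)
SendGridAPIClient(SENDGRID_API_KEY).send(message)
```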
Scraper System Design
The news scraping functionality is modular and easily extensible, utilizing a base scraper class and a scraper manager.
- scrapers/base.py:
  - Defines NewsArticle: A TypedDict standardizing the structure for fetched news articles (title, URL, description, content, etc.).
  - Defines NewsScraper(ABC): An abstract base class for individual news source scrapers, providing a common interface and shared functionality (e.g., HTML fetching via requests, parsing with BeautifulSoup).
  - Concrete scrapers must implement the source-specific abstract methods:
    - _select_article_elements(): Finds article entries on a category page.
    - _extract_article_from_list_item(): Parses basic article info (headline, URL, short description) for Phase 1.
    - fetch_article_by_url(): Fetches a single, complete article by its URL.
    - _extract_article_info(): Parses full content and details from an individual article page for Phase 3.
- scrapers/<source_name>.py (e.g., scrapers/techcrunch.py, scrapers/cnn.py):
  - Each file implements a scraper class inheriting from NewsScraper.
  - Contains unique selectors and parsing logic tailored to the specific news website's HTML structure.
- scrapers/manager.py:
  - Defines ScraperManager: A central registry and coordinator for all scrapers.
  - Initializes and stores instances of concrete scrapers (e.g., TechCrunchScraper(), CNNScraper()) in a dictionary, keyed by a source identifier (e.g., "techcrunch").
  - Provides methods like fetch_headlines() (Phase 1) and fetch_detailed_content() (Phase 3), delegating work to the appropriate scraper.
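To make the division of responsibilities concrete, here is a minimal sketch of the base module. The class and method names come from the description above; the _fetch_html helper and any NewsArticle fields beyond title, URL, description, and content are assumptions:

```python
# Sketch of scrapers/base.py; _fetch_html is an assumed name for the
# shared requests/BeautifulSoup helper described above.
from abc import ABC, abstractmethod
from typing import TypedDict

import requests
from bs4 import BeautifulSoup


class NewsArticle(TypedDict, total=False):
    title: str
    url: str
    description: str
    content: str


class NewsScraper(ABC):
    def _fetch_html(self, url: str) -> BeautifulSoup:
        # Shared functionality: download a page and parse it.
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        return BeautifulSoup(response.text, "html.parser")

    @abstractmethod
    def _select_article_elements(self, soup: BeautifulSoup) -> list:
        """Find article entries on a category page (Phase 1)."""

    @abstractmethod
    def _extract_article_from_list_item(self, element) -> NewsArticle:
        """Parse headline, URL, and short description (Phase 1)."""

    @abstractmethod
    def fetch_article_by_url(self, url: str) -> NewsArticle:
        """Fetch a single, complete article by its URL."""

    @abstractmethod
    def _extract_article_info(self, soup: BeautifulSoup) -> NewsArticle:
        """Parse full content from an article page (Phase 3)."""
```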
Adding a New News Source
To integrate a new news source (e.g., "NewSite"):
- Create Scraper Module: Add scrapers/newsite_scraper.py.
- Implement Scraper Class: In newsite_scraper.py, define NewSiteScraper(NewsScraper).
  - Implement all NewsScraper abstract methods with logic specific to NewSite's website structure.
- Register in __init__.py: In scrapers/__init__.py, add from . import newsite_scraper to make the module accessible.
- Register in Manager: In scrapers/manager.py:
  - Import the new class: from .newsite_scraper import NewSiteScraper.
  - Add an instance to self.scrapers in ScraperManager.__init__: "newsite": NewSiteScraper(). (Use a consistent key for the source.)
The ScraperManager and main application logic will then automatically support the new source.
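As a concrete illustration, here is a hedged sketch of what the NewSite scraper might look like. All CSS selectors are placeholders, and _fetch_html refers to the hypothetical shared helper from the base-class sketch above, not a confirmed project API:

```python
# scrapers/newsite_scraper.py -- hypothetical example; the selectors below
# would need to match NewSite's actual HTML structure.
from .base import NewsArticle, NewsScraper


class NewSiteScraper(NewsScraper):
    def _select_article_elements(self, soup):
        return soup.select("article.story")  # placeholder listing-page selector

    def _extract_article_from_list_item(self, element):
        # Phase 1: headline, URL, and short description only.
        link = element.select_one("a")
        return NewsArticle(
            title=link.get_text(strip=True),
            url=link["href"],
            description=element.select_one("p.summary").get_text(strip=True),
        )

    def fetch_article_by_url(self, url):
        return self._extract_article_info(self._fetch_html(url))

    def _extract_article_info(self, soup):
        # Phase 3: full article content.
        return NewsArticle(
            title=soup.select_one("h1").get_text(strip=True),
            content=soup.select_one("div.article-body").get_text(strip=True),
        )
```

Once the class is registered in ScraperManager.__init__ as described above, no other code changes are needed.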
SCMP API Key
The SCMPScraper uses an API key to fetch articles from the South China Morning Post's content API. This key is currently hardcoded in scrapers/scmp_scraper.py within the SCMP_API_HEADERS dictionary.
Purpose of the API Key:
This API key is likely used by SCMP's own website frontend to authenticate its requests to their backend API. It's a common practice for websites to use such keys to:
- Identify legitimate requests originating from their own applications.
- Apply general rate limiting or quotas.
- Provide a basic layer of access control to their internal APIs.
The key is not unique to your session but is rather a general key embedded in SCMP's frontend code.
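For reference, the shape of the hardcoded headers and their use looks roughly like this. This is a sketch only: the endpoint path and query parameter mirror the request names mentioned in the steps below, and the key value itself is deliberately elided:

```python
# Illustrative only: the real key lives in scrapers/scmp_scraper.py, and the
# exact endpoint and parameters are whatever SCMP's frontend currently uses.
import requests

SCMP_API_HEADERS = {"apikey": "<key copied from SCMP's frontend requests>"}

response = requests.get(
    "https://apigw.scmp.com/content-delivery/v2",  # assumed endpoint family
    params={"operationName": "aroundHomeQuery"},   # observed query name
    headers=SCMP_API_HEADERS,
    timeout=10,
)
response.raise_for_status()
```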
How to Find/Update the API Key if it Changes:
If the SCMPScraper starts failing with authentication errors (e.g., HTTP 401 or 403), the API key might have been changed by SCMP. To find the new key:
- Open the SCMP website (e.g., https://www.scmp.com) in your web browser.
- Open your browser's Developer Tools (usually by pressing F12, or by right-clicking the page and selecting "Inspect" or "Inspect Element").
- Navigate to the "Network" tab within the Developer Tools.
- Filter the requests if possible (e.g., by "Fetch/XHR", or by looking for requests to apigw.scmp.com).
- As you browse the SCMP site (e.g., by loading the homepage or a category page), new network requests will appear in the list.
- Look for requests made to apigw.scmp.com (specifically those related to content-delivery/v2 or with operationName=aroundHomeQuery).
- Click on one of these requests to view its details.
- Examine the "Headers" section (specifically "Request Headers") and look for an apikey header or a similar authorization-related header.
- If the value of this key differs from the one in scrapers/scmp_scraper.py, update the SCMP_API_HEADERS dictionary in the script with the new key.
While this key might be stable for extended periods, it's good practice to know how to find it if the scraper stops working due to authentication issues. Automating the fetching of this key is possible but can be brittle and is likely not necessary unless the key changes very frequently.
Future Improvements
This section outlines potential enhancements:
Data Processing and Efficiency
- Persistent Storage: Implement a database (e.g., SQLite, PostgreSQL) to store processed article information (see the sketch after this list).
  - Benefits: Avoid reprocessing, save API calls, track article history.
- Individual Article Summarization: Summarize selected articles individually (cost permitting).
  - Potential: Reduces AI hallucinations, but increases LLM API costs.
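A minimal sketch of what the proposed persistent store could look like with SQLite; the table name and columns are assumptions, since this feature does not exist yet:

```python
# Hypothetical schema for tracking processed articles, so reruns can skip
# URLs that were already scraped and summarized.
import sqlite3

conn = sqlite3.connect("articles.db")
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS processed_articles (
        url TEXT PRIMARY KEY,
        title TEXT,
        summary TEXT,
        processed_at TEXT DEFAULT CURRENT_TIMESTAMP
    )
    """
)
conn.commit()

def already_processed(url: str) -> bool:
    # Skip articles seen on a previous run, saving scraping and API calls.
    row = conn.execute(
        "SELECT 1 FROM processed_articles WHERE url = ?", (url,)
    ).fetchone()
    return row is not None
```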
AI and Personalization
- Advanced Article Selection: Explore selection methods beyond the current two-phase LLM approach:
  - Recommendation algorithms (user history, feedback).
  - Retrieval Augmented Generation (RAG) for context-aware selection.
- Enhanced Category Matching: Improve category filtering beyond direct website mapping (see the sketch after this list):
  - Fuzzy matching or semantic similarity for user-defined interests, especially for nuanced topics.
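As a rough illustration of the fuzzy-matching idea (plain string similarity via difflib; an embedding-based semantic comparison would be the stronger variant). The function name and threshold are arbitrary examples:

```python
# Simple string-level fuzzy match between a site category and a user interest.
from difflib import SequenceMatcher

def matches_interest(category: str, interest: str, threshold: float = 0.5) -> bool:
    ratio = SequenceMatcher(None, category.lower(), interest.lower()).ratio()
    return ratio >= threshold

print(matches_interest("technology", "tech"))             # True: close strings
print(matches_interest("artificial intelligence", "ai"))  # False: fuzzy, not semantic
```

The second example shows the limitation: string similarity misses abbreviations and synonyms, which is exactly where semantic similarity would help.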
Scraping Capabilities
- Support for Single Page Applications (SPAs): Integrate browser automation (Selenium, Playwright) for JavaScript-driven content (see the sketch below).
- Handling Pagination and Infinite Scrolling: Enhance scrapers to:
  - Detect and follow "next page" links.
  - Simulate scrolling for sites with infinite scrolling.
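For the SPA case, a minimal Playwright sketch of fetching fully rendered HTML, which could then be handed to the existing BeautifulSoup parsing; the function name is illustrative:

```python
# Renders a JavaScript-driven page in a headless browser and returns the
# final HTML; requires `pip install playwright` and `playwright install`.
from playwright.sync_api import sync_playwright

def fetch_rendered_html(url: str) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        html = page.content()
        browser.close()
        return html
```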