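I am scraping https://www.zhihu.com/billboard by fetching the page HTML and parsing it, but I cannot find the jump link for each entry. How should I handle this?

Scraping the Zhihu hot list: where do the jump links come from?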

Solved: I used a base64 conversion to obtain the article ID.
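
The post does not show the decoding step itself, so here is a minimal sketch of what it might look like. It assumes the page exposes a base64 token whose decoded text contains the numeric ID, and that the target is a question page; `extractIdFromBase64`, `buildQuestionUrl`, and the token layout are hypothetical, not taken from Zhihu's markup.

```typescript
// Hypothetical helper: decode a base64 token and return the first run of digits as the ID.
// atob yields a binary string, which is sufficient here because only ASCII digits are read.
export function extractIdFromBase64(token: string): string | null {
  const decoded = atob(token);
  const match = decoded.match(/\d+/);
  return match ? match[0] : null;
}

// Assumption: the entry is a question, so the link has the form /question/<id>.
export function buildQuestionUrl(id: string): string {
  return `https://www.zhihu.com/question/${id}`;
}
```

If the decoded text has a different layout, only the regular expression needs to change.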

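The main implementation is as follows.

Many sites have anti-scraping measures, so I use puppeteer to fetch the data. My backend is Nest.js.

The example below fetches the Juejin hot list; the other sites work the same way.

Controller, hot.controller.ts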

  @Get('juejin') // route
  @Public() // this endpoint can be called without a token
  getHotSearchJuejin() {
    return this.weiboService.getHotSearchJuejin();
  }

Service, hot.service

import { Injectable } from '@nestjs/common';
import puppeteer from 'puppeteer';
import axios from 'axios';
import { createWindow } from 'domino';

@Injectable()
export class WeiboService {
  private hotJueJinData: any[] = []; // cached hot-list entries

  private async fetchHTMLWithPuppeteer(url: string): Promise<string> {
    const browser = await puppeteer.launch({
      // executablePath: '/usr/bin/chromium-browser', // Chromium path; needed when deploying to a server
      headless: true,
      args: [
        '--disable-gpu',
        '--disable-dev-shm-usage',
        '--disable-setuid-sandbox',
        '--no-first-run',
        '--no-sandbox',
        '--no-zygote',
        '--single-process',
      ],
    });
    try {
      const page = await browser.newPage(); // open a page in the headless browser
      await page.goto(url, { waitUntil: 'networkidle2' }); // wait for the page to finish loading
      return await page.content(); // rendered HTML
    } finally {
      await browser.close(); // always close the browser, even if navigation fails
    }
  }

  // Returns the trimmed text of each match, or the value of the attribute
  // named by `href` when that argument is given.
  private parseHTML(html: string, selector: string, href?: string): any[] {
    const window = createWindow(html);
    const document = window.document;
    const elements = Array.from(document.querySelectorAll(selector));
    if (href) {
      return elements.map((element: any) => element.getAttribute(href));
    }
    return elements.map((element: any) => element.textContent.trim());
  }

  async fetchJueJinHotSearch() {
    const url = 'https://juejin.cn/hot';
    const html = await this.fetchHTMLWithPuppeteer(url);
    const title = this.parseHTML(html, '.article-title');
    const hot = this.parseHTML(html, '.hot-number');
    const href = this.parseHTML(html, '.article-item-link', 'href');

    this.hotJueJinData = title.map((item, index) => {
      return {
        note: item,
        num: hot[index],
        href: 'https://juejin.cn' + href[index],
        type: 'juejin',
      };
    });

    console.log('Juejin', this.hotJueJinData);
    return this.hotJueJinData;
  }

  // Return the hot-list data, serving the cached copy when one exists
  async getHotSearchJuejin() {
    if (this.hotJueJinData.length > 0) return this.hotJueJinData;
    return await this.fetchJueJinHotSearch();
  }
}

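That is basically the whole implementation. It can be optimized further: run a scheduled task to fetch the data, so that each API call simply reads the cached result. The puppeteer side can also be optimized, for example by creating a browser pool, configuring pm2, load balancing, and allocating CPUs.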
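
The scheduled-refresh idea could be sketched roughly as follows. This is not the author's code: `RefreshingCache` is a hypothetical helper built on a plain `setInterval`, chosen so the example stays dependency-free. In a Nest.js app a scheduling module would likely replace the timer.

```typescript
// Hypothetical helper: keeps the latest scrape result in memory and refreshes it on a timer.
export class RefreshingCache<T> {
  private data: T | null = null;
  private timer: ReturnType<typeof setInterval> | null = null;
  private inflight: Promise<T> | null = null;
  private readonly fetcher: () => Promise<T>;
  private readonly intervalMs: number;

  constructor(fetcher: () => Promise<T>, intervalMs: number) {
    this.fetcher = fetcher;
    this.intervalMs = intervalMs;
  }

  // Run the fetcher and store its result; concurrent callers share one request.
  refresh(): Promise<T> {
    if (!this.inflight) {
      this.inflight = this.fetcher()
        .then((result) => {
          this.data = result;
          return result;
        })
        .finally(() => {
          this.inflight = null;
        });
    }
    return this.inflight;
  }

  // Start periodic refreshes; a failed scrape is logged and the old data is kept.
  start(): void {
    if (this.timer) return;
    this.timer = setInterval(() => {
      this.refresh().catch((err) => console.error('refresh failed', err));
    }, this.intervalMs);
  }

  stop(): void {
    if (this.timer) clearInterval(this.timer);
    this.timer = null;
  }

  // Serve cached data when present; fall back to a fetch on the first call.
  async get(): Promise<T> {
    return this.data ?? (await this.refresh());
  }
}
```

A service could hold something like `new RefreshingCache(() => this.fetchJueJinHotSearch(), 10 * 60 * 1000)`, call `start()` once, and return `get()` from the endpoint. The ten-minute interval is only an illustration.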

const getReactInstance = (dom, prefix) => {
  if (!dom) {
    return;
  }
  const __reactKey = Object.keys(dom || {}).filter(key => key.startsWith(prefix))?.[0];
  if (__reactKey && __reactKey in dom) {
    return dom[__reactKey];
  }
}
const reactInstance = getReactInstance(document.querySelectorAll('.HotList-item')[0], '__reactFiber$')
reactInstance.return.memoizedProps.item.target.link.url
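
The snippet above reads the link of the first item only and throws if the fiber is missing. A null-safe variant that covers every item might look like the sketch below. The property path is the one from the snippet; `FiberLike`, `getReactFiber`, and `extractHotListLinks` are hypothetical names, and the approach relies on React internals that can change without notice.

```typescript
// Only the fiber fields used here, following the property path in the snippet above.
interface FiberLike {
  return?: { memoizedProps?: { item?: { target?: { link?: { url?: string } } } } };
}

// Find the React fiber attached to a DOM node; its key starts with "__reactFiber$".
export function getReactFiber(
  node: object | null | undefined,
  prefix = '__reactFiber$',
): FiberLike | undefined {
  if (!node) return undefined;
  const key = Object.keys(node).find((k) => k.startsWith(prefix));
  return key ? (node as Record<string, FiberLike>)[key] : undefined;
}

// Collect the link URL of every item, skipping items whose fiber or props are missing.
export function extractHotListLinks(nodes: ArrayLike<object>): string[] {
  return Array.from(nodes)
    .map((node) => getReactFiber(node)?.return?.memoizedProps?.item?.target?.link?.url)
    .filter((url): url is string => typeof url === 'string');
}
```

This has to run in the page context, for example as `extractHotListLinks(document.querySelectorAll('.HotList-item'))` in the browser console, because the fiber properties exist only on live DOM nodes and not in the serialized HTML.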
