Hi!

Kagi has had a rough couple of months on the PR side, and a comment from another Lemmy user arguing that they aren’t using Google’s index set me off… because just a couple of weeks ago I had read on their own website that they primarily use Google’s search index.

Lo and behold, that user was “right”: No mention of Google whatsoever on Kagi’s Search Sources page. If that’s all you had to go off of, you’d be excused for thinking they are only using their internal index to power their web search since that’s what they now strongly imply. The only “reference” to external indexes is this nebulous sentence:

Our search results also include anonymized API calls to all major search result providers worldwide, specialized search engines like Marginalia, and sources of vertical information […]

… Unless one goes to check that pesky Wayback Machine. Here is the same page from March 2024, which I will copy/paste here for posterity:

Search Sources

You can think of Kagi as a “search client,” working like an email client that connects to various indexes and sources, including ours, to find relevant results and package them into a superior, secure, and privacy-respecting search experience, all happening automatically and in a split-second for you.

External

Our data includes anonymized API calls to traditional search indexes like Google, Yandex, Mojeek and Brave, specialized search engines like Marginalia, and sources of vertical information like Wolfram Alpha, Apple, Wikipedia, Open Meteo, Yelp, TripAdvisor and other APIs. Typically every search query on Kagi will call a number of different sources at the same time, all with the purpose of bringing the best possible search results to the user.

For example, when you search for images in Kagi, we use 7 different sources of information (including non-typical sources such as Flickr and Wikipedia Commons), trying to surface the very best image results for your query. The same is also the case for Kagi’s Video/News/Podcasts results.

Internal

But most importantly, we are known for our unique results, coming from our web index (internal name - Teclis) and news index (internal name - TinyGem). Kagi’s indexes provide unique results that help you discover non-commercial websites and “small web” discussions surrounding a particular topic. Kagi’s Teclis and TinyGem indexes are both available as an API.

We do not stop there and we are always trying new things to surface relevant, high-quality results. For example, we recently launched the Kagi Small Web initiative which platforms content from personal blogs and discussions around the web. Discovering high quality content written without the motive of financial gain, gives Kagi’s search results a unique flavor and makes it feel more humane to use.


Of course, running an index is crazy expensive. By their own admission, Teclis is narrowly focused on “non-commercial websites and ‘small web’ discussions”. Mojeek indexes nowhere near enough pages to meaningfully compete with Google, and Yandex specializes in the Russosphere. Bing (Google’s only meaningful direct indexing competitor) is not named, so I assume they don’t use it. So it’s not a leap to say that Google powers most of Kagi’s English-language web searches, just like Bing powers almost all search alternatives such as DDG.

I don’t personally mind that they use Google as an index (it makes the most sense, it’s still the highest-quality one out there IMO, and Kagi can’t compete with Google’s sheer capital on the indexing front). But I do mind a lot that they aren’t being transparent about it anymore. This is very shady and misleading, which is a shame because Kagi otherwise provides a valuable, higher-quality service than Google’s free search does.

    • BaroqueInMind@lemmy.one
      6 months ago

      It still requires the use of Google/Bing/etc. API calls. There’s literally no way to truly self-host a web-indexing search engine without sacrificing your privacy or paying millions of dollars.

      • hedgehog@ttrpg.network
        6 months ago

        You can use YaCy, which can be run as an independent self-hosted index (in “Local” mode), where it will index sites visited as part of web crawls that you initiate, or you can run it as part of a decentralized peer-to-peer network of indexes.

        YaCy has its own search UI but you can also set up SearXNG to use it.

        • BaroqueInMind@lemmy.one
          6 months ago

          I mentioned this software a while back here on Lemmy, and someone with actual expertise said that running YaCy on non-commercial hardware becomes untenable after a while due to the poor-quality network, incompatible indexing algorithms, the massive database, and search query response times.

          • Dark Arc@social.packetloss.gg
            6 months ago

            I am not that person, but the only way I see YaCy being useful/usable long term is as a web crawler for specific sites that you personally find high value in, while regularly pruning irrelevant index data.

    • Cheradenine@sh.itjust.works
      6 months ago

      You can do it, or use one of the instances at searx.space

      Searx is great, it’s all I use, but it’s a metasearch engine, there is not a ‘Searx Index’ which is what this is about.

      • hedgehog@ttrpg.network
        6 months ago

        there is not a ‘Searx Index’ which is what this is about.

        There’s YaCy, which includes a search index (which can be independent or can join a P2P network of indexes), a web crawler, and a web UI for searching. It can also be added as a SearXNG engine.
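
For anyone who wants to poke at a self-hosted YaCy index directly instead of going through its web UI or SearXNG, here is a rough Python sketch. It assumes a YaCy instance on the default local port (8090) and uses the `yacysearch.json` endpoint with the `query`, `maximumRecords` and `resource` parameters as I understand them; the `channels`/`items` response layout is likewise an assumption, so check it against your own instance. The function names are made up for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: a YaCy instance running locally on its default port.
YACY_BASE_URL = "http://localhost:8090"


def build_search_url(query, base_url=YACY_BASE_URL, max_records=10, resource="local"):
    """Build a search URL for YaCy's JSON search endpoint.

    resource="local" asks only the instance's own index;
    resource="global" would also ask the P2P network.
    """
    params = urlencode(
        {"query": query, "maximumRecords": max_records, "resource": resource}
    )
    return f"{base_url}/yacysearch.json?{params}"


def extract_results(payload):
    """Flatten the assumed channels -> items layout into title/url pairs."""
    results = []
    for channel in payload.get("channels", []):
        for item in channel.get("items", []):
            results.append(
                {"title": item.get("title", ""), "url": item.get("link", "")}
            )
    return results


def search(query):
    """Query the local index (hypothetical helper; needs a running instance)."""
    with urlopen(build_search_url(query), timeout=10) as response:
        return extract_results(json.load(response))
```

With an instance running, `search("small web")` would return whatever your own crawls have indexed for that query, which is exactly the limitation discussed above: the results are only as good as what you have crawled.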