Real-world testing of Claude Opus 4.8: It does the work even better, but its words are even harsher.

This morning, Anthropic officially surpassed OpenAI as it announced its new valuation, and it released Claude Opus 4.8, the latest version of its flagship product line, which had been rumored for the past two days.
Jun 1st, 2026
  We got our hands on it immediately and collected early feedback from the user community. The conclusion: it is more capable, but its "personality" has become harder to work with.
  APPSO Testing: The brain has upgraded, but the mouth is gone.
  We did not use Anthropic's prepared benchmark scenarios. Instead, we tested it against a real need of our own: extracting and archiving the complete historical conversation records from an online collaboration platform. The data ran to more than 30MB, was scattered throughout the front-end interface, and had no readily available export button. This type of task tests not whether the model can write code, but whether it can work with a non-professional developer to figure out a task from scratch and complete it.
  It began with an accidental discovery. A colleague running the test noticed that the platform's front end would flash early historical records at certain moments, as if the data had been briefly loaded onto the client and then retracted. He passed this observation on to 4.8 without any technical description, saying only in plain language, "I saw some old messages flash by and then disappear."
  4.8 understood what he meant and made the correct call: the data is loaded through an interface (API) request and can be intercepted at the browser's network layer. It then laid out an operational plan and guided him through the steps: open the developer tools, go to the Network panel, filter by keyword, and locate the target request. The judgment was accurate and the reasoning was clear.
  But here lies the contradiction in 4.8: its thinking is strong, but its expression is... cumbersome. Every technical solution is correct, yet the explanation of each step runs to two or three sentences. Ask about a method and it first gives you an "Of course! Let's take it step by step," then pulls out a bulleted list, and then appends a "supplementary explanation" at the end of the list on why it should be done this way. What could be explained in three sentences takes three screens of text. I just don't know how to code; it's not as if something is wrong with my brain.
  This is not a new problem in 4.8; it is a long-standing issue that has existed in the Opus series since 4.7. Despite repeated criticism, this version has not improved on it and may even be worse.
  The most time-consuming part was the error-correction phase. After following the first solution, the user hit an error. 4.8 accurately identified the problem, provided a new solution, and did not repeat the failed steps. This is definitely better than 4.6, which would occasionally forget what had already been tried over multiple rounds of error correction. Admitting mistakes is good, but there is no need to be so rigid about it: with the added cause analysis and bulleted list, what should be a technical issue review reads like a customer service email.
  Ultimately, the data was exported in full in HAR format, and the cleaning and layering with custom scripts were all completed successfully.
  Some users have not yet received the Claude Code update, but Claude for Chrome is already on 4.8, and it has also been rolled out to major office tools such as Notion. We tested using Claude to perform basic tasks such as searching and filling out forms in Chrome.
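  The article does not show the custom scripts used after the HAR export, so the following is only a minimal Python sketch of what such a cleanup step might look like. It relies on the standard HAR layout (log.entries, request.url, response.content.text). Everything else is an assumption made for illustration: the URL keyword, the "messages" wrapper key, and the "id" and "timestamp" field names are hypothetical and would have to be adapted to the platform's real responses.

```python
import base64
import json


def extract_messages(har: dict, url_keyword: str) -> list[dict]:
    """Collect JSON records from HAR entries whose request URL contains url_keyword."""
    records = []
    for entry in har["log"]["entries"]:
        if url_keyword not in entry["request"]["url"]:
            continue  # not the target request
        content = entry["response"].get("content", {})
        text = content.get("text")
        if not text:
            continue  # the response body was not captured
        if content.get("encoding") == "base64":
            text = base64.b64decode(text).decode("utf-8", errors="replace")
        try:
            payload = json.loads(text)
        except json.JSONDecodeError:
            continue  # not a JSON response
        # Assumption: the payload is either a list of message objects
        # or an object wrapping that list under a "messages" key.
        if isinstance(payload, dict):
            items = payload.get("messages", [])
        elif isinstance(payload, list):
            items = payload
        else:
            items = []
        if isinstance(items, list):
            records.extend(item for item in items if isinstance(item, dict))
    return records


def dedupe_and_sort(records: list[dict], id_key: str = "id",
                    time_key: str = "timestamp") -> list[dict]:
    """Drop repeated records (same id_key) and order the rest by time_key."""
    unique = {}
    for record in records:
        if id_key in record:
            unique[record[id_key]] = record  # later copies overwrite earlier ones
    return sorted(unique.values(), key=lambda r: r.get(time_key, 0))
```

  In use, the exported file would be read with json.load, passed to extract_messages together with whatever keyword identified the target request in the Network panel, and the result handed to dedupe_and_sort before archiving. Deduplication matters here because the same history page may be requested more than once during capture.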
