You’ve probably clocked our claims that our activities are ‘backed by science’ or ‘grounded in evidence’. The wellness industry is flooded with claims like these, but rarely does anyone show their work. So, like any sane citizen of the internet, you might be wondering whether we are full of...it. Here we lay it all out for you: exactly how we calculate the activity evidence ratings that ultimately power our recommendations.
The scoring challenge
Even within academia, there is no universally agreed way to rank research quality. Some researchers prioritise large sample sizes, others focus on study design, and still others emphasise whether the findings have been replicated. All are helpful. None are perfect.
So this creates a practical problem. When you want to know whether meditation really helps with stress, you’ll find dozens of studies with conflicting designs, sample sizes, and conclusions. How do you decide which ones to trust? How do you weigh a small, carefully controlled study against a large observational one?
The gold standard approach in academia is to conduct a systematic review, where researchers spend months or years gathering every relevant study on a topic, evaluating their quality using standardised criteria, and synthesising the findings. These systematic reviews typically take 12-18 months per topic because they involve extensive searching, detailed quality assessments, and careful analysis of sometimes contradictory results.
With potentially hundreds of activities to evaluate, we needed something faster but still rigorous. So, we built our own system, knowing it wouldn't be perfect but hoping it would be honest and useful.
Estimating each research paper’s quality
We give each paper an evidence quality score out of 4 as an estimate of the quality of its research. The type of study has the strongest influence on this score, complemented by what are known as “bibliometrics”, specifically journal impact and methodological rigour (more on these below).
Here's how it works in practice:
Type of Study carries the most weight. The scoring system we use is primarily based on widely recognised evidence hierarchies (see the footnote if you want to understand more about evidence hierarchies in research). The more rigorous the study design, the higher its quality score. Systematic reviews and meta-analyses get the highest scores (5) because they synthesise evidence from multiple studies. Randomised controlled trials (4) follow closely, valued for their ability to control variables and establish causality. Non-randomised trials such as longitudinal studies (3), observational studies (2), and surveys (1) score lower due to limitations in controlling for confounding factors, though they still provide valuable real-world insights. Non-empirical papers like literature reviews are not included.
Journal reputation matters, but not too much. We include bibliometrics that measure journal quality and impact within their fields - Source Normalised Impact per Paper (SNIP) and SCImago Journal Rank (SJR) scores - recognising that quality often varies across journals. However, we deliberately limit their influence by dividing the combined scores by 1.75, so that studies don’t rank highly just because they appeared in prestigious journals (or get marked down because they didn’t).
Transparency gets rewarded. We also use the Rigor and Transparency Index (RTI; Menke et al., 2020), available through the online methods review tool SciScore. RTI evaluates the reporting standards of the journal of publication: how rigorous, transparent and reproducible its published research typically is. For our calculation, papers published in the top-performing 10% of journals get an RTI score of 2, the top 25% a score of 1.75, and the top 50% a score of 1.5. This RTI adjustment is divided by 1.5 (so that a journal’s reputation for methodological transparency enhances, but does not disproportionately affect, the final quality score), then applied to the combined SNIP and SJR scores.
Finally, the overall score is divided by 2 and capped at a highest possible score of 4.
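To make the arithmetic concrete, here’s a minimal sketch of the calculation in Python. Treat it as illustrative only: the way the RTI adjustment is applied to the SNIP and SJR component (by multiplication here) and the neutral value used for journals outside the top 50% are simplifying assumptions for this sketch, not our production code.

```python
# Illustrative sketch of the paper quality score described above.
# Assumptions: the RTI adjustment multiplies the SNIP+SJR component,
# and journals outside the top 50% for RTI get a neutral value of 1.0.

STUDY_TYPE_SCORES = {
    "systematic_review": 5,       # includes meta-analyses
    "rct": 4,                     # randomised controlled trials
    "non_randomised_trial": 3,    # e.g. longitudinal studies
    "observational": 2,
    "survey": 1,
}

def rti_adjustment(rti_percentile: float) -> float:
    """Map the journal's Rigor and Transparency Index percentile to an adjustment."""
    if rti_percentile >= 90:
        rti = 2.0       # top-performing 10% of journals
    elif rti_percentile >= 75:
        rti = 1.75      # top 25%
    elif rti_percentile >= 50:
        rti = 1.5       # top 50%
    else:
        rti = 1.0       # assumed neutral value (not specified above)
    return rti / 1.5    # dampen so transparency enhances but doesn't dominate

def paper_quality_score(study_type: str, snip: float, sjr: float,
                        rti_percentile: float) -> float:
    bibliometrics = (snip + sjr) / 1.75              # limit influence of journal prestige
    bibliometrics *= rti_adjustment(rti_percentile)  # reward transparent journals
    raw = STUDY_TYPE_SCORES[study_type] + bibliometrics
    return min(raw / 2, 4.0)                         # scale and cap at 4

# Example: an RCT in a journal with SNIP 1.4, SJR 1.1, in the top 25% for RTI
print(round(paper_quality_score("rct", 1.4, 1.1, 80), 2))  # -> 2.83
```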
Paper Quality grades:
★☆☆☆: Preliminary
Typically exploratory studies or early-stage research with potential limitations in design or methodology. These papers are published in journals with less emphasis on research transparency and reproducibility practices and have below-average field-normalised citations, but may generate hypotheses for future research.
★★☆☆: Substantive
Well-conducted studies demonstrating good research practices and published in reputable peer-reviewed journals. These papers have average field-normalised citations and make clear contributions to their field, following sound methodological principles.
★★★☆: Robust
High-quality studies with strong research designs, published in well-respected journals with rigorous peer-review processes. These papers demonstrate above-average field-normalised citations, significantly contributing to their field with reliable and impactful findings.
★★★★: Comprehensive
Top-tier evidence syntheses, such as high-quality systematic reviews or meta-analyses, published in journals with strong reproducibility and transparency practices. These papers provide comprehensive analysis of existing evidence, have high field-normalised citation impact, and often guide clinical practice or policy.
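The cut-offs that map a numeric quality score onto these star grades aren’t spelled out above, so the bands in this small sketch are hypothetical and purely for illustration:

```python
# Hypothetical mapping from a 0-4 quality score to the star grades above.
# The actual cut-offs are not specified; these bands are illustrative only.

def star_grade(score: float) -> str:
    if score > 3:
        return "★★★★ Comprehensive"
    if score > 2:
        return "★★★☆ Robust"
    if score > 1:
        return "★★☆☆ Substantive"
    return "★☆☆☆ Preliminary"

print(star_grade(2.83))  # -> ★★★☆ Robust
```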
Combining the evidence for each activity
After each individual paper’s quality is scored, we take the 10 papers with the highest quality scores that meet our inclusion criteria, and then factor in what the research actually showed (e.g., positive, negative, or no effects on wellbeing).
Direction of effect determines the final scores
We multiply the paper’s evidence quality score by the direction of the effects on wellbeing: +1 for positive wellbeing impacts, -1 for negative effects, and 0 for null, mixed or unclear findings.
These direction × quality scores are then summed and divided by 10 to give the activity evidence factor, so the final score reflects both research quality and actual benefit.
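In code, following the description above, the combination might look something like this (the papers and scores are made up for illustration):

```python
# Illustrative sketch of the activity evidence factor: take the 10 highest-quality
# papers meeting inclusion criteria, multiply each quality score by the direction
# of its effect on wellbeing, sum, and divide by 10.

DIRECTION = {"positive": 1, "negative": -1, "null_or_mixed": 0}

def activity_evidence_factor(papers: list[tuple[float, str]]) -> float:
    """papers: (quality_score, direction) pairs for studies meeting inclusion criteria."""
    top_ten = sorted(papers, key=lambda p: p[0], reverse=True)[:10]
    weighted = sum(quality * DIRECTION[direction] for quality, direction in top_ten)
    return weighted / 10  # always divided by 10, per the description above

# Example: eight positive findings, one null, one negative (hypothetical scores)
example = [(3.8, "positive"), (3.5, "positive"), (3.4, "positive"),
           (3.1, "positive"), (2.9, "null_or_mixed"), (2.8, "positive"),
           (2.6, "positive"), (2.4, "negative"), (2.2, "positive"), (2.0, "positive")]
print(round(activity_evidence_factor(example), 2))  # -> 2.1
```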
What the scores actually revealed
Some findings aligned with what we expected. Cardio training, high-intensity interval training, and traditional practices like being in nature and yoga scored strongly, reflecting substantial research bases built over many years. But other results surprised us. Mantra practice scored much higher than expected, outperforming many other mindfulness practices that get more attention in popular wellness culture. Meanwhile, some more popular trends like decluttering showed weaker evidence than their cultural prominence might suggest. Of course, this doesn't mean these activities aren't valuable for individuals, just that their wellbeing benefits haven't yet been a focus of high-quality academic research.
Current limitations in our system
We will always be transparent about our system's limitations, because understanding them helps users interpret the scores appropriately.
Most significantly, we don't yet account for effect sizes, which measure how big an impact an intervention actually has. A study showing massive improvements gets the same impact weighting as one showing tiny changes, as long as both are statistically significant. This means our scores reflect research quality but not necessarily the magnitude of benefits you might expect.
There's also potential bias toward established, high-impact journals. Innovative research published in newer or specialised venues might be undervalued, even if the methodology is solid. This could particularly affect emerging areas of wellness research where the most exciting work might appear in journals our system doesn't weight as highly.
We focus on research conducted with healthy adult populations (aged 18-85), which means sometimes excluding excellent research simply because it studied people with specific health conditions. Of note, we currently exclude research focusing on treating conditions like clinical depression and anxiety. This ensures broad applicability of our recommendations, but it might miss valuable insights about the real benefits of these activities for challenges that many of us face throughout our lives.
Perhaps most importantly, we can only score research that exists. Some activities have limited evidence not because they don't work, but because there's rarely commercial incentive to fund expensive, large-scale trials on free-to-user practices. Unlike pharmaceutical research, where companies invest millions to prove their products work, wellness research often depends on academic interest and limited grant funding.
Our future vision
What excites us most is where this system is headed. We're working on version 2.0 to address current limitations through better automation, more research papers, more sophisticated use of AI, and integration of effect sizes to distinguish between modest and substantial impacts. We will also likely expand our parameters to include studies targeting prevalent conditions like clinical depression, anxiety and PTSD, which will open up a rich research literature.
But the bigger vision involves moving beyond just academic research. We're starting to collect real user outcome data, tracking how people actually feel before and after different activities. Over time, this will let us validate our research-based scores against real-world results and adjust recommendations where academic findings and lived experience diverge.
We might discover that activities with modest research scores work exceptionally well for certain people, that there are optimal practice levels for different activities, or that a certain combination of activities is key. Our hope is that this combination of academic rigour and real-world validation will create a more nuanced understanding of what works, when, and for whom.
The goal isn't perfect scores or complete certainty. It's building a system that honestly maps what we know, clearly marks what we don't, and evolves as both research and user experience teach us more.
The evidence factors you see in Bearmore represent this ongoing work. They'll continue evolving as new research emerges, our methods improve, and we learn more from people actually using these practices in their daily lives. Because ultimately, the best evidence combines what researchers discover in controlled studies with what people experience in their real, beautifully complicated lives.
We’re confident in our approach, but there’s always room to improve. Feedback makes us better, and we’re not too proud to admit when we’re wrong. If you have feedback, we’d love to hear from you. Get in touch here or join us on Reddit to chat with the community.
Footnote:
Understanding evidence hierarchies
Before diving into our specific approach, it's helpful to understand how researchers typically think about study quality. Different types of studies are generally considered more or less reliable based on how well they can establish cause and effect.
At the top of the hierarchy are systematic reviews and meta-analyses. These don't conduct new research but instead gather all existing studies on a topic and analyse them together. Think of them as "studies of studies" that can spot patterns across multiple investigations and identify where the evidence is strongest.
Next come randomised controlled trials (RCTs), where researchers randomly assign participants to different groups to test a specific intervention. These are considered the gold standard for individual studies because random assignment and control conditions help to ensure that any effects found are due to the intervention, not to other, unaccounted-for factors or placebo effects.
Observational studies sit lower on the hierarchy. These follow people over time or compare different groups without controlling what people do. While they can't prove cause and effect as clearly, they're valuable for understanding how things work in real-world conditions, and because they generally allow for larger and more diverse participant cohorts.
Each type of study has strengths and limitations, and ideally you want multiple types pointing toward the same conclusion. We wanted our scoring system to account for these differences while also considering other factors that reflect research quality.