Skip to content

Conversation

@taranehstrunk
Copy link

Implementation of Isotope Distribution Cache in OpenSWATHWorkflow

We noticed that in this workflow where in DIAScoring and DIAHelper we compute isotope Distributions, a huge part of the runtime is lost in the calculation estimateFromPeptideWeight(). (see the two "towers" in the flame graph below)
dia_before

In order to avoid doing this calculation so often, we implemented a Cache which we save as a member in each DIAScoring and DIAHelper. For this implementation, we used IsotopeDistributionCache.cpp and IsotopeDistributionCache.h. The needed feature was already implemented there, such that we just had to adapt the usage of this isotope_distribution_ member in each file and the corresponding calling functions.
This change in the workflow led to a good improvement in runtime.

Benchmarks

Without caching:
OpenSwathWorkflow took 15:38 m (CPU)
With IsotopDistributionCache:
OpenSwathWorkflow took 13:59 m (CPU)


#include <OpenMS/CHEMISTRY/AASequence.h>
#include <OpenMS/OPENSWATHALGO/DATAACCESS/DataStructures.h>
#include <OpenMS/FILTERING/DATAREDUCTION/IsotopeDistributionCache.h>
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need to include the header here.
Just forward declare should be enough (to save on compile time).
class IsotopeDistributionCache; within the OpenMS namespace.

/// simulate spectrum from AASequence
OPENMS_DLLAPI void simulateSpectrumFromAASequence(const AASequence& aa,
OPENMS_DLLAPI void simulateSpectrumFromAASequence(IsotopeDistributionCache& iso,
const AASequence& aa,
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is probably a bit more intuitive to swap argument 1 and 2... because AASequence is the primary input

#include <OpenMS/OPENSWATHALGO/DATAACCESS/TransitionExperiment.h>

#include <OpenMS/DATASTRUCTURES/DefaultParamHandler.h>
#include <OpenMS/ANALYSIS/OPENSWATH/DIAScoring.h>
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need for this header.
forward declare IsotopeDistributionCache as before

//@}


void precalculateDistributionCache(Size num_begin, Size index);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you document what the function does and what the parameters mean?


void precalculateDistributionCache(Size num_begin, Size index);

void renormalize( TheoreticalIsotopePattern& isotopes, IsotopeDistribution& isotope_dist);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

docs here as well please

const TheoreticalIsotopePattern & getIsotopeDistribution(double mass) const;
const TheoreticalIsotopePattern& getIsotopeDistribution(double mass) ;

const IsotopeDistribution& getIntensity(double mass);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not really an intensity which is returned here, right?
Can you find a better name for the method?
Also document it

/// Vector of pre-calculated isotope distributions for several mass windows
std::vector<TheoreticalIsotopePattern> isotope_distributions_;

std::vector<IsotopeDistribution> distribution_cache_;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you document the four members?


double mass_window_width_;

double intensity_percentage_ ;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the percentage? (document)

CoarseIsotopePatternGenerator solver(nr_isotopes);
TheoreticalIsotopePattern isotopes;
auto d = solver.estimateFromPeptideWeight(product_mz * charge);
auto d = iso.getIntensity(product_mz * charge);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can be const auto& d = .... to avoid the copy?


for (std::size_t i = 0; i < spec.size(); ++i)
{
std::vector<std::pair<double, double> > isotopes;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just as a suggestion for speed (even though this was not part of the PR):
move std::vector<std::pair<double, double> > isotopes; in front of the loop and just clear it inside to safe a lot of allocations.


void IsotopeDistributionCache::renormalize(
OpenMS::IsotopeDistributionCache::TheoreticalIsotopePattern& isotopes,
OpenMS::IsotopeDistribution& isotope_dist)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isotope_dist can be const, right because it is only read from?

auto d = solver.estimateFromPeptideWeight(0.5 * mass_window_width + index * mass_window_width);
auto d = solver.estimateFromPeptideWeight(0.5*mass_window_width_ + index * mass_window_width_);

distribution_cache_[index] = d;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we are storing a copy of d in the cache, and then go on to trim d and compute isotope_distributions_[index] based on it. Is this intended? If so, please document this somewhere

//calculate index in the vector
Size index = static_cast<Size>(std::floor(mass / mass_window_width_));

Size index = Size (std::floor(mass/mass_window_width_));
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need for floor(), since casting to Size will take care of it (for positive masses.. which assume we have)


//Return distribution
return isotope_distributions_[index];
return isotope_distributions_[(int)index];
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need for (int)

//Return the intensity for a mass
const IsotopeDistribution& IsotopeDistributionCache::getIntensity(double mass)
{
Size index = Size (std::floor(mass/mass_window_width_));
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove floor

}

//Return intensity
return distribution_cache_[(int)index];
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove (int)

// Here we deal with SWATH files (can be multiple files)

DIAScoring dia;
auto isotope_cache = dia.getCache();
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you not simply instanciate an IsotopeDistributionCache isotope_cache; here directly?
(remove the DIAScoring.h in this case)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants